KmeansClustering

Versions

v2.0.0

Basic Information

Class Name: KmeansClustering

Title: K-means Clustering

Version: 2.0.0

Author: Joe Jenkins

Organization: OneStream

Creation Date: 2024-10-28

Default Routine Memory Capacity: 2.0 GB

Description

Short Description

Assigns clusters using the K-means clustering algorithm.

Long Description

This routine performs K-means clustering on a dataset, assigns a cluster to each data point, and returns a summary with results and visualizations. The K-means clustering algorithm is an unsupervised machine learning method that partitions a dataset into a predefined number of clusters. Using K-means can help identify patterns in the data and group similar data points, potentially providing more insights than analyzing unclustered data. The algorithm can handle both date and text inputs, which the user can specify within the routine. If the dataset has more than three dimensions, a 3D Principal Component Analysis (PCA) plot can be enabled to illustrate the relationships between clusters in greater detail. At the end of the routine, the user will receive a summary of the clustered data along with visualizations that help interpret the clusters and the patterns identified by the algorithm. The original dataset will also be returned with an additional column containing the cluster number for each data point.
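To make the partitioning step described above concrete, the following is a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm) in plain Python. It is not the routine's implementation: the function names, the seeding scheme, and the toy data are assumptions chosen for illustration, and the routine's actual preprocessing is not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    centroids = rng.sample(points, k)  # start from k sampled data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)     # one cluster number per input point
print(centroids)  # one centroid per cluster
```

The returned labels play the same role as the additional cluster column that the routine appends to the original dataset.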

Use Cases

1. National Retailer Customer Segmentation

A large national retailer has various information on their customers, such as products purchased, product category, price, and date of purchase. K-means clustering can be used to identify similar customers based on their previous purchase history and behavior. After applying the K-means clustering technique, patterns in the various clusters might be identified, such as high-spenders, occasional buyers, and budget-driven customers. By clustering customers into groups, retailers can tailor marketing strategies and create personalized offers for their various customer segments.

2. Product Reviews

Due to the K-means clustering algorithm's ability to handle text, it can help companies analyze customer reviews and feedback. For instance, an online retailer might have data on customer product reviews, such as the review title, text, and a rating from 0-5. By applying K-means clustering to this data, the retailer can segment the reviews and identify patterns within the feedback. It may uncover multiple underlying themes within both positive and negative reviews, allowing the retailer to understand what customers like and dislike about their products at a more granular level than the star rating alone.

3. Sales Territory

A company with a large sales force can use K-means clustering to group geographic regions based on factors like customer density, potential sales value, or buying behavior. This method optimizes sales territory assignments, ensuring each salesperson is responsible for a strategically profitable area. By clustering similar regions, the company prevents overlapping territories, balances workloads across the team, and enhances efficiency, leading to better sales performance and more effective resource allocation.

Routine Methods

1. Init (Constructor)

Method: __init__
- Type: Constructor
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: N/A
- Outputs Dynamic Artifacts: No
- Short Description:
  - K-means Clustering Algorithm
- Detailed Description:
  - The constructor for the K-means Clustering Algorithm routine. This method initializes the routine with the parameters provided by the user.
- Inputs:
  - Required Input
    - Normalize the Data: Whether to normalize the data before running the K-means algorithm. Defaults to False.
      - Name: normalize_data
      - Tooltip:
        
        Detail:
        
        Depending on the size of the dataset, setting this to True might increase run time. The data will be returned in its original form with an added column that includes the cluster number for each data point.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: bool
    - Deterministic: Setting this to True removes randomness from the model. Defaults to True.
      - Name: deterministic_model
      - Tooltip:
        
        Detail:
        
        Deterministic models use a fixed seed for random number generation, which makes results reproducible. Non-deterministic models use a random seed, so results can vary between runs.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: bool
- Artifacts: No artifacts are returned by this method
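
The two constructor inputs can be illustrated with a short sketch. The documentation does not state which normalization method or seed value the routine uses, so the z-score scaling, the seed value 42, and the function names below are assumptions made for illustration only.

```python
import random
import statistics


def zscore_columns(rows):
    """Scale every column to mean 0 and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has standard deviation 0; fall back to 1.0 to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def make_rng(deterministic_model, fixed_seed=42):
    """Return a seeded generator when deterministic, otherwise an unseeded one."""
    if deterministic_model:
        return random.Random(fixed_seed)  # same seed, same sequence on every run
    return random.Random()                # seeded from the operating system


# Hypothetical rows: the second column is 100 times larger than the first.
rows = [(1.0, 100.0), (2.0, 200.0), (3.0, 300.0)]
scaled = zscore_columns(rows)
print(scaled)  # after scaling, both columns cover the same range

first_draw = make_rng(True).random()
second_draw = make_rng(True).random()
print(first_draw == second_draw)  # True: a fixed seed reproduces the same value
```

Without scaling, a column with large values would dominate the distance calculation that K-means relies on, which is the usual reason to enable normalization.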

2. Fit (Method)

Method: fit
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: N/A
- Outputs Dynamic Artifacts: No
- Short Description:
  - K-means Clustering Algorithm Fit
- Detailed Description:
  - Fits the K-means clustering algorithm to the data. The routine returns the original data with an additional column containing the cluster number assigned by the algorithm. The routine also provides a report that describes the clustering results. By default, this report includes an elbow plot and a bar chart showing the distribution of the clusters. If applicable and enabled, the report can also include a 3D Principal Component Analysis (PCA) plot to give the user additional insight into the clusters.
- Inputs:
  - Required Input
    - Source Data Definition: The source data definition.
      - Name: source_data_connection
      - Tooltip:
        
        Detail:
        
        Click the drop-down to specify your dataset source.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If the database you are looking for does not appear in this list, first move the data into a database that is available in this list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
    - Date Column(s): The column(s) containing the date in the dataset. Date values should include a year, month, and day. The default value is None.
      - Name: date_column
      - Tooltip:
        
        Detail:
        
        Specify the date column if applicable in the dataset. If there is not a date column, leave this field blank.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Text Corpus Column(s): The column(s) that contain the text corpus in the dataset. The default value is None.
      - Name: text_column
      - Tooltip:
        
        Detail:
        
        Include only columns containing text corpus fields from the dataset, if applicable. Leave this field blank if there are no text corpus columns. Exclude columns with categorical factors, and only add columns with large text-based values that need text preprocessing.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Column(s) to Ignore in the Analysis: The column(s) to ignore in the K-means clustering algorithm. Examples include IDs and large categorical features that are sparsely populated and add little information to the model. Defaults to None.
      - Name: ignore_columns
      - Tooltip:
        
        Detail:
        
        K-means clustering can be sensitive to large numbers of dummy variables. Exclude columns from the dataset, such as IDs or categorical features with many levels, that are not needed for K-means clustering. Leave this field blank if all columns are required.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Include a 3D PCA Plot: Whether to include a 3D Principal Component Analysis (PCA) plot in the final report. This is recommended if the data has more than three dimensions. Defaults to True.
      - Name: include_pca_plots
      - Tooltip:
        
        Detail:
        
        By default, the final report will include a 2D scatter plot if the data is two-dimensional, or a 2D PCA plot and a 3D scatter plot if the data is three-dimensional. The dimensionality of the data is determined after preprocessing. For example, passing two numeric columns will result in a two-dimensional dataset, while passing a numeric column and a categorical column may result in a higher-dimensional dataset.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: bool
  - Optional Input
    - ID Column: Optional unique ID column. This column will be excluded from modeling, validated (non-null, int-castable, unique), and merged back into outputs to preserve record identity.
      - Name: id_column
      - Tooltip:
        
        Detail:
        
        Pick a single column that uniquely identifies each row. Leave blank if you don't have one.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Optional[str]
    - Number of K-means Clusters: The number of clusters to use in the K-means algorithm. By default, the number of clusters is set to None, so the algorithm will find the optimal number of clusters.
      - Name: number_kmeans_clusters
      - Tooltip:
        
        Detail:
        
        By default, the model will determine the optimal number of clusters for the K-means algorithm. However, if the number of clusters is known or needs to be overridden, enable this field to specify the desired number of clusters.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Optional[int]
- Artifacts:
  - Clustered Data: The original dataset with an additional column containing the assigned clusters from the K-means clustering algorithm.
    - Qualified Key Annotation: clustered_data
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@clustered_data/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - Cluster Report: A report on the results of K-means clustering. The report includes an elbow plot, a distribution of the clusters, a 2D scatter or PCA plot based on the number of dimensions in the data, a 3D scatter plot if the data is three-dimensional, or an optional 3D PCA plot if the data has more than three dimensions.
    - Qualified Key Annotation: cluster_report
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@cluster_report/data_/html_content.html
        
        The html content.
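
The elbow plot in the report, and the automatic choice made when the number of clusters is left as None, can be pictured with the sketch below. It computes the within-cluster sum of squared distances (inertia) for several values of k and picks the k where the curve bends most sharply. The documentation does not say how the routine selects the optimal k, so the second-difference rule, the function names, and the toy data are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, seed=0, max_iter=100):
    """Run a basic K-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # centroids no longer move
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_k(inertias):
    """Pick the k with the largest second difference in a {k: inertia} mapping."""
    ks = sorted(inertias)
    best_k, best_bend = ks[0], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = inertias[prev_k] - 2 * inertias[k] + inertias[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
chosen_k = elbow_k(inertias)
print(inertias)
print(chosen_k)
```

Plotting the values in `inertias` against k gives the elbow plot: inertia drops steeply until k matches the natural grouping in the data and flattens afterwards.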

3. Predict (Method)

Method: predict
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: N/A
- Outputs Dynamic Artifacts: No
- Short Description:
  - K-means Clustering Algorithm Predict
- Detailed Description:
  - Uses the previously fitted model to assign clusters to new data. The input data must contain the same columns as the data used to fit the K-means clustering algorithm. Each row is assigned to a cluster based on the values in its input columns. The method returns a dataframe with the assigned clusters and a report on the results of the K-means clustering.
- Inputs:
  - Required Input
    - Source Data Definition: The source data definition.
      - Name: source_data_connection
      - Tooltip:
        
        Detail:
        
        Click the drop-down to specify your dataset source.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If the database you are looking for does not appear in this list, first move the data into a database that is available in this list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
    - Date Column(s): The column(s) containing the date in the dataset. Date values should include a year, month, and day. The default value is None.
      - Name: date_column
      - Tooltip:
        
        Detail:
        
        Specify the date column if applicable in the dataset. If there is not a date column, leave this field blank.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Text Corpus Column(s): The column(s) that contain the text corpus in the dataset. The default value is None.
      - Name: text_column
      - Tooltip:
        
        Detail:
        
        Include only columns containing text corpus fields from the dataset, if applicable. Leave this field blank if there are no text corpus columns. Exclude columns with categorical factors, and only add columns with large text-based values that need text preprocessing.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Column(s) to Ignore in the Analysis: The column(s) to ignore in the K-means clustering algorithm. Examples include IDs and large categorical features that are sparsely populated and add little information to the model. Defaults to None.
      - Name: ignore_columns
      - Tooltip:
        
        Detail:
        
        K-means clustering can be sensitive to large numbers of dummy variables. Exclude columns from the dataset, such as IDs or categorical features with many levels, that are not needed for K-means clustering. Leave this field blank if all columns are required.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Include a 3D PCA Plot: Whether to include a 3D Principal Component Analysis (PCA) plot in the final report. This is recommended if the data has more than three dimensions. Defaults to True.
      - Name: include_pca_plots
      - Tooltip:
        
        Detail:
        
        By default, the final report will include a 2D scatter plot if the data is two-dimensional, or a 2D PCA plot and a 3D scatter plot if the data is three-dimensional. The dimensionality of the data is determined after preprocessing. For example, passing two numeric columns will result in a two-dimensional dataset, while passing a numeric column and a categorical column may result in a higher-dimensional dataset.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: bool
  - Optional Input
    - ID Column: Optional unique ID column in the prediction dataset. Must be non-null, int-castable, and unique. It is excluded from modeling and merged back into outputs.
      - Name: id_column
      - Tooltip:
        
        Detail:
        
        Pick a single column that uniquely identifies each row. Leave blank if you don't have one.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Optional[str]
- Artifacts:
  - Predicted Data: The uploaded dataset with an additional column containing the assigned clusters from the pre-fitted K-means clustering algorithm.
    - Qualified Key Annotation: predicted_clusters
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@predicted_clusters/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - Predicted Report: A report on the results of K-means clustering. The report includes an elbow plot, a distribution of the clusters, a 2D scatter or PCA plot based on the number of dimensions in the data, a 3D scatter plot if the data is three-dimensional, or an optional 3D PCA plot if the data has more than three dimensions.
    - Qualified Key Annotation: predicted_report
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@predicted_report/data_/html_content.html
        
        The html content.
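
Conceptually, prediction with a fitted K-means model assigns each new row to the nearest centroid found during fitting, after confirming that the new data has the columns the model was fitted on. The sketch below shows that idea in plain Python; the function names, column names, and centroid values are hypothetical, and the routine's own preprocessing of date, text, and categorical columns is not shown.

```python
def check_same_columns(fit_columns, predict_columns):
    """Raise if the prediction data lacks a column that was used for fitting."""
    missing = [name for name in fit_columns if name not in predict_columns]
    if missing:
        raise ValueError(f"Prediction data is missing columns: {missing}")


def assign_clusters(rows, centroids):
    """Return, for each row, the index of the nearest fitted centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda c: sq_dist(row, centroids[c]))
        for row in rows
    ]


# Hypothetical fitted centroids and new rows with the same two numeric columns.
fit_columns = ["spend", "visits"]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_rows = [(1.0, 1.0), (9.0, 8.0)]

check_same_columns(fit_columns, ["spend", "visits"])
predicted = assign_clusters(new_rows, centroids)
print(predicted)  # [0, 1]
```

The centroids are not recomputed during prediction, so the cluster numbers keep the meaning they had when the model was fitted.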

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: KmeansClustering

Method Name	Artifact Keys
`__init__`	N/A
`fit`	clustered_data, cluster_report
`predict`	predicted_clusters, predicted_report
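
When reading the partitioned parquet artifacts listed for `fit` and `predict`, it can help to estimate how many files to expect. The sketch below derives a lower bound from the documented limit of 1000000 rows per file and fills in the documented path pattern. The helper names are hypothetical, and the starting value of the `<int>` index is an assumption; the documentation does not state it.

```python
MAX_ROWS_PER_FILE = 1_000_000  # documented upper bound on rows per parquet file


def min_partition_count(total_rows):
    """Smallest number of files that can hold `total_rows` under the row limit."""
    return -(-total_rows // MAX_ROWS_PER_FILE)  # integer ceiling division


def partition_path(artifact_key, index):
    """Fill in the documented pattern artifacts_/@<key>/data_/data_<int>.parquet."""
    return f"artifacts_/@{artifact_key}/data_/data_{index}.parquet"


# Assumption: indices start at 0. The actual numbering is not documented.
file_count = min_partition_count(2_500_000)
paths = [partition_path("clustered_data", i) for i in range(file_count)]
print(file_count)
print(paths)
```

The routine may write more files than this lower bound, since the documentation only guarantees a maximum number of rows per file.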
