KmeansClustering
Versions
v2.0.0
Basic Information
Class Name: KmeansClustering
Title: K-means Clustering
Version: 2.0.0
Author: Joe Jenkins
Organization: OneStream
Creation Date: 2024-10-28
Default Routine Memory Capacity: 2.0 GB
Tags
Clustering, Unsupervised, Model, Data Transformation, Data Analysis, Data Visualization
Description
Short Description
Assigns clusters using the K-means clustering algorithm.
Long Description
This routine performs K-means clustering on a dataset, assigns a cluster to each data point, and returns a summary with results and visualizations. The K-means clustering algorithm is an unsupervised machine learning method that partitions a dataset into a predefined number of clusters. Using K-means can help identify patterns in the data and group similar data points, potentially providing more insights than analyzing unclustered data. The algorithm can handle both date and text inputs, which the user can specify within the routine. If the dataset has more than three dimensions, a 3D Principal Component Analysis (PCA) plot can be enabled to illustrate the relationships between clusters in greater detail. At the end of the routine, the user will receive a summary of the clustered data along with visualizations that help interpret the clusters and the patterns identified by the algorithm. The original dataset will also be returned with an additional column containing the cluster number for each data point.
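As a rough illustration of what partitioning a dataset into a predefined number of clusters means, the sketch below runs the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python and appends the cluster number as an extra column. It is not the routine's implementation; the function name `kmeans`, the toy rows, and the seed are assumptions made for the example.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical toy data: two visibly separate groups of (x, y) rows.
rows = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
labels = kmeans(rows, k=2)
# Mirror the routine's output shape: original columns plus a cluster column.
clustered = [row + (label,) for row, label in zip(rows, labels)]
```

Which group receives which cluster number depends on the starting centroids, so the numbers are identifiers only and carry no ordering.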
Use Cases
1. National Retailer Customer Segmentation
A large national retailer has various information on their customers, such as products purchased, product category, price, and date of purchase. K-means clustering can be used to identify similar customers based on their previous purchase history and behavior. After applying the K-means clustering technique, patterns in the various clusters might be identified, such as high-spenders, occasional buyers, and budget-driven customers. By clustering customers into groups, retailers can tailor marketing strategies and create personalized offers for their various customer segments.
2. Product Reviews
Due to the K-means clustering algorithm's ability to handle text, it can help companies analyze customer reviews and feedback. For instance, an online retailer might have data on customer product reviews, such as the review title, text, and a rating from 0-5. By applying K-means clustering to this data, the retailer can segment the reviews and identify patterns within the feedback. It may uncover multiple underlying themes within both positive and negative reviews, allowing the retailer to understand what customers like and dislike about their products at a more granular level than the star rating alone.
3. Sales Territory
A company with a large sales force can use K-means clustering to group geographic regions based on factors like customer density, potential sales value, or buying behavior. This method optimizes sales territory assignments, ensuring each salesperson is responsible for a strategically profitable area. By clustering similar regions, the company prevents overlapping territories, balances workloads across the team, and enhances efficiency, leading to better sales performance and more effective resource allocation.
Routine Methods
1. Init (Constructor)
- Method: __init__
  - Type: Constructor
  - Allow In-Memory Execution: No
  - Read Only: No
  - Method Limits: N/A
  - Outputs Dynamic Artifacts: No
  - Short Description:
    - K-means Clustering Algorithm
  - Detailed Description:
    - The constructor for the K-means Clustering Algorithm routine. This method initializes the routine with the parameters provided by the user.
  - Inputs:
    - Required Input
      - Normalize the Data: Whether to normalize the data before running the K-means algorithm. Defaults to False.
        - Name: normalize_data
        - Tooltip:
          - Detail:
            - Depending on the size of the dataset, setting this to True might increase run time. The data will be returned in its original form with an added column that includes the cluster number for each data point.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: bool
      - Deterministic: Setting this to True removes randomness in the model. Defaults to True.
        - Name: deterministic_model
        - Tooltip:
          - Detail:
            - Deterministic models use a fixed seed for random number generation, which makes results reproducible. Non-deterministic models use a random seed, so results can differ between runs.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: bool
  - Artifacts: No artifacts are returned by this method
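To make the two constructor flags concrete, here is a small sketch of what they typically control. It assumes z-score standardization for `normalize_data` and a fixed random seed for `deterministic_model`; the routine's actual scaling method and seed are not documented here, and the helper names (`standardize`, `pick_initial_centroids`) and the seed value 42 are hypothetical.

```python
import random
import statistics


def standardize(rows):
    """Z-score each column (assumed scaling): subtract the mean, divide by the std dev."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def pick_initial_centroids(rows, k, deterministic=True, seed=42):
    """Choose k starting centroids; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed) if deterministic else random.Random()
    return rng.sample(rows, k)


# Hypothetical rows: annual spend (large scale) and visits per month (small scale).
rows = [(12000.0, 2.0), (15000.0, 3.0), (900.0, 12.0), (1100.0, 10.0)]
scaled = standardize(rows)

# With deterministic=True, repeated runs start from the same centroids.
first = pick_initial_centroids(scaled, k=2)
second = pick_initial_centroids(scaled, k=2)
```

Without scaling, the spend column would dominate the Euclidean distances simply because its values are larger, which is the usual motivation for normalizing before K-means.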
2. Fit (Method)
- Method: fit
  - Type: Method
  - Memory Capacity: 2.0 GB
  - Allow In-Memory Execution: No
  - Read Only: No
  - Method Limits: N/A
  - Outputs Dynamic Artifacts: No
  - Short Description:
    - K-means Clustering Algorithm Fit
  - Detailed Description:
    - This method fits the K-means clustering algorithm to the data. The routine returns the original data with an additional column containing the cluster number assigned by the K-means clustering algorithm. In addition, the routine provides a report that describes the clustering results. By default, this report includes an elbow plot and a bar chart of the cluster distribution. If applicable and enabled, the report can also include a 3D Principal Component Analysis (PCA) graph to give the user additional insight into the clusters.
  - Inputs:
    - Required Input
      - Source Data Definition: The source data definition.
        - Name: source_data_connection
        - Tooltip:
          - Detail:
            - Click on the drop down to specify your dataset source.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: Must be an instance of Tabular Connection
        - Nested Model: Tabular Connection
          - Required Input
            - Connection: The connection type to use to access the source data.
              - Name: tabular_connection
              - Tooltip:
                - Validation Constraints:
                  - This input may be subject to other validation constraints at runtime.
              - Type: Must be one of the following
                - SQL Server Connection
                  - Required Input
                    - Database Resource: The name of the database resource to connect to.
                      - Name: database_resource
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                    - Database Name: The name of the database to connect to.
                      - Name: database_name
                      - Tooltip:
                        - Detail:
                          - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                    - Table Name: The name of the table to use.
                      - Name: table_name
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                - MetaFileSystem Connection
                  - Required Input
                    - Connection Key: The MetaFileSystem connection key.
                      - Name: connection_key
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: MetaFileSystemConnectionKey
                    - File Path: The full file path to the file to ingest.
                      - Name: file_path
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                - Partitioned MetaFileSystem Connection
                  - Required Input
                    - Connection Key: The MetaFileSystem connection key.
                      - Name: connection_key
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: MetaFileSystemConnectionKey
                    - File Type: The type of files to read from the directory.
                      - Name: file_type
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: FileExtensions_
                    - Directory Path: The full directory path containing partitioned tabular files.
                      - Name: directory_path
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
      - Date Column(s): The column(s) containing the date in the dataset. Date values should include a year, month, and day. The default value is None.
        - Name: date_column
        - Tooltip:
          - Detail:
            - Specify the date column if applicable in the dataset. If there is not a date column, leave this field blank.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: list[str]
      - Text Corpus Column(s): The column(s) that contain the text corpus in the dataset. The default value is None.
        - Name: text_column
        - Tooltip:
          - Detail:
            - Include only columns containing text corpus fields from the dataset, if applicable. Leave this field blank if there are no text corpus columns. Exclude columns with categorical factors, and only add columns with large text-based values that need text preprocessing.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: list[str]
      - Column(s) to Ignore in the Analysis: The column(s) to ignore in the K-means clustering algorithm. Examples include IDs and large categorical features that are sparsely populated and add little information to the model. Defaults to None.
        - Name: ignore_columns
        - Tooltip:
          - Detail:
            - K-means clustering can be sensitive to large numbers of dummy variables. Exclude columns from the dataset, such as IDs or categorical features with many levels, that are not needed for K-means clustering. Leave this field blank if all columns are required.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: list[str]
      - Include a 3D PCA Plot: Whether to include a 3D Principal Component Analysis (PCA) plot in the final report. This is recommended if the data has more than three dimensions. Defaults to True.
        - Name: include_pca_plots
        - Tooltip:
          - Detail:
            - By default, the final report will include a 2D scatter plot if the data is two-dimensional, or a 2D PCA plot and a 3D scatter plot if the data is three-dimensional. The dimensionality of the data is determined after preprocessing. For example, passing two numeric columns will result in a two-dimensional dataset, while passing a numeric column and a categorical column may result in a higher-dimensional dataset.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: bool
    - Optional Input
      - ID Column: Optional unique ID column. This column will be excluded from modeling, validated (non-null, int-castable, unique), and merged back into outputs to preserve record identity.
        - Name: id_column
        - Tooltip:
          - Detail:
            - Pick a single column that uniquely identifies each row. Leave blank if you don't have one.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: Optional[str]
      - Number of K-means Clusters: The number of clusters to use in the K-means algorithm. By default, the number of clusters is set to None, so the algorithm will find the optimal number of clusters.
        - Name: number_kmeans_clusters
        - Tooltip:
          - Detail:
            - By default, the model will determine the optimal number of clusters for the K-means algorithm. However, if the number of clusters is known or needs to be overridden, enable this field to specify the desired number of clusters.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: Optional[int]
  - Artifacts:
    - Clustered Data: The original dataset with an additional column containing the assigned clusters from the K-means clustering algorithm.
      - Qualified Key Annotation: clustered_data
      - Aggregate Artifact: False
      - In-Memory Json Accessible: False
      - File Annotations:
        - artifacts_/@clustered_data/data_/data_<int>.parquet
          - A partitioned set of parquet files where each file will have no more than 1000000 rows.
    - Cluster Report: A report on the results of K-means clustering. The report includes an elbow plot, a distribution of the clusters, a 2D scatter or PCA plot based on the number of dimensions in the data, a 3D scatter plot if the data is three-dimensional, or an optional 3D PCA plot if the data has more than three dimensions.
      - Qualified Key Annotation: cluster_report
      - Aggregate Artifact: False
      - In-Memory Json Accessible: False
      - File Annotations:
        - artifacts_/@cluster_report/data_/html_content.html
          - The html content.
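The report layout described above depends on how many dimensions the data has after preprocessing. The sketch below is one reading of that logic, not the routine's code: it assumes categorical columns are one-hot encoded (the `ignore_columns` tooltip mentions dummy variables, but the exact encoding is not specified), and the helper names `count_dimensions` and `report_plots` are hypothetical.

```python
def count_dimensions(rows, categorical_columns):
    """Count feature dimensions, assuming one dummy variable per category level."""
    dims = 0
    for column in rows[0]:
        if column in categorical_columns:
            dims += len({row[column] for row in rows})  # one dimension per level
        else:
            dims += 1  # a numeric column contributes a single dimension
    return dims


def report_plots(n_dims, include_pca_plots=True):
    """List the plots the report would contain for a given dimensionality."""
    plots = ["elbow plot", "cluster distribution"]
    if n_dims == 2:
        plots.append("2D scatter plot")
    elif n_dims == 3:
        plots += ["2D PCA plot", "3D scatter plot"]
    elif n_dims > 3:
        plots.append("2D PCA plot")
        if include_pca_plots:
            plots.append("3D PCA plot")  # optional, controlled by the flag
    return plots


# Hypothetical dataset: one numeric column and one categorical column with three levels.
rows = [
    {"spend": 120.0, "region": "North"},
    {"spend": 80.0, "region": "South"},
    {"spend": 95.0, "region": "West"},
]
n_dims = count_dimensions(rows, categorical_columns={"region"})
plots = report_plots(n_dims)
```

Under these assumptions the two input columns expand to four dimensions, which is why a numeric column plus a categorical column can land in the "more than three dimensions" case.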
3. Predict (Method)
- Method: predict
  - Type: Method
  - Memory Capacity: 2.0 GB
  - Allow In-Memory Execution: No
  - Read Only: No
  - Method Limits: N/A
  - Outputs Dynamic Artifacts: No
  - Short Description:
    - K-means Clustering Algorithm Predict
  - Detailed Description:
    - Uses the previously fitted model to assign clusters to new data. The input data must contain the same columns as the data used to fit the K-means clustering algorithm. The routine assigns a cluster to each row based on the values in those columns. The method returns a dataframe with the assigned clusters and a report on the results of the K-means clustering.
  - Inputs:
    - Required Input
      - Source Data Definition: The source data definition.
        - Name: source_data_connection
        - Tooltip:
          - Detail:
            - Click on the drop down to specify your dataset source.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: Must be an instance of Tabular Connection
        - Nested Model: Tabular Connection
          - Required Input
            - Connection: The connection type to use to access the source data.
              - Name: tabular_connection
              - Tooltip:
                - Validation Constraints:
                  - This input may be subject to other validation constraints at runtime.
              - Type: Must be one of the following
                - SQL Server Connection
                  - Required Input
                    - Database Resource: The name of the database resource to connect to.
                      - Name: database_resource
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                    - Database Name: The name of the database to connect to.
                      - Name: database_name
                      - Tooltip:
                        - Detail:
                          - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                    - Table Name: The name of the table to use.
                      - Name: table_name
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                - MetaFileSystem Connection
                  - Required Input
                    - Connection Key: The MetaFileSystem connection key.
                      - Name: connection_key
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: MetaFileSystemConnectionKey
                    - File Path: The full file path to the file to ingest.
                      - Name: file_path
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
                - Partitioned MetaFileSystem Connection
                  - Required Input
                    - Connection Key: The MetaFileSystem connection key.
                      - Name: connection_key
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: MetaFileSystemConnectionKey
                    - File Type: The type of files to read from the directory.
                      - Name: file_type
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: FileExtensions_
                    - Directory Path: The full directory path containing partitioned tabular files.
                      - Name: directory_path
                      - Tooltip:
                        - Validation Constraints:
                          - This input may be subject to other validation constraints at runtime.
                      - Type: str
      - Date Column(s): The column(s) containing the date in the dataset. Date values should include a year, month, and day. The default value is None.
        - Name: date_column
        - Tooltip:
          - Detail:
            - Specify the date column if applicable in the dataset. If there is not a date column, leave this field blank.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: list[str]
      - Text Corpus Column(s): The column(s) that contain the text corpus in the dataset. The default value is None.
        - Name: text_column
        - Tooltip:
          - Detail:
            - Include only columns containing text corpus fields from the dataset, if applicable. Leave this field blank if there are no text corpus columns. Exclude columns with categorical factors, and only add columns with large text-based values that need text preprocessing.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: list[str]
      - Column(s) to Ignore in the Analysis: The column(s) to ignore in the K-means clustering algorithm. Examples include IDs and large categorical features that are sparsely populated and add little information to the model. Defaults to None.
        - Name: ignore_columns
        - Tooltip:
          - Detail:
            - K-means clustering can be sensitive to large numbers of dummy variables. Exclude columns from the dataset, such as IDs or categorical features with many levels, that are not needed for K-means clustering. Leave this field blank if all columns are required.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: list[str]
      - Include a 3D PCA Plot: Whether to include a 3D Principal Component Analysis (PCA) plot in the final report. This is recommended if the data has more than three dimensions. Defaults to True.
        - Name: include_pca_plots
        - Tooltip:
          - Detail:
            - By default, the final report will include a 2D scatter plot if the data is two-dimensional, or a 2D PCA plot and a 3D scatter plot if the data is three-dimensional. The dimensionality of the data is determined after preprocessing. For example, passing two numeric columns will result in a two-dimensional dataset, while passing a numeric column and a categorical column may result in a higher-dimensional dataset.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: bool
    - Optional Input
      - ID Column: Optional unique ID column in the prediction dataset. Must be non-null, int-castable, and unique. It is excluded from modeling and merged back into outputs.
        - Name: id_column
        - Tooltip:
          - Detail:
            - Pick a single column that uniquely identifies each row. Leave blank if you don't have one.
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
        - Type: Optional[str]
  - Artifacts:
    - Predicted Data: The uploaded dataset with an additional column containing the assigned clusters from the pre-fitted K-means clustering algorithm.
      - Qualified Key Annotation: predicted_clusters
      - Aggregate Artifact: False
      - In-Memory Json Accessible: False
      - File Annotations:
        - artifacts_/@predicted_clusters/data_/data_<int>.parquet
          - A partitioned set of parquet files where each file will have no more than 1000000 rows.
    - Predicted Report: A report on the results of K-means clustering. The report includes an elbow plot, a distribution of the clusters, a 2D scatter or PCA plot based on the number of dimensions in the data, a 3D scatter plot if the data is three-dimensional, or an optional 3D PCA plot if the data has more than three dimensions.
      - Qualified Key Annotation: predicted_report
      - Aggregate Artifact: False
      - In-Memory Json Accessible: False
      - File Annotations:
        - artifacts_/@predicted_report/data_/html_content.html
          - The html content.
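Conceptually, predicting with a fitted K-means model means assigning each new row to the nearest centroid learned during `fit`, which is why the new data needs the same columns. The following sketch illustrates that idea together with the ID-column checks listed above (non-null, int-castable, unique). The centroid values, column names, output column name `cluster`, and helper names are made up for the example and do not reflect the routine's internals.

```python
def validate_id_column(rows, id_column):
    """Apply the documented ID checks: non-null, int-castable, unique."""
    ids = []
    for row in rows:
        value = row.get(id_column)
        if value is None:
            raise ValueError(f"{id_column} contains a null value")
        ids.append(int(value))  # raises ValueError if the value is not int-castable
    if len(set(ids)) != len(ids):
        raise ValueError(f"{id_column} contains duplicate values")


def predict(rows, centroids, feature_columns, id_column=None):
    """Assign each row to its nearest centroid; the ID column is not a feature."""
    if id_column is not None:
        validate_id_column(rows, id_column)
    results = []
    for row in rows:
        missing = [c for c in feature_columns if c not in row]
        if missing:
            raise KeyError(f"columns missing from prediction data: {missing}")
        point = [row[c] for c in feature_columns]
        # Nearest centroid by squared Euclidean distance.
        cluster = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
        )
        # Keep every original column (including the ID) and add the cluster.
        results.append({**row, "cluster": cluster})
    return results


# Hypothetical fitted centroids over (spend, visits) and two new rows to score.
centroids = [(100.0, 2.0), (900.0, 10.0)]
new_rows = [
    {"customer_id": "101", "spend": 120.0, "visits": 3.0},
    {"customer_id": "102", "spend": 850.0, "visits": 9.0},
]
predicted = predict(new_rows, centroids, ["spend", "visits"], id_column="customer_id")
```

Because the ID is validated but never used in the distance calculation, it only serves to trace each output row back to its input record.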
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: KmeansClustering
| Method Name | Artifact Keys |
|---|---|
| __init__ | N/A |
| fit | clustered_data, cluster_report |
| predict | predicted_clusters, predicted_report |