KMeansClusteringAnalysis
Versions
v0.1.0
Basic Information
Class Name: KMeansClusteringAnalysis
Title: Advanced K-Means
Version: 0.1.0
Author: Clustering Analytics Team
Organization: OneStream
Creation Date: 2025-08-18
Default Routine Memory Capacity: 2 GB
Tags
Clustering, Classification, Data Analysis, Data Visualization, Pattern Recognition
Description
Short Description
Cluster data points into distinct groups based on feature similarity using the K-Means++ clustering algorithm.
Long Description
This routine performs K-Means++ clustering on your dataset. Unlike standard K-Means, which randomly selects initial cluster centroids, K-Means++ uses a smarter initialization strategy that probabilistically selects initial centroids that are far apart from each other. This intelligent initialization leads to faster convergence, more consistent clustering results, and reduced sensitivity to the initial centroid placement. It helps identify groupings within your data based on feature similarity, enabling deeper insights and informed decision-making. You can customize the number of clusters, the clustering dimensions, feature dimensions and weighting, and the clustering algorithm used. This offers a flexible and powerful way to analyze your data.
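The K-Means++ seeding strategy described above can be sketched in a few lines of Python (a simplified illustration of the initialization step only, not this routine's actual implementation):

```python
import numpy as np

def kmeans_pp_init(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """K-Means++ seeding: the first centroid is chosen uniformly at random;
    each subsequent centroid is chosen with probability proportional to the
    squared distance to its nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        chosen = np.asarray(centroids)
        # squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)
```

Because far-apart points receive higher selection probability, the seeds tend to land in distinct regions of the data, which is what drives the faster convergence and more consistent results noted above.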
Use Cases
1. Performance-Based Clustering
A customer may choose to use this routine to cluster their data based on performance metrics such as sales figures, customer satisfaction scores, or operational efficiency indicators. By grouping similar performance profiles, the customer can identify high-performing segments, target underperforming areas for improvement, and optimize operational strategies. The resulting clusters can, for example, be used to find similarities amongst the high-performing or low-performing clusters to draw insights into what factors contribute to success or failure.
2. Attribute-Based Clustering
A customer may choose to use this routine to cluster their data based on specific attributes such as demographics, product features, or customer behaviors. By grouping similar attributes, the customer can identify patterns and trends within their data, enabling targeted marketing strategies, product development, and customer segmentation efforts. The resulting clusters can, for example, be used to find similarities amongst the different attribute clusters to draw insights into what attributes are most common in each cluster and how they may be causally related to performance outcomes.
Routine Methods
1. Init (Constructor)
- Method: __init__
- Type: Constructor
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: Yes
- Read Only: No
- Method Limits: N/A
- Outputs Dynamic Artifacts: No
- Short Description: Initializes the ClusteringAnalysis routine with the provided API and parameters.
- Detailed Description: This constructor sets up an instance of the clustering analysis routine with its configuration: the clustering algorithm to use, the number of clusters, and whether the algorithm is deterministic. This configuration is fixed for the lifetime of the routine and cannot be changed afterward; to use a different configuration, create a new instance of the routine.
Inputs:
All inputs below may be subject to other validation constraints at runtime.
- Deterministic Model Configuration (Required)
  - Name: deterministic_model
  - Description: Whether or not to use a deterministic clustering algorithm for this analysis.
  - Tooltip: Please define if the clustering algorithm is to be deterministic for this analysis.
  - Type: bool
- KMeans Hyperparameters (Required)
  - Name: n_clusters
  - Description: The number of clusters to use for this analysis.
  - Tooltip: Please select the number of clusters to use for this analysis.
  - Type: str
- Model Configuration (Required)
  - Name: clustering_algorithm_name
  - Description: The name of the clustering algorithm to use for this analysis.
  - Tooltip: The clustering algorithm to use for this analysis.
  - Type: Literal
Artifacts: No artifacts are returned by this method.
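To make the constructor's contract concrete, the three inputs can be pictured as an immutable configuration object (an illustrative sketch only: the dataclass and the literal algorithm value "kmeans++" are hypothetical, while the field names and types come from the input list above):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)  # frozen: the configuration cannot change for the routine's lifetime
class KMeansInitParams:
    deterministic_model: bool                        # use a deterministic algorithm?
    n_clusters: str                                  # number of clusters, supplied as a string
    clustering_algorithm_name: Literal["kmeans++"]   # hypothetical literal value

# Example configuration for a five-cluster deterministic run.
params = KMeansInitParams(
    deterministic_model=True,
    n_clusters="5",
    clustering_algorithm_name="kmeans++",
)
```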
2. Fit (Method)
- Method: fit
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: During scale testing this method performed with datasets up to 900,000 rows and 10 feature columns without issues. Larger datasets may cause a timeout error depending on system resources and execution environment.
- Outputs Dynamic Artifacts: No
- Short Description: Fits the clustering analysis model to the provided parameters.
- Detailed Description: This method takes the parameters provided by the user and fits the clustering analysis model to them, including the clustering dimensions, feature dimensions, and related settings. The user can specify the number of clusters, the clustering algorithm, and the feature weighting method to use for the analysis.
Inputs:
All inputs below may be subject to other validation constraints at runtime.
- Clustering Data Input (Required)
  - Name: clustering_data_input
  - Description: The data input configuration for the clustering analysis.
  - Type: Must be an instance of Clustering Data Configuration
  - Nested Model: Clustering Data Configuration
    - Source Data Definition (Required)
      - Name: source_data_definition
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        - Connection (Required)
          - Name: tabular_connection
          - Description: The connection type to use to access the source data.
          - Type: Must be one of the following connection types:
            - SQL Server Connection
              - Database Resource (Required): database_resource (str). The name of the database resource to connect to.
              - Database Name (Required): database_name (str). The name of the database to connect to. Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
              - Table Name (Required): table_name (str). The name of the table to use.
            - MetaFileSystem Connection
              - Connection Key (Required): connection_key (MetaFileSystemConnectionKey). The MetaFileSystem connection key.
              - File Path (Required): file_path (str). The full file path to the file to ingest.
            - Partitioned MetaFileSystem Connection
              - Connection Key (Required): connection_key (MetaFileSystemConnectionKey). The MetaFileSystem connection key.
              - File Type (Required): file_type (FileExtensions_). The type of files to read from the directory.
              - Directory Path (Required): directory_path (str). The full directory path containing partitioned tabular files.
    - Clustering Dimensions (Required)
      - Name: clustering_dimensions
      - Description: The unique combination of column values that define the “entity” that you are trying to compare to others.
      - Type: list[str]
    - Feature Columns (Required)
      - Name: feature_columns
      - Description: Columns that you want to use to calculate the cluster segments.
      - Type: list[str]
Artifacts:
- Clustering Intersection Results
  - Description: Parquet file containing data about the clustering intersections and which cluster they belong to.
  - Qualified Key Annotation: cluster_intersection
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@cluster_intersection/data_/data_<int>.parquet (a partitioned set of parquet files where each file will have no more than 1,000,000 rows)
- Clustering Descriptions
  - Description: Parquet file containing data about the clusters created by the clustering fit method.
  - Qualified Key Annotation: cluster_descriptions
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@cluster_descriptions/data_/data_<int>.parquet (a partitioned set of parquet files where each file will have no more than 1,000,000 rows)
- Data Utilized
  - Description: Parquet file containing the data utilized in the clustering fit method.
  - Qualified Key Annotation: data_utilized
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@data_utilized/data_/data_<int>.parquet (a partitioned set of parquet files where each file will have no more than 1,000,000 rows)
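The intersection results and cluster descriptions are complementary: the former assigns each clustering-dimension intersection to a cluster, while the latter summarizes each cluster. A minimal sketch of how such a per-cluster summary can be derived from cluster assignments (illustrative only; the routine writes these artifacts as parquet files):

```python
import numpy as np

def describe_clusters(features: np.ndarray, labels: np.ndarray) -> dict:
    """Summarize each cluster: member count and per-feature mean."""
    summary = {}
    for c in np.unique(labels):
        members = features[labels == c]
        summary[int(c)] = {"size": len(members), "mean": members.mean(axis=0)}
    return summary
```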
3. Predict (Method)
- Method: predict
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: During scale testing this method performed with datasets up to 3,000,000 rows and 10 feature columns without issues. Larger datasets may cause a timeout error depending on system resources and execution environment.
- Outputs Dynamic Artifacts: No
- Short Description: Assigns clusters to new data based on the fitted clustering analysis model.
- Detailed Description: This method takes the parameters provided by the user and assigns clusters to new data based on the fitted clustering analysis model. The user must provide a data source that contains the same clustering dimensions and feature dimensions as the data used to fit the model.
Inputs:
All inputs below may be subject to other validation constraints at runtime.
- Prediction Datasource (Required)
  - Name: datasource
  - Description: Select the datasource containing observations to assign to clusters.
  - Type: Must be an instance of Tabular Connection
  - Nested Model: Tabular Connection
    - Connection (Required)
      - Name: tabular_connection
      - Description: The connection type to use to access the source data.
      - Type: Must be one of the following connection types:
        - SQL Server Connection
          - Database Resource (Required): database_resource (str). The name of the database resource to connect to.
          - Database Name (Required): database_name (str). The name of the database to connect to. Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
          - Table Name (Required): table_name (str). The name of the table to use.
        - MetaFileSystem Connection
          - Connection Key (Required): connection_key (MetaFileSystemConnectionKey). The MetaFileSystem connection key.
          - File Path (Required): file_path (str). The full file path to the file to ingest.
        - Partitioned MetaFileSystem Connection
          - Connection Key (Required): connection_key (MetaFileSystemConnectionKey). The MetaFileSystem connection key.
          - File Type (Required): file_type (FileExtensions_). The type of files to read from the directory.
          - Directory Path (Required): directory_path (str). The full directory path containing partitioned tabular files.
Artifacts:
- Clustering Intersection Results
  - Description: Parquet file containing data about the clustering intersections and which cluster they belong to.
  - Qualified Key Annotation: cluster_intersection
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@cluster_intersection/data_/data_<int>.parquet (a partitioned set of parquet files where each file will have no more than 1,000,000 rows)
- Data Utilized
  - Description: Parquet file containing the data utilized in the clustering predict method.
  - Qualified Key Annotation: data_utilized
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@data_utilized/data_/data_<int>.parquet (a partitioned set of parquet files where each file will have no more than 1,000,000 rows)
Interface Definitions
1. Clustering Analysis Interface
An interface class requiring fit and predict methods to be implemented.
This BaseRoutineInterface class enforces a common interface for all clustering routines. The interface requires each clustering routine to implement a fit method and a predict method with the same input parameters. Each concrete class has its own constructor where hyperparameters specific to its clustering algorithm may be set; however, this interface does not enforce any specific constructor method.
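The contract described above corresponds to a standard abstract-base-class pattern; a minimal sketch (parameter types simplified to Any; method and parameter names taken from the method definitions in this document):

```python
from abc import ABC, abstractmethod
from typing import Any

class ClusteringAnalysisInterface(ABC):
    """Common interface: every clustering routine must implement fit and predict.
    Constructors are intentionally unconstrained so each concrete class can
    expose its own algorithm-specific hyperparameters."""

    @abstractmethod
    def fit(self, clustering_data_input: Any) -> Any: ...

    @abstractmethod
    def predict(self, datasource: Any) -> Any: ...
```

A subclass that omits either method cannot be instantiated, which is how the interface is enforced at runtime.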
Interface Methods:
1. Fit
Method Name: fit
Short Description: Abstract Fit Method
Detailed Description: This specifies the necessary input and output parameters for the fit method on all clustering routines. The input parameters contain a clustering data input configuration (source data definition, clustering dimensions, and feature columns) to fit a clustering model to.
Inputs:
| Property | Type | Required | Description |
|---|---|---|---|
| clustering_data_input | #/$defs/ClusteringDataInput | Yes | The data input configuration for the clustering analysis. |
Input Schema (JSON):
{
"$defs": {
"ClusteringDataInput": {
"properties": {
"source_data_definition": {
"$ref": "#/$defs/TabularConnection",
"description": "Source Data Definition",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "SourceDataDefinition",
"title": "Source Data Definition",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"clustering_dimensions": {
"description": "The unique combination of column values that define the \u201centity\u201d that you are trying to compare to others.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"items": {
"type": "string"
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.store.clustering_analysis.clustering.pbm.clustering_pbms:ClusteringDataInput.get_dimension_options",
"options_callback_kwargs": null,
"state_name": "ClusteringDimensions",
"title": "Clustering Dimensions",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "array"
},
"feature_columns": {
"description": "Columns that you want to use to calculate the cluster segments",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"items": {
"type": "string"
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.store.clustering_analysis.clustering.pbm.clustering_pbms:ClusteringDataInput.get_feature_options",
"options_callback_kwargs": null,
"state_name": "FeatureColumns",
"title": "Feature Columns",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "array"
}
},
"required": [
"source_data_definition",
"clustering_dimensions",
"feature_columns"
],
"title": "ClusteringDataInput",
"type": "object"
},
"FileExtensions_": {
"description": "File Extensions.",
"enum": [
".csv",
".tsv",
".psv",
".parquet",
".xlsx"
],
"title": "FileExtensions_",
"type": "string"
},
"FileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_path": {
"description": "The full file path to the file to ingest.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.filetable:FileTabularConnection.get_file_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_path",
"title": "File Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_path"
],
"title": "FileTabularConnection",
"type": "object"
},
"MetaFileSystemConnectionKey": {
"enum": [
"sql-server-routine",
"sql-server-shared"
],
"title": "MetaFileSystemConnectionKey",
"type": "string"
},
"PartitionedFileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_type": {
"$ref": "#/$defs/FileExtensions_",
"description": "The type of files to read from the directory.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "File Type",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"directory_path": {
"description": "The full directory path containing partitioned tabular files.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.partitionedfiletable:PartitionedFileTabularConnection.get_directory_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "Directory Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_type",
"directory_path"
],
"title": "PartitionedFileTabularConnection",
"type": "object"
},
"SqlTabularConnection": {
"properties": {
"database_resource": {
"description": "The name of the database resource to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_resources",
"options_callback_kwargs": null,
"state_name": "database_resource",
"title": "Database Resource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"database_name": {
"description": "The name of the database to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_schemas",
"options_callback_kwargs": null,
"state_name": "database_name",
"title": "Database Name",
"tooltip": "Detail:\nNote: If you don\u2019t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.\n\nValidation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"table_name": {
"description": "The name of the table to use.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_tables",
"options_callback_kwargs": null,
"state_name": "table_name",
"title": "Table Name",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"database_resource",
"database_name",
"table_name"
],
"title": "SqlTabularConnection",
"type": "object"
},
"TabularConnection": {
"description": "A shared parameter base model dedicated to tabular connections.",
"properties": {
"tabular_connection": {
"anyOf": [
{
"$ref": "#/$defs/SqlTabularConnection"
},
{
"$ref": "#/$defs/FileTabularConnection"
},
{
"$ref": "#/$defs/PartitionedFileTabularConnection"
}
],
"description": "The connection type to use to access the source data.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection",
"title": "Connection",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"tabular_connection"
],
"title": "TabularConnection",
"type": "object"
}
},
"properties": {
"clustering_data_input": {
"$ref": "#/$defs/ClusteringDataInput",
"description": "The data input configuration for the clustering analysis.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "ClusteringDataInput",
"title": "Clustering Data Input",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"clustering_data_input"
],
"title": "ClusteringFitParams",
"type": "object"
}
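Concretely, a payload satisfying the required fields of ClusteringFitParams above might look like the following (the database resource, database, table, and column names are hypothetical placeholders):

```python
# Illustrative fit parameters using the SQL Server connection variant.
fit_params = {
    "clustering_data_input": {
        "source_data_definition": {
            "tabular_connection": {
                "database_resource": "analytics_db",       # hypothetical
                "database_name": "sales",                  # hypothetical
                "table_name": "regional_performance",      # hypothetical
            }
        },
        "clustering_dimensions": ["Region", "ProductLine"],     # hypothetical columns
        "feature_columns": ["Revenue", "Margin", "CsatScore"],  # hypothetical columns
    }
}
```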
Artifacts:
| Property | Type | Required | Description |
|---|---|---|---|
| cluster_intersection | unknown | Yes | Parquet file containing data about the clustering intersections and which cluster they belong to. |
| cluster_descriptions | unknown | Yes | Parquet file containing data about the clusters created by the clustering fit method. |
| data_utilized | DataFrame | Yes | Parquet file containing the data utilized in the clustering fit method. |
Artifact Schema (JSON):
{
"additionalProperties": true,
"properties": {
"cluster_intersection": {
"description": "Parquet file containing data about the clustering intersections and which cluster they belong to.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Intersection Results"
},
"cluster_descriptions": {
"description": "Parquet file containing data about the clusters created by the clustering fit method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Descriptions"
},
"data_utilized": {
"description": "Parquet file containing the data utilized in the clustering fit method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Data Utilized",
"type": "DataFrame"
}
},
"required": [
"cluster_intersection",
"cluster_descriptions",
"data_utilized"
],
"title": "ClusteringFitArtifacts",
"type": "object"
}
2. Predict
Method Name: predict
Short Description: Abstract Predict Method
Detailed Description: This specifies the necessary input and output parameters for the predict method on all clustering routines. The input parameters contain a datasource with observations to assign to the fitted clusters.
Inputs:
| Property | Type | Required | Description |
|---|---|---|---|
| datasource | #/$defs/TabularConnection | Yes | Select the datasource containing observations to assign to clusters. |
Input Schema (JSON):
{
"$defs": {
"FileExtensions_": {
"description": "File Extensions.",
"enum": [
".csv",
".tsv",
".psv",
".parquet",
".xlsx"
],
"title": "FileExtensions_",
"type": "string"
},
"FileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_path": {
"description": "The full file path to the file to ingest.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.filetable:FileTabularConnection.get_file_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_path",
"title": "File Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_path"
],
"title": "FileTabularConnection",
"type": "object"
},
"MetaFileSystemConnectionKey": {
"enum": [
"sql-server-routine",
"sql-server-shared"
],
"title": "MetaFileSystemConnectionKey",
"type": "string"
},
"PartitionedFileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_type": {
"$ref": "#/$defs/FileExtensions_",
"description": "The type of files to read from the directory.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "File Type",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"directory_path": {
"description": "The full directory path containing partitioned tabular files.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.partitionedfiletable:PartitionedFileTabularConnection.get_directory_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "Directory Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_type",
"directory_path"
],
"title": "PartitionedFileTabularConnection",
"type": "object"
},
"SqlTabularConnection": {
"properties": {
"database_resource": {
"description": "The name of the database resource to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_resources",
"options_callback_kwargs": null,
"state_name": "database_resource",
"title": "Database Resource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"database_name": {
"description": "The name of the database to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_schemas",
"options_callback_kwargs": null,
"state_name": "database_name",
"title": "Database Name",
"tooltip": "Detail:\nNote: If you don\u2019t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.\n\nValidation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"table_name": {
"description": "The name of the table to use.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_tables",
"options_callback_kwargs": null,
"state_name": "table_name",
"title": "Table Name",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"database_resource",
"database_name",
"table_name"
],
"title": "SqlTabularConnection",
"type": "object"
},
"TabularConnection": {
"description": "A shared parameter base model dedicated to tabular connections.",
"properties": {
"tabular_connection": {
"anyOf": [
{
"$ref": "#/$defs/SqlTabularConnection"
},
{
"$ref": "#/$defs/FileTabularConnection"
},
{
"$ref": "#/$defs/PartitionedFileTabularConnection"
}
],
"description": "The connection type to use to access the source data.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection",
"title": "Connection",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"tabular_connection"
],
"title": "TabularConnection",
"type": "object"
}
},
"properties": {
"datasource": {
"$ref": "#/$defs/TabularConnection",
"description": "Select the datasource containing observations to assign to clusters.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "PredictDataSelection",
"title": "Prediction Datasource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"datasource"
],
"title": "ClusteringAnalysisPredictParameters",
"type": "object"
}
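A matching illustrative predict payload, here using the MetaFileSystem (file) connection variant; the file path is a hypothetical placeholder, while the connection key is one of the enum values defined in the schema above:

```python
predict_params = {
    "datasource": {
        "tabular_connection": {
            "connection_key": "sql-server-shared",          # enum value from the schema
            "file_path": "/data/new_observations.parquet",  # hypothetical path
        }
    }
}
```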
Artifacts:
| Property | Type | Required | Description |
|---|---|---|---|
| cluster_intersection | unknown | Yes | Parquet file containing data about the clustering intersections and which cluster they belong to. |
| data_utilized | DataFrame | Yes | Parquet file containing the data utilized in the clustering predict method. |
Artifact Schema (JSON):
{
"additionalProperties": true,
"properties": {
"cluster_intersection": {
"description": "Parquet file containing data about the clustering intersections and which cluster they belong to.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Intersection Results"
},
"data_utilized": {
"description": "Parquet file containing the data utilized in the clustering predict method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Data Utilized",
"type": "DataFrame"
}
},
"required": [
"cluster_intersection",
"data_utilized"
],
"title": "ClusteringPredictArtifacts",
"type": "object"
}
Developer Docs
Routine Typename: KMeansClusteringAnalysis
| Method Name | Artifact Keys |
|---|---|
| __init__ | N/A |
| fit | cluster_intersection, cluster_descriptions, data_utilized |
| predict | cluster_intersection, data_utilized |
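The fit/predict artifact mapping above reflects the classic two-phase clustering workflow. A self-contained sketch of that lifecycle using plain k-means (illustrative only, not the routine's implementation; the routine adds K-Means++ seeding, feature weighting, and parquet artifact handling on top of this core loop):

```python
import numpy as np

def fit(X: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Fit phase: learn k centroids from the training data (random init for brevity)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every row to its nearest centroid, then recompute centroid means
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids

def predict(X_new: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Predict phase: assign each new observation to its nearest fitted centroid."""
    return np.argmin(((X_new[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
```

As in the routine, predict only makes sense against data with the same feature columns used during fit.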