HDBScanClusteringAnalysis

Basic Information

Class Name: HDBScanClusteringAnalysis

Title: HDBScan

Version: 0.1.0

Author: Christian Reyes Aviña

Organization: OneStream

Creation Date: 2025-09-26

Default Routine Memory Capacity: 2 GB

Tags

Pattern Recognition, Unsupervised, Clustering, Data Analysis

Description

Short Description

HDBSCAN is a density-based clustering technique used to discover natural groupings, or clusters, within your data.

Long Description

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm used to discover natural groupings, or clusters, within your data. It can find clusters of arbitrary shape without the number of clusters being specified in advance, and it explicitly separates noise and outliers from the clusters it forms. These properties make it a popular choice for general-purpose clustering as well as for anomaly and outlier detection.
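
For readers who want to see the algorithm in isolation before using it through this routine, the following is a minimal sketch using the open-source scikit-learn implementation (scikit-learn 1.3 or later); the synthetic data and parameter values are illustrative only and are not tied to this routine's inputs.

# Minimal, standalone HDBSCAN sketch using scikit-learn's implementation
# (scikit-learn >= 1.3). The synthetic data and parameter values are
# illustrative only.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# Three dense blobs plus uniform background noise.
X_blobs, _ = make_blobs(n_samples=900, centers=3, cluster_std=0.6, random_state=0)
X_noise = np.random.default_rng(0).uniform(-12, 12, size=(100, 2))
X = np.vstack([X_blobs, X_noise])

# min_cluster_size and min_samples are the two key density parameters.
model = HDBSCAN(min_cluster_size=15, min_samples=5)
labels = model.fit_predict(X)

# Points that do not belong to any sufficiently dense cluster are labeled -1.
print("clusters found:", sorted(set(labels) - {-1}))
print("noise points:", int((labels == -1).sum()))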

Use Cases

1. Customer Segmentation

You might choose to use this routine to discover niche, high-value segments in your data that traditional methods might overlook, enabling ultra-personalized customer segmentation and targeting strategies. HDBSCAN excels at identifying natural groupings in complex datasets without requiring you to pre-specify the number of clusters, which makes it especially useful in marketing contexts where the true structure of your customer base is unknown. By applying this approach, organizations can uncover hidden patterns that reveal high-value customers, profitable product associations, or underserved groups with unmet needs. This allows businesses to tailor campaigns, pricing strategies, and product development more effectively, ensuring resources are allocated to the most impactful segments. Furthermore, HDBSCAN is highly capable of handling noisy, sparse, or high-dimensional data, making it robust across a variety of industries such as retail, finance, and e-commerce. Its ability to separate meaningful customer clusters from noise gives decision-makers greater confidence in their segmentation efforts, ultimately driving better engagement, higher customer lifetime value, and stronger competitive advantage.

2. Fraud/Anomaly Detection

HDBSCAN is an exceptionally effective and widely adopted algorithm for identifying anomalies, outliers, and fraudulent behavior within complex datasets. Its power comes from its density-based nature, which is perfectly suited to the characteristics of risk and fraud: unusual events are often sparse and do not conform to normal patterns.

Unlike traditional methods that might struggle with messy, high-volume data, HDBSCAN does not try to force every data point into a cluster. Instead, it identifies and forms clusters representing normal, legitimate behavior (e.g., standard customer transactions, typical network traffic, routine sensor readings). The algorithm's crucial advantage for risk management is its explicit handling of noise: any data points that do not meet the minimum density requirement to belong to a stable cluster are automatically and intentionally classified as noise or outliers (marked with a label of -1). These outliers are precisely the transactions, log entries, or sensor readings that deviate from established norms.

Because HDBSCAN can handle clusters of any shape, not just simple spheres, it can accurately model highly irregular yet legitimate patterns of behavior, ensuring that only true deviations are flagged. This capability drastically reduces false positives compared to methods like K-Means, which tend to improperly pull genuine anomalies into the periphery of normal clusters. By isolating these mathematically defined outliers, HDBSCAN provides a powerful, automated mechanism to flag potential fraud or critical system anomalies for immediate, targeted investigation, resulting in quicker response times and enhanced loss prevention.
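
Concretely, the noise label of -1 can serve as a simple anomaly flag. The sketch below assumes a small pandas DataFrame of transaction features and uses the same open-source scikit-learn HDBSCAN class as above; the column names and values are hypothetical.

# Hedged sketch: using HDBSCAN's noise label (-1) as an anomaly flag.
# The DataFrame and its column names are hypothetical examples.
import pandas as pd
from sklearn.cluster import HDBSCAN

transactions = pd.DataFrame({
    "amount":        [12.5, 13.1, 11.9, 950.0, 12.7, 14.2, 13.5, 875.3],
    "hour_of_day":   [9, 10, 9, 3, 11, 10, 9, 4],
    "merchant_risk": [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 0.1, 0.8],
})

labels = HDBSCAN(min_cluster_size=3).fit_predict(transactions)
transactions["cluster"] = labels

# Rows labeled -1 did not fit any dense pattern of "normal" behavior
# and are candidates for targeted fraud review.
suspicious = transactions[transactions["cluster"] == -1]
print(suspicious)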

3. Operations Optimization

HDBSCAN can be applied to operational data to uncover clusters of similar incidents, workflows, or logistics routes, allowing organizations to streamline processes, reduce redundancy, and improve overall efficiency. By automatically grouping related events, the algorithm highlights recurring patterns that may point to inefficiencies, capacity issues, or areas where resources are over- or under-utilized. Importantly, data points that do not fall into any cluster are identified as outliers, which can be especially valuable for flagging unusual disruptions, potential risks, or hidden opportunities. These outliers may represent isolated inefficiencies, rare operational challenges, or innovative cases that could inform future best practices. With its ability to handle noisy, high-dimensional, and large-scale datasets, HDBSCAN empowers teams in logistics, manufacturing, IT operations, and supply chain management to make data-driven decisions that reduce costs, improve service reliability, and enhance resource allocation strategies across the organization.

4. Market Analysis

HDBSCAN can be leveraged to uncover natural clusters of similar products, customers, or services, providing a deeper understanding of how markets behave beyond traditional segmentation techniques. By analyzing these clusters, businesses can identify which groups of products or customers drive the highest demand, where underserved niches exist, and how consumer preferences shift over time. This insight allows organizations to refine pricing strategies, tailor promotions, and create more effective marketing campaigns that resonate with specific audiences. Unlike methods that require a fixed number of clusters, HDBSCAN adapts to the underlying data distribution, which makes it especially valuable when market structures are uncertain or highly dynamic. Outliers identified during clustering may reveal emerging trends, disruptive competitors, or unmet customer needs, providing opportunities to innovate or pivot strategically. Overall, HDBSCAN offers a powerful, data-driven foundation for optimizing pricing, enhancing competitiveness, and making informed strategic decisions in fast-changing markets.

Routine Methods

1. Init (Constructor)
  • Method: __init__
    • Type: Constructor

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: Yes

    • Read Only: No

    • Method Limits: N/A

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Initializes the ClusteringAnalysis routine with the provided API and parameters.
    • Detailed Description:

      • This constructor sets up the instance of the clustering analysis routine. Since HDBSCAN does not require an algorithm-specific constructor implementation, the shared ClusteringConstructor is used instead. A hedged construction sketch appears at the end of this method's section.
    • Inputs:

      • Required Input
        • Deterministic Model Configuration: Whether or not to use deterministic clustering algorithm for this analysis.
          • Name: deterministic_model
          • Tooltip:
            • Detail:
              • Please define if the clustering algorithm is to be deterministic for this analysis.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: bool
        • Minimum Cluster Size: The minimum size of clusters. Defaults to 5 if not provided.
          • Name: min_cluster_size
          • Tooltip:
            • Detail:
              • Minimum cluster size
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Minimum Sample Size: The number of samples in a neighborhood for a point to be considered a core point (including the point itself). Defaults to the value of min_cluster_size if not provided.
          • Name: min_samples
          • Tooltip:
            • Detail:
              • Minimum samples
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Model Configuration: The name of the clustering algorithm to use for this analysis.
          • Name: clustering_algorithm_name
          • Tooltip:
            • Detail:
              • The clustering algorithm to use for this analysis.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Literal
    • Artifacts: No artifacts are returned by this method
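
How this constructor is invoked depends on the hosting framework, so the following is only a rough sketch of how the documented inputs combine; the api handle and the instantiation call itself are hypothetical, while the keyword names mirror the input listing above.

# Rough, hypothetical sketch of the constructor inputs listed above. The api
# handle and the instantiation call are placeholders; only the keyword names
# (deterministic_model, min_cluster_size, min_samples,
# clustering_algorithm_name) come from this documentation.
routine = HDBScanClusteringAnalysis(
    api=api,                              # hypothetical framework handle
    deterministic_model=True,             # request reproducible clustering runs
    min_cluster_size=5,                   # documented default is 5
    min_samples=5,                        # defaults to min_cluster_size when omitted
    clustering_algorithm_name="HDBScan",  # Literal value; the exact string may differ
)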

2. Fit (Method)
  • Method: fit
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: During scale testing this method performed without issues on datasets of up to 900,000 rows and 10 feature columns. Larger datasets may cause a timeout error depending on system resources and the execution environment.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Fits the clustering analysis model to the provided parameters.
    • Detailed Description:

      • This method takes the data configuration provided by the user, including the clustering dimensions and feature columns, and fits the clustering analysis model to it. For HDBSCAN, the number of clusters is discovered from the data rather than specified up front. A hedged parameterization sketch appears at the end of this method's section.
    • Inputs:

      • Required Input
        • Clustering Data Input: The data input configuration for the clustering analysis.
          • Name: clustering_data_input
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Clustering Data Configuration
          • Nested Model: Clustering Data Configuration
            • Required Input
              • Source Data Definition: Source Data Definition.
                • Name: source_data_definition
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be an instance of Tabular Connection
                • Nested Model: Tabular Connection
                  • Required Input
                    • Connection: The connection type to use to access the source data.
                      • Name: tabular_connection
                      • Tooltip:
                        • Validation Constraints:
                          • This input may be subject to other validation constraints at runtime.
                      • Type: Must be one of the following
                        • SQL Server Connection
                          • Required Input
                            • Database Resource: The name of the database resource to connect to.
                              • Name: database_resource
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Database Name: The name of the database to connect to.
                              • Name: database_name
                              • Tooltip:
                                • Detail:
                                  • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Table Name: The name of the table to use.
                              • Name: table_name
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Path: The full file path to the file to ingest.
                              • Name: file_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • Partitioned MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Type: The type of files to read from the directory.
                              • Name: file_type
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: FileExtensions_
                            • Directory Path: The full directory path containing partitioned tabular files.
                              • Name: directory_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
              • Clustering Dimensions: The unique combination of column values that define the “entity” that you are trying to compare to others.
                • Name: clustering_dimensions
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: list[str]
              • Feature Columns: Columns that you want to use to calculate the cluster segments.
                • Name: feature_columns
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: list[str]
    • Artifacts:

      • Clustering Intersection Results: Parquet file containing data about the clustering intersections and which cluster they belong to.

        • Qualified Key Annotation: cluster_intersection
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cluster_intersection/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • Clustering Descriptions: Parquet file containing data about the clusters created by the clustering fit method.

        • Qualified Key Annotation: cluster_descriptions
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cluster_descriptions/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • Data Utilized: Parquet file containing the data utilized in the clustering fit method.

        • Qualified Key Annotation: data_utilized
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@data_utilized/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
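
Pulling the nested input models above together, a fit call might be parameterized roughly as shown below. All connection values and column names are hypothetical; only the structure and field names come from this documentation, and the invocation itself is a sketch.

# Hedged sketch of a fit() parameterization assembled from the nested input
# models above. All connection values and column names are hypothetical; only
# the structure and field names come from this documentation.
fit_params = {
    "clustering_data_input": {
        "source_data_definition": {
            "tabular_connection": {            # SQL Server Connection variant
                "database_resource": "example-sql-resource",
                "database_name": "SalesAnalytics",
                "table_name": "CustomerTransactions",
            }
        },
        # Columns that uniquely identify the "entity" being compared.
        "clustering_dimensions": ["CustomerId", "Region"],
        # Numeric columns used to calculate the cluster segments.
        "feature_columns": ["TotalSpend", "OrderFrequency", "AvgBasketSize"],
    }
}

routine.fit(**fit_params)  # hypothetical invocation of a constructed routine instance
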
3. Predict (Method)
  • Method: predict
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: During scale testing this method performed without issues on datasets of up to 900,000 rows and 10 feature columns. Larger datasets may cause a timeout error depending on system resources and the execution environment.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Makes predictions on the provided data using the fitted model.
    • Detailed Description:

      • This method takes the parameters provided by the user and assigns clusters to the provided data using the fitted model. The data source must contain the same clustering dimensions and feature columns as the data used to fit the model. The method returns a dataframe with the assigned clusters. A hedged invocation sketch appears at the end of this method's section.
    • Inputs:

      • Required Input
        • Prediction Datasource: Select the datasource containing observations to assign to clusters.
          • Name: datasource
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
    • Artifacts:

      • Clustering Intersection Results: Parquet file containing data about the clustering intersections and which cluster they belong to.

        • Qualified Key Annotation: cluster_intersection
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cluster_intersection/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • Data Utilized: Parquet file containing the data utilized in the clustering predict method.

        • Qualified Key Annotation: data_utilized
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@data_utilized/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
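
A predict call follows the same connection pattern, and the resulting cluster assignments are written to the partitioned parquet artifact noted above. The sketch below is illustrative: the connection values and the artifact root path are hypothetical, with the directory layout simply mirroring the documented file annotation pattern.

# Hedged sketch: invoking predict with a tabular connection, then reading the
# partitioned cluster_intersection parquet artifact. Connection values and the
# artifact root path are hypothetical; the directory layout mirrors the
# documented pattern artifacts_/@cluster_intersection/data_/data_<int>.parquet.
import pandas as pd

predict_params = {
    "datasource": {
        "tabular_connection": {                # MetaFileSystem Connection variant
            "connection_key": "sql-server-shared",
            "file_path": "/shared/scoring/new_customers.parquet",
        }
    }
}
routine.predict(**predict_params)              # hypothetical invocation

# If the partitioned files (<= 1,000,000 rows each) are accessible locally,
# pandas/pyarrow can read the whole directory as one DataFrame.
assignments = pd.read_parquet("artifacts_/@cluster_intersection/data_")
print(assignments.head())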

Interface Definitions

1. Clustering Analysis Interface

An interface class requiring fit and predict methods to be implemented.

This BaseRoutineInterface class enforces a common interface for all clustering routines. The interface requires each clustering routine to implement a fit method and a predict method with the same input parameters. Each concrete class has its own constructor, where hyperparameters specific to the clustering algorithm may be set; however, this interface does not enforce any specific constructor method.
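
The contract described here can be pictured as an abstract base class. The sketch below is a simplified, hypothetical stand-in for illustration, not the actual BaseRoutineInterface implementation.

# Simplified, hypothetical stand-in for the interface contract described
# above; the real BaseRoutineInterface belongs to the host framework.
from abc import ABC, abstractmethod
from typing import Any


class ClusteringAnalysisInterface(ABC):
    """Every clustering routine must implement fit and predict with shared inputs."""

    @abstractmethod
    def fit(self, clustering_data_input: dict[str, Any]) -> dict[str, Any]:
        """Fit the clustering model and return the fit artifacts."""

    @abstractmethod
    def predict(self, datasource: dict[str, Any]) -> dict[str, Any]:
        """Assign clusters to new observations and return the predict artifacts."""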

Interface Methods:

1. Fit

Method Name: fit

Short Description: Abstract Fit Method

Detailed Description: This specifies the necessary input and output parameters for the fit method on all clustering routines. The input parameters contain the clustering data configuration (source data definition, clustering dimensions, and feature columns) used to fit the clustering model.

Inputs:

Property | Type | Required | Description
clustering_data_input | #/$defs/ClusteringDataInput | Yes | The data input configuration for the clustering analysis.

Input Schema (JSON):

{
"$defs": {
"ClusteringDataInput": {
"properties": {
"source_data_definition": {
"$ref": "#/$defs/TabularConnection",
"description": "Source Data Definition",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "SourceDataDefinition",
"title": "Source Data Definition",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"clustering_dimensions": {
"description": "The unique combination of column values that define the \u201centity\u201d that you are trying to compare to others.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"items": {
"type": "string"
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.store.clustering_analysis.clustering.pbm.clustering_pbms:ClusteringDataInput.get_dimension_options",
"options_callback_kwargs": null,
"state_name": "ClusteringDimensions",
"title": "Clustering Dimensions",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "array"
},
"feature_columns": {
"description": "Columns that you want to use to calculate the cluster segments",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"items": {
"type": "string"
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.store.clustering_analysis.clustering.pbm.clustering_pbms:ClusteringDataInput.get_feature_options",
"options_callback_kwargs": null,
"state_name": "FeatureColumns",
"title": "Feature Columns",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "array"
}
},
"required": [
"source_data_definition",
"clustering_dimensions",
"feature_columns"
],
"title": "ClusteringDataInput",
"type": "object"
},
"FileExtensions_": {
"description": "File Extensions.",
"enum": [
".csv",
".tsv",
".psv",
".parquet",
".xlsx"
],
"title": "FileExtensions_",
"type": "string"
},
"FileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_path": {
"description": "The full file path to the file to ingest.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.filetable:FileTabularConnection.get_file_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_path",
"title": "File Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_path"
],
"title": "FileTabularConnection",
"type": "object"
},
"MetaFileSystemConnectionKey": {
"enum": [
"sql-server-routine",
"sql-server-shared"
],
"title": "MetaFileSystemConnectionKey",
"type": "string"
},
"PartitionedFileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_type": {
"$ref": "#/$defs/FileExtensions_",
"description": "The type of files to read from the directory.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "File Type",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"directory_path": {
"description": "The full directory path containing partitioned tabular files.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.partitionedfiletable:PartitionedFileTabularConnection.get_directory_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "Directory Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_type",
"directory_path"
],
"title": "PartitionedFileTabularConnection",
"type": "object"
},
"SqlTabularConnection": {
"properties": {
"database_resource": {
"description": "The name of the database resource to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_resources",
"options_callback_kwargs": null,
"state_name": "database_resource",
"title": "Database Resource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"database_name": {
"description": "The name of the database to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_schemas",
"options_callback_kwargs": null,
"state_name": "database_name",
"title": "Database Name",
"tooltip": "Detail:\nNote: If you don\u2019t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.\n\nValidation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"table_name": {
"description": "The name of the table to use.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_tables",
"options_callback_kwargs": null,
"state_name": "table_name",
"title": "Table Name",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"database_resource",
"database_name",
"table_name"
],
"title": "SqlTabularConnection",
"type": "object"
},
"TabularConnection": {
"description": "A shared parameter base model dedication to tabular connections.",
"properties": {
"tabular_connection": {
"anyOf": [
{
"$ref": "#/$defs/SqlTabularConnection"
},
{
"$ref": "#/$defs/FileTabularConnection"
},
{
"$ref": "#/$defs/PartitionedFileTabularConnection"
}
],
"description": "The connection type to use to access the source data.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection",
"title": "Connection",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"tabular_connection"
],
"title": "TabularConnection",
"type": "object"
}
},
"properties": {
"clustering_data_input": {
"$ref": "#/$defs/ClusteringDataInput",
"description": "The data input configuration for the clustering analysis.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "ClusteringDataInput",
"title": "Clustering Data Input",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"clustering_data_input"
],
"title": "ClusteringFitParams",
"type": "object"
}
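
Because this is standard JSON Schema (with $defs and $ref), a candidate payload can be checked locally before submission. The sketch below uses the third-party jsonschema package and assumes the schema above has been saved to a file named fit_schema.json; the payload values are hypothetical.

# Hedged sketch: validating a candidate payload against the schema above with
# the third-party jsonschema package. The file name fit_schema.json and the
# payload values are hypothetical.
import json
from jsonschema import Draft202012Validator

with open("fit_schema.json") as f:
    schema = json.load(f)

payload = {
    "clustering_data_input": {
        "source_data_definition": {
            "tabular_connection": {            # FileTabularConnection variant
                "connection_key": "sql-server-routine",
                "file_path": "/data/customers.parquet",
            }
        },
        "clustering_dimensions": ["CustomerId"],
        "feature_columns": ["TotalSpend", "OrderFrequency"],
    }
}

Draft202012Validator(schema).validate(payload)  # raises ValidationError on a mismatch
print("payload conforms to ClusteringFitParams")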

Artifacts:

Property | Type | Required | Description
cluster_intersection | unknown | Yes | Parquet file containing data about the clustering intersections and which cluster they belong to.
cluster_descriptions | unknown | Yes | Parquet file containing data about the clusters created by the clustering fit method.
data_utilized | DataFrame | Yes | Parquet file containing the data utilized in the clustering fit method.

Artifact Schema (JSON):

{
"additionalProperties": true,
"properties": {
"cluster_intersection": {
"description": "Parquet file containing data about the clustering intersections and which cluster they belong to.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Intersection Results"
},
"cluster_descriptions": {
"description": "Parquet file containing data about the clusters created by the clustering fit method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Descriptions"
},
"data_utilized": {
"description": "Parquet file containing the data utilized in the clustering fit method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Data Utilized",
"type": "DataFrame"
}
},
"required": [
"cluster_intersection",
"cluster_descriptions",
"data_utilized"
],
"title": "ClusteringFitArtifacts",
"type": "object"
}
2. Predict

Method Name: predict

Short Description: Abstract Predict Method

Detailed Description: This specifies the necessary input and output parameters for the predict method on all clustering routines. The input parameters contain the data source whose observations are to be assigned to clusters.

Inputs:

Property | Type | Required | Description
datasource | #/$defs/TabularConnection | Yes | Select the datasource containing observations to assign to clusters.

Input Schema (JSON):

{
"$defs": {
"FileExtensions_": {
"description": "File Extensions.",
"enum": [
".csv",
".tsv",
".psv",
".parquet",
".xlsx"
],
"title": "FileExtensions_",
"type": "string"
},
"FileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_path": {
"description": "The full file path to the file to ingest.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.filetable:FileTabularConnection.get_file_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_path",
"title": "File Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_path"
],
"title": "FileTabularConnection",
"type": "object"
},
"MetaFileSystemConnectionKey": {
"enum": [
"sql-server-routine",
"sql-server-shared"
],
"title": "MetaFileSystemConnectionKey",
"type": "string"
},
"PartitionedFileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_type": {
"$ref": "#/$defs/FileExtensions_",
"description": "The type of files to read from the directory.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "File Type",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"directory_path": {
"description": "The full directory path containing partitioned tabular files.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.partitionedfiletable:PartitionedFileTabularConnection.get_directory_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "Directory Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_type",
"directory_path"
],
"title": "PartitionedFileTabularConnection",
"type": "object"
},
"SqlTabularConnection": {
"properties": {
"database_resource": {
"description": "The name of the database resource to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_resources",
"options_callback_kwargs": null,
"state_name": "database_resource",
"title": "Database Resource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"database_name": {
"description": "The name of the database to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_schemas",
"options_callback_kwargs": null,
"state_name": "database_name",
"title": "Database Name",
"tooltip": "Detail:\nNote: If you don\u2019t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.\n\nValidation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"table_name": {
"description": "The name of the table to use.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_tables",
"options_callback_kwargs": null,
"state_name": "table_name",
"title": "Table Name",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"database_resource",
"database_name",
"table_name"
],
"title": "SqlTabularConnection",
"type": "object"
},
"TabularConnection": {
"description": "A shared parameter base model dedication to tabular connections.",
"properties": {
"tabular_connection": {
"anyOf": [
{
"$ref": "#/$defs/SqlTabularConnection"
},
{
"$ref": "#/$defs/FileTabularConnection"
},
{
"$ref": "#/$defs/PartitionedFileTabularConnection"
}
],
"description": "The connection type to use to access the source data.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection",
"title": "Connection",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"tabular_connection"
],
"title": "TabularConnection",
"type": "object"
}
},
"properties": {
"datasource": {
"$ref": "#/$defs/TabularConnection",
"description": "Select the datasource containing observations to assign to clusters.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "PredictDataSelection",
"title": "Prediction Datasource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"datasource"
],
"title": "ClusteringAnalysisPredictParameters",
"type": "object"
}
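
For comparison, a payload conforming to ClusteringAnalysisPredictParameters is smaller, since only the datasource connection is required. The values below are hypothetical and shown as a Python dictionary for consistency with the earlier sketches.

# Hedged example of a payload conforming to ClusteringAnalysisPredictParameters.
# All connection values are hypothetical; field names follow the schema above.
predict_payload = {
    "datasource": {
        "tabular_connection": {                # SqlTabularConnection variant
            "database_resource": "example-sql-resource",
            "database_name": "SalesAnalytics",
            "table_name": "NewCustomerObservations",
        }
    }
}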

Artifacts:

Property | Type | Required | Description
cluster_intersection | unknown | Yes | Parquet file containing data about the clustering intersections and which cluster they belong to.
data_utilized | DataFrame | Yes | Parquet file containing the data utilized in the clustering predict method.

Artifact Schema (JSON):

{
"additionalProperties": true,
"properties": {
"cluster_intersection": {
"description": "Parquet file containing data about the clustering intersections and which cluster they belong to.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Intersection Results"
},
"data_utilized": {
"description": "Parquet file containing the data utilized in the clustering predict method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Data Utilized",
"type": "DataFrame"
}
},
"required": [
"cluster_intersection",
"data_utilized"
],
"title": "ClusteringPredictArtifacts",
"type": "object"
}

Developer Docs

Routine Typename: HDBScanClusteringAnalysis

Method Name | Artifact Keys
__init__ | N/A
fit | cluster_intersection, cluster_descriptions, data_utilized
predict | cluster_intersection, data_utilized
