HDBScanClusteringAnalysis
Versions
v0.1.0
Basic Information
Class Name: HDBScanClusteringAnalysis
Title: HDBScan
Version: 0.1.0
Author: Christian Reyes Aviña
Organization: OneStream
Creation Date: 2025-09-26
Default Routine Memory Capacity: 2 GB
Tags
Pattern Recognition, Unsupervised, Clustering, Data Analysis
Description
Short Description
HDBSCAN is a sophisticated data analysis technique used to discover natural groupings, or clusters, within your data.
Long Description
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm used to discover natural groupings, or clusters, within your data. It can find clusters of arbitrary shape and handles noise and outliers explicitly, which also makes it a popular choice for anomaly and outlier detection.
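The snippet below is a minimal, self-contained sketch of the idea, not this routine's internal implementation. It assumes scikit-learn 1.3 or newer, which ships an HDBSCAN estimator, and shows how points outside any dense region are labeled -1 (noise):

```python
# Minimal HDBSCAN sketch (assumes scikit-learn >= 1.3); not this routine's code.
import numpy as np
from sklearn.cluster import HDBSCAN

rng = np.random.default_rng(42)
# Two dense blobs plus scattered background noise.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(20, 2)),
])

clusterer = HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(X)

# Points that do not belong to any dense region are labeled -1 (noise).
print("clusters found:", sorted(set(labels) - {-1}))
print("noise points:", int((labels == -1).sum()))
```

Note that the number of clusters is discovered from the data rather than specified up front, which is the property the use cases below rely on.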
Use Cases
1. Customer Segmentation
You might choose to use this routine to discover niche, high-value segments in your data that traditional methods might overlook, enabling ultra-personalized customer segmentation and targeting strategies. HDBSCAN excels at identifying natural groupings in complex datasets without requiring you to pre-specify the number of clusters, which makes it especially useful in marketing contexts where the true structure of your customer base is unknown. By applying this approach, organizations can uncover hidden patterns that reveal high-value customers, profitable product associations, or underserved groups with unmet needs. This allows businesses to tailor campaigns, pricing strategies, and product development more effectively, ensuring resources are allocated to the most impactful segments. Furthermore, HDBSCAN is highly capable of handling noisy, sparse, or high-dimensional data, making it robust across a variety of industries such as retail, finance, and e-commerce. Its ability to separate meaningful customer clusters from noise gives decision-makers greater confidence in their segmentation efforts, ultimately driving better engagement, higher customer lifetime value, and stronger competitive advantage.
2. Fraud/Anomaly Detection
HDBSCAN is an exceptionally effective and widely adopted algorithm for identifying anomalies, outliers, and fraudulent behavior within complex datasets. Its power comes from its density-based nature, which is perfectly suited to the characteristics of risk and fraud: unusual events are often sparse and do not conform to normal patterns.
Unlike traditional methods that might struggle with messy, high-volume data, HDBSCAN doesn't try to force every data point into a cluster. Instead, it accurately identifies and forms clusters representing normal, legitimate behavior (e.g., standard customer transactions, typical network traffic, routine sensor readings).

The algorithm's crucial advantage for risk management is its explicit handling of noise. Any data points that do not meet the minimum density requirement to belong to a stable cluster are automatically and intentionally classified as noise or outliers (marked with a label of -1). These outliers are precisely the transactions, log entries, or sensor readings that deviate from established norms. Because HDBSCAN can handle clusters of any shape, not just simple spheres, it can accurately model highly irregular yet legitimate patterns of behavior, ensuring that only true deviations are flagged. This capability drastically reduces false positives compared to methods like K-Means, which tend to improperly pull genuine anomalies into the periphery of normal clusters. By isolating these mathematically defined outliers, HDBSCAN provides a powerful, automated mechanism to flag potential fraud or critical system anomalies for immediate, targeted investigation. This results in quicker response times and enhanced loss prevention.
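As a hedged illustration of the noise-label mechanism described above, the sketch below flags rows that HDBSCAN leaves unclustered (label -1) as review candidates. The column names and values are illustrative placeholders, not part of this routine:

```python
# Hedged sketch: use HDBSCAN noise labels (-1) as anomaly review candidates.
# Column names and values are illustrative only.
import pandas as pd
from sklearn.cluster import HDBSCAN

transactions = pd.DataFrame({
    "amount":    [12.0, 13.5, 11.8, 12.4, 950.0, 14.1, 12.9, 13.3],
    "num_items": [2,    3,    2,    2,    40,    3,    2,    3],
})

labels = HDBSCAN(min_cluster_size=3).fit_predict(transactions[["amount", "num_items"]])
transactions["cluster"] = labels

# Rows labeled -1 did not fit any dense cluster of "normal" behavior and are
# the candidates to route to a fraud/anomaly review queue.
suspicious = transactions[transactions["cluster"] == -1]
print(suspicious)
```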
3. Operations Optimization
HDBSCAN can be applied to operational data to uncover clusters of similar incidents, workflows, or logistics routes, allowing organizations to streamline processes, reduce redundancy, and improve overall efficiency. By automatically grouping related events, the algorithm highlights recurring patterns that may point to inefficiencies, capacity issues, or areas where resources are over- or under-utilized. Importantly, data points that do not fall into any cluster are identified as outliers, which can be especially valuable for flagging unusual disruptions, potential risks, or hidden opportunities. These outliers may represent isolated inefficiencies, rare operational challenges, or innovative cases that could inform future best practices. With its ability to handle noisy, high-dimensional, and large-scale datasets, HDBSCAN empowers teams in logistics, manufacturing, IT operations, and supply chain management to make data-driven decisions that reduce costs, improve service reliability, and enhance resource allocation strategies across the organization.
4. Market Analysis
HDBSCAN can be leveraged to uncover natural clusters of similar products, customers, or services, providing a deeper understanding of how markets behave beyond traditional segmentation techniques. By analyzing these clusters, businesses can identify which groups of products or customers drive the highest demand, where underserved niches exist, and how consumer preferences shift over time. This insight allows organizations to refine pricing strategies, tailor promotions, and create more effective marketing campaigns that resonate with specific audiences. Unlike methods that require a fixed number of clusters, HDBSCAN adapts to the underlying data distribution, which makes it especially valuable when market structures are uncertain or highly dynamic. Outliers identified during clustering may reveal emerging trends, disruptive competitors, or unmet customer needs, providing opportunities to innovate or pivot strategically. Overall, HDBSCAN offers a powerful, data-driven foundation for optimizing pricing, enhancing competitiveness, and making informed strategic decisions in fast-changing markets.
Routine Methods
1. Init (Constructor)
- Method: __init__
- Type: Constructor
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: Yes
- Read Only: No
- Method Limits: N/A
- Outputs Dynamic Artifacts: No
- Short Description: Initializes the ClusteringAnalysis routine with the provided API and parameters.
- Detailed Description: This constructor sets up this instance of the clustering analysis routine. Since HDBSCAN is a non-parameterized model, it does not need its own constructor implementation; the shared ClusteringConstructor is used instead.
- Inputs:
  - Required Inputs:
    - Deterministic Model Configuration: Whether or not to use a deterministic clustering algorithm for this analysis.
      - Name: deterministic_model
      - Tooltip: Please define if the clustering algorithm is to be deterministic for this analysis.
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: bool
    - Minimum Cluster Size: The minimum size of clusters. Defaults to 5 if not provided.
      - Name: min_cluster_size
      - Tooltip: Minimum cluster size.
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: int
    - Minimum Sample Size: The number of samples in a neighborhood for a point to be considered a core point (including the point itself). Defaults to the value of min_cluster_size if not provided.
      - Name: min_samples
      - Tooltip: Minimum samples.
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: int
    - Model Configuration: The name of the clustering algorithm to use for this analysis.
      - Name: clustering_algorithm_name
      - Tooltip: The clustering algorithm to use for this analysis.
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: Literal
- Artifacts: No artifacts are returned by this method.
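To make the defaulting behavior above concrete, the snippet below is a hypothetical sketch: the parameter names mirror the inputs listed here, but the resolution logic is not taken from the routine's source. min_cluster_size falls back to 5 and min_samples falls back to min_cluster_size when not supplied:

```python
# Hypothetical sketch of the documented defaults; not the routine's actual code.
from typing import Optional

def resolve_hdbscan_params(
    min_cluster_size: Optional[int] = None,
    min_samples: Optional[int] = None,
) -> dict:
    # Minimum Cluster Size defaults to 5 if not provided.
    resolved_cluster_size = 5 if min_cluster_size is None else min_cluster_size
    # Minimum Sample Size defaults to the value of min_cluster_size if not provided.
    resolved_samples = resolved_cluster_size if min_samples is None else min_samples
    return {"min_cluster_size": resolved_cluster_size, "min_samples": resolved_samples}

print(resolve_hdbscan_params())                     # {'min_cluster_size': 5, 'min_samples': 5}
print(resolve_hdbscan_params(min_cluster_size=10))  # {'min_cluster_size': 10, 'min_samples': 10}
```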
2. Fit (Method)
- Method: fit
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: During scale testing this method performed with datasets up to 900,000 rows and 10 feature columns without issues. Larger datasets may cause a timeout error depending on system resources and execution environment.
- Outputs Dynamic Artifacts: No
- Short Description: Fits the clustering analysis model to the provided parameters.
- Detailed Description: This method will take the parameters provided by the user and fit the clustering analysis model to them. This will include clustering dimensions, feature dimensions, etc. The user can specify the number of clusters, the clustering algorithm, and the feature weighting method to use for the analysis.
- Inputs:
  - Required Inputs:
    - Clustering Data Input: The data input configuration for the clustering analysis.
      - Name: clustering_data_input
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Clustering Data Configuration
      - Nested Model: Clustering Data Configuration (required inputs):
        - Source Data Definition: Source Data Definition.
          - Name: source_data_definition
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
          - Type: Must be an instance of Tabular Connection
          - Nested Model: Tabular Connection (required inputs):
            - Connection: The connection type to use to access the source data.
              - Name: tabular_connection
              - Validation Constraints: This input may be subject to other validation constraints at runtime.
              - Type: Must be one of the following:
                - SQL Server Connection (required inputs):
                  - Database Resource: The name of the database resource to connect to.
                    - Name: database_resource
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Database Name: The name of the database to connect to.
                    - Name: database_name
                    - Detail: Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Table Name: The name of the table to use.
                    - Name: table_name
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
                - MetaFileSystem Connection (required inputs):
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Path: The full file path to the file to ingest.
                    - Name: file_path
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
                - Partitioned MetaFileSystem Connection (required inputs):
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Type: The type of files to read from the directory.
                    - Name: file_type
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: FileExtensions_
                  - Directory Path: The full directory path containing partitioned tabular files.
                    - Name: directory_path
                    - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
        - Clustering Dimensions: The unique combination of column values that define the “entity” that you are trying to compare to others.
          - Name: clustering_dimensions
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
          - Type: list[str]
        - Feature Columns: Columns that you want to use to calculate the cluster segments.
          - Name: feature_columns
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
          - Type: list[str]
- Artifacts:
  - Clustering Intersection Results: Parquet file containing data about the clustering intersections and which cluster they belong to.
    - Qualified Key Annotation: cluster_intersection
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cluster_intersection/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
  - Clustering Descriptions: Parquet file containing data about the clusters created by the clustering fit method.
    - Qualified Key Annotation: cluster_descriptions
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cluster_descriptions/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
  - Data Utilized: Parquet file containing the data utilized in the clustering fit method.
    - Qualified Key Annotation: data_utilized
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@data_utilized/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
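The file annotation pattern above means each artifact is written as a partitioned set of parquet files. Below is a hedged sketch of collecting such a partition back into a single DataFrame; the local directory path is illustrative and stands in for wherever the downloaded artifact files land:

```python
# Hedged sketch: read a partitioned parquet artifact back into one DataFrame.
# The directory path is illustrative; it stands in for wherever the
# cluster_intersection files (data_0.parquet, data_1.parquet, ...) are stored.
from pathlib import Path
import pandas as pd

artifact_dir = Path("artifacts_/@cluster_intersection/data_")
parts = sorted(artifact_dir.glob("data_*.parquet"))

# Each file holds at most 1,000,000 rows; concatenate the partitions.
cluster_intersection = pd.concat(
    (pd.read_parquet(p) for p in parts), ignore_index=True
)
print(cluster_intersection.shape)
```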
3. Predict (Method)
- Method: predict
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: During scale testing this method performed with datasets up to 900,000 rows and 10 feature columns without issues. Larger datasets may cause a timeout error depending on system resources and execution environment.
- Outputs Dynamic Artifacts: No
- Short Description: Makes predictions on the provided data using the fitted model.
- Detailed Description: This method will take the parameters provided by the user and make predictions on the provided data using the fitted model. The user must provide a data source that contains the same clustering dimensions and feature dimensions as the data used to fit the model; a sketch of this column check appears after the artifact list below. The method will return a dataframe with the assigned clusters.
- Inputs:
  - Required Inputs:
    - Prediction Datasource: Select the datasource containing observations to assign to clusters.
      - Name: datasource
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection (required inputs):
        - Connection: The connection type to use to access the source data.
          - Name: tabular_connection
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
          - Type: Must be one of the following:
            - SQL Server Connection (required inputs):
              - Database Resource: The name of the database resource to connect to.
                - Name: database_resource
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: str
              - Database Name: The name of the database to connect to.
                - Name: database_name
                - Detail: Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: str
              - Table Name: The name of the table to use.
                - Name: table_name
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: str
            - MetaFileSystem Connection (required inputs):
              - Connection Key: The MetaFileSystem connection key.
                - Name: connection_key
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: MetaFileSystemConnectionKey
              - File Path: The full file path to the file to ingest.
                - Name: file_path
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: str
            - Partitioned MetaFileSystem Connection (required inputs):
              - Connection Key: The MetaFileSystem connection key.
                - Name: connection_key
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: MetaFileSystemConnectionKey
              - File Type: The type of files to read from the directory.
                - Name: file_type
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: FileExtensions_
              - Directory Path: The full directory path containing partitioned tabular files.
                - Name: directory_path
                - Validation Constraints: This input may be subject to other validation constraints at runtime.
                - Type: str
- Artifacts:
  - Clustering Intersection Results: Parquet file containing data about the clustering intersections and which cluster they belong to.
    - Qualified Key Annotation: cluster_intersection
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cluster_intersection/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
  - Data Utilized: Parquet file containing the data utilized in the clustering predict method.
    - Qualified Key Annotation: data_utilized
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@data_utilized/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
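As noted in the predict method description above, the prediction datasource must expose the same clustering dimensions and feature columns used during fit. The following is a minimal, hypothetical pre-check; the column names are illustrative and not part of the routine:

```python
# Hypothetical pre-check before calling predict: confirm the prediction data
# carries the same clustering dimensions and feature columns used in fit.
import pandas as pd

clustering_dimensions = ["customer_id", "region"]   # illustrative
feature_columns = ["total_spend", "order_count"]    # illustrative

prediction_df = pd.DataFrame(
    columns=["customer_id", "region", "total_spend", "order_count"]
)

missing = [
    col for col in clustering_dimensions + feature_columns
    if col not in prediction_df.columns
]
if missing:
    raise ValueError(f"Prediction datasource is missing required columns: {missing}")
print("Prediction datasource matches the fitted schema.")
```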
Interface Definitions
1. Clustering Analysis Interface
An interface class requiring fit and predict methods to be implemented.
This BaseRoutineInterface class enforces a common interface for all clustering routines. The interface requires each clustering routine to implement a fit method and a predict method with the same input parameters. Each concrete class will have constructor methods where hyperparameters specific to the clustering algorithm may be set; however, this interface does not enforce any specific constructor method.
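The sketch below is a hedged illustration of what such an interface can look like in Python. The class, parameter, and artifact names follow this documentation, but this is not the platform's actual BaseRoutineInterface source:

```python
# Hedged sketch of a clustering routine interface; illustrative only.
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class ClusteringAnalysisInterface(ABC):
    """Every clustering routine must implement fit and predict with shared inputs."""

    @abstractmethod
    def fit(self, clustering_data_input: Dict[str, Any]) -> Dict[str, Any]:
        """Fit the clustering model and return its artifacts."""

    @abstractmethod
    def predict(self, datasource: Dict[str, Any]) -> Dict[str, Any]:
        """Assign observations from the datasource to fitted clusters."""


# Constructors are intentionally left out of the interface: each concrete
# routine may expose its own hyperparameters (e.g. min_cluster_size and
# min_samples for an HDBSCAN-backed routine), as described in this document.
class HDBScanClusteringAnalysisSketch(ClusteringAnalysisInterface):
    def __init__(self, min_cluster_size: int = 5, min_samples: Optional[int] = None):
        self.min_cluster_size = min_cluster_size
        self.min_samples = min_samples if min_samples is not None else min_cluster_size

    def fit(self, clustering_data_input: Dict[str, Any]) -> Dict[str, Any]:
        # Artifact keys mirror the developer docs table for this routine.
        return {"cluster_intersection": None, "cluster_descriptions": None, "data_utilized": None}

    def predict(self, datasource: Dict[str, Any]) -> Dict[str, Any]:
        return {"cluster_intersection": None, "data_utilized": None}
```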
Interface Methods:
1. Fit
Method Name: fit
Short Description: Abstract Fit Method
Detailed Description: This specifies the necessary input and output parameters for the fit method on all clustering routines. The input parameters contain the clustering data input (source data definition, clustering dimensions, and feature columns) used to fit the clustering model.
Inputs:
| Property | Type | Required | Description |
|---|---|---|---|
| clustering_data_input | #/$defs/ClusteringDataInput | Yes | The data input configuration for the clustering analysis. |
Input Schema (JSON):
{
"$defs": {
"ClusteringDataInput": {
"properties": {
"source_data_definition": {
"$ref": "#/$defs/TabularConnection",
"description": "Source Data Definition",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "SourceDataDefinition",
"title": "Source Data Definition",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"clustering_dimensions": {
"description": "The unique combination of column values that define the \u201centity\u201d that you are trying to compare to others.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"items": {
"type": "string"
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.store.clustering_analysis.clustering.pbm.clustering_pbms:ClusteringDataInput.get_dimension_options",
"options_callback_kwargs": null,
"state_name": "ClusteringDimensions",
"title": "Clustering Dimensions",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "array"
},
"feature_columns": {
"description": "Columns that you want to use to calculate the cluster segments",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"items": {
"type": "string"
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.store.clustering_analysis.clustering.pbm.clustering_pbms:ClusteringDataInput.get_feature_options",
"options_callback_kwargs": null,
"state_name": "FeatureColumns",
"title": "Feature Columns",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "array"
}
},
"required": [
"source_data_definition",
"clustering_dimensions",
"feature_columns"
],
"title": "ClusteringDataInput",
"type": "object"
},
"FileExtensions_": {
"description": "File Extensions.",
"enum": [
".csv",
".tsv",
".psv",
".parquet",
".xlsx"
],
"title": "FileExtensions_",
"type": "string"
},
"FileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_path": {
"description": "The full file path to the file to ingest.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.filetable:FileTabularConnection.get_file_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_path",
"title": "File Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_path"
],
"title": "FileTabularConnection",
"type": "object"
},
"MetaFileSystemConnectionKey": {
"enum": [
"sql-server-routine",
"sql-server-shared"
],
"title": "MetaFileSystemConnectionKey",
"type": "string"
},
"PartitionedFileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_type": {
"$ref": "#/$defs/FileExtensions_",
"description": "The type of files to read from the directory.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "File Type",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"directory_path": {
"description": "The full directory path containing partitioned tabular files.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.partitionedfiletable:PartitionedFileTabularConnection.get_directory_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "Directory Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_type",
"directory_path"
],
"title": "PartitionedFileTabularConnection",
"type": "object"
},
"SqlTabularConnection": {
"properties": {
"database_resource": {
"description": "The name of the database resource to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_resources",
"options_callback_kwargs": null,
"state_name": "database_resource",
"title": "Database Resource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"database_name": {
"description": "The name of the database to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_schemas",
"options_callback_kwargs": null,
"state_name": "database_name",
"title": "Database Name",
"tooltip": "Detail:\nNote: If you don\u2019t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.\n\nValidation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"table_name": {
"description": "The name of the table to use.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_tables",
"options_callback_kwargs": null,
"state_name": "table_name",
"title": "Table Name",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"database_resource",
"database_name",
"table_name"
],
"title": "SqlTabularConnection",
"type": "object"
},
"TabularConnection": {
"description": "A shared parameter base model dedication to tabular connections.",
"properties": {
"tabular_connection": {
"anyOf": [
{
"$ref": "#/$defs/SqlTabularConnection"
},
{
"$ref": "#/$defs/FileTabularConnection"
},
{
"$ref": "#/$defs/PartitionedFileTabularConnection"
}
],
"description": "The connection type to use to access the source data.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection",
"title": "Connection",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"tabular_connection"
],
"title": "TabularConnection",
"type": "object"
}
},
"properties": {
"clustering_data_input": {
"$ref": "#/$defs/ClusteringDataInput",
"description": "The data input configuration for the clustering analysis.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "ClusteringDataInput",
"title": "Clustering Data Input",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"clustering_data_input"
],
"title": "ClusteringFitParams",
"type": "object"
}
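For orientation, the snippet below is a hedged example of a payload shaped like the ClusteringFitParams schema above, using the SqlTabularConnection variant. All database, table, and column names are illustrative placeholders:

```python
# Illustrative ClusteringFitParams-shaped payload (placeholder names throughout).
example_fit_params = {
    "clustering_data_input": {
        "source_data_definition": {
            "tabular_connection": {
                # SqlTabularConnection variant
                "database_resource": "example_resource",
                "database_name": "example_database",
                "table_name": "customer_metrics",
            }
        },
        "clustering_dimensions": ["customer_id"],
        "feature_columns": ["total_spend", "order_count", "avg_basket_size"],
    }
}
```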
Artifacts:
| Property | Type | Required | Description |
|---|---|---|---|
| cluster_intersection | unknown | Yes | Parquet file containing data about the clustering intersections and which cluster they belong to. |
| cluster_descriptions | unknown | Yes | Parquet file containing data about the clusters created by the clustering fit method. |
| data_utilized | DataFrame | Yes | Parquet file containing the data utilized in the clustering fit method. |
Artifact Schema (JSON):
{
"additionalProperties": true,
"properties": {
"cluster_intersection": {
"description": "Parquet file containing data about the clustering intersections and which cluster they belong to.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Intersection Results"
},
"cluster_descriptions": {
"description": "Parquet file containing data about the clusters created by the clustering fit method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Descriptions"
},
"data_utilized": {
"description": "Parquet file containing the data utilized in the clustering fit method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Data Utilized",
"type": "DataFrame"
}
},
"required": [
"cluster_intersection",
"cluster_descriptions",
"data_utilized"
],
"title": "ClusteringFitArtifacts",
"type": "object"
}
2. Predict
Method Name: predict
Short Description: Abstract Predict Method
Detailed Description: This specifies the necessary input and output parameters for the predict method on all clustering routines. The input parameters contain the datasource holding the observations to assign to clusters.
Inputs:
| Property | Type | Required | Description |
|---|---|---|---|
| datasource | #/$defs/TabularConnection | Yes | Select the datasource containing observations to assign to clusters. |
Input Schema (JSON):
{
"$defs": {
"FileExtensions_": {
"description": "File Extensions.",
"enum": [
".csv",
".tsv",
".psv",
".parquet",
".xlsx"
],
"title": "FileExtensions_",
"type": "string"
},
"FileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_path": {
"description": "The full file path to the file to ingest.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.filetable:FileTabularConnection.get_file_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_path",
"title": "File Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_path"
],
"title": "FileTabularConnection",
"type": "object"
},
"MetaFileSystemConnectionKey": {
"enum": [
"sql-server-routine",
"sql-server-shared"
],
"title": "MetaFileSystemConnectionKey",
"type": "string"
},
"PartitionedFileTabularConnection": {
"properties": {
"connection_key": {
"$ref": "#/$defs/MetaFileSystemConnectionKey",
"description": "The MetaFileSystem connection key.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection_key",
"title": "Connection Key",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"file_type": {
"$ref": "#/$defs/FileExtensions_",
"description": "The type of files to read from the directory.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "File Type",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
},
"directory_path": {
"description": "The full directory path containing partitioned tabular files.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.partitionedfiletable:PartitionedFileTabularConnection.get_directory_path_bound_options",
"options_callback_kwargs": null,
"state_name": "file_info",
"title": "Directory Path",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"connection_key",
"file_type",
"directory_path"
],
"title": "PartitionedFileTabularConnection",
"type": "object"
},
"SqlTabularConnection": {
"properties": {
"database_resource": {
"description": "The name of the database resource to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_resources",
"options_callback_kwargs": null,
"state_name": "database_resource",
"title": "Database Resource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"database_name": {
"description": "The name of the database to connect to.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_database_schemas",
"options_callback_kwargs": null,
"state_name": "database_name",
"title": "Database Name",
"tooltip": "Detail:\nNote: If you don\u2019t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.\n\nValidation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
},
"table_name": {
"description": "The name of the table to use.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": "xperiflow.source.app.routines.pbm.store.conn.sqltable:SqlTabularConnection.get_tables",
"options_callback_kwargs": null,
"state_name": "table_name",
"title": "Table Name",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime.",
"type": "string"
}
},
"required": [
"database_resource",
"database_name",
"table_name"
],
"title": "SqlTabularConnection",
"type": "object"
},
"TabularConnection": {
"description": "A shared parameter base model dedication to tabular connections.",
"properties": {
"tabular_connection": {
"anyOf": [
{
"$ref": "#/$defs/SqlTabularConnection"
},
{
"$ref": "#/$defs/FileTabularConnection"
},
{
"$ref": "#/$defs/PartitionedFileTabularConnection"
}
],
"description": "The connection type to use to access the source data.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "connection",
"title": "Connection",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"tabular_connection"
],
"title": "TabularConnection",
"type": "object"
}
},
"properties": {
"datasource": {
"$ref": "#/$defs/TabularConnection",
"description": "Select the datasource containing observations to assign to clusters.",
"field_type": "input",
"input_component": {
"component_type": "combobox",
"show_search": true
},
"long_description": null,
"options_callback": null,
"options_callback_kwargs": null,
"state_name": "PredictDataSelection",
"title": "Prediction Datasource",
"tooltip": "Validation Constraints:\nThis input may be subject to other validation constraints at runtime."
}
},
"required": [
"datasource"
],
"title": "ClusteringAnalysisPredictParameters",
"type": "object"
}
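A matching hedged example of a payload shaped like the ClusteringAnalysisPredictParameters schema above, this time using the FileTabularConnection variant. The connection key is taken from the MetaFileSystemConnectionKey enum in the schema and the file path is a placeholder:

```python
# Illustrative ClusteringAnalysisPredictParameters-shaped payload.
example_predict_params = {
    "datasource": {
        "tabular_connection": {
            # FileTabularConnection variant
            "connection_key": "sql-server-shared",             # from MetaFileSystemConnectionKey
            "file_path": "incoming/customer_metrics.parquet",  # placeholder path
        }
    }
}
```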
Artifacts:
| Property | Type | Required | Description |
|---|---|---|---|
| cluster_intersection | unknown | Yes | Parquet file containing data about the clustering intersections and which cluster they belong to. |
| data_utilized | DataFrame | Yes | Parquet file containing the data utilized in the clustering predict method. |
Artifact Schema (JSON):
{
"additionalProperties": true,
"properties": {
"cluster_intersection": {
"description": "Parquet file containing data about the clustering intersections and which cluster they belong to.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Clustering Intersection Results"
},
"data_utilized": {
"description": "Parquet file containing the data utilized in the clustering predict method.",
"io_factory_kwargs": {},
"preview_factory_kwargs": null,
"preview_factory_type": null,
"statistic_factory_kwargs": null,
"statistic_factory_type": null,
"title": "Data Utilized",
"type": "DataFrame"
}
},
"required": [
"cluster_intersection",
"data_utilized"
],
"title": "ClusteringPredictArtifacts",
"type": "object"
}
Developer Docs
Routine Typename: HDBScanClusteringAnalysis
| Method Name | Artifact Keys |
|---|---|
| __init__ | N/A |
| fit | cluster_intersection, cluster_descriptions, data_utilized |
| predict | cluster_intersection, data_utilized |