RPETargetSubsampler
Versions
v1.0.0
Basic Information
Class Name: RPETargetSubsampler
Title: Target Sub-Sampler
Version: 1.0.0
Author: Christian Reyes-Avina
Organization: OneStream
Creation Date: 2025-03-19
Default Routine Memory Capacity: 2.0 GB
Tags
Sensible AI Forecast, Data Cleansing, Data Preprocessing, Clustering
Description
Short Description
A routine to sub-sample targets from a dataset using various sampling methods.
Long Description
This routine sub-samples targets from a dataset using one of three sampling methods: dynamic time warping, semi-random sampling, or significance breakdown sampling. In dynamic time warping, the routine uses hierarchical clustering to group similar time series and selects the medoid of each cluster. In semi-random sampling, the routine randomly selects targets from the dataset based on user-defined dimensions and values. In significance breakdown sampling, the routine selects targets based on the significance of their aggregated values. The routine supports flexible dimension grouping and dynamic control over how many targets to retain, either via a fixed count or a percentile-based threshold. By leveraging this sub-sampling routine, users can ensure that their models train on the most informative and varied subsets of time series data.
Use Cases
1. Sub-sample targets from a large dataset.
In large-scale time series datasets, analyzing or modeling every individual time series can be computationally expensive and is often unnecessary. This routine enables users to intelligently downsample their dataset by identifying the most representative time series targets using Dynamic Time Warping (DTW), Significance Breakdown Sampling, or Semi-Random Sampling. For example, a dataset with daily sales from 2,000 stores across 12 countries can be reduced to 200 stores that capture the core behavioral patterns across geographies, store types, and other dimensions. This is particularly useful for model prototyping, time series forecasting, or simulation scenarios where training time, cost, or complexity needs to be minimized without compromising data representativeness.
Routine Methods
1. Dynamic Time Warping (Method)
- Method: dynamic_time_warping
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: Dynamic time warping is computationally expensive, yielding significantly longer runtimes than the other methods in this routine. On a dataset with 8K targets, 1.5M rows, and 6 columns, this method completed in about 45 minutes with 35 GB of memory allocated. However, on a dataset with 15K targets, 7.5M rows, and 6 columns, the run timed out after 5 hours with no updates. The method scales quadratically in the number of targets (roughly n(n-1)/2 pairwise DTW computations: about 32M pairs for 8K targets versus about 112M pairs for 15K targets, roughly 3.5x the work), so cost grows very quickly as targets are added.
- Outputs Dynamic Artifacts: No
- Short Description: Selects the most representative targets using Dynamic Time Warping (DTW) and hierarchical clustering.
- Detailed Description: This method performs sub-selection by analyzing the similarity between time series across different targets. It begins by pivoting the dataset into a time series matrix based on the specified dimensions, filling any missing values using the provided fill method. The resulting time series are then standardized to ensure consistent scale. Next, the method computes a DTW distance matrix representing pairwise similarity between all time series. Hierarchical clustering is applied to this matrix to group similar time series together. The number of clusters can be determined in three ways:
  - If neither the number of clusters nor the percentile is specified, the method uses the elbow method on intra-cluster distances to determine the optimal number of clusters.
  - If the number of clusters is specified, the method generates up to that number of clusters.
  - If a percentile is specified, the number of clusters is derived as a percentage of the total targets (minimum 2).
  Within each cluster, the medoid is selected as the most representative target, defined as the time series with the lowest total DTW distance to the others in its cluster. The method returns:
  - A filtered dataset containing only the selected medoid targets.
  - The full dataset with two new columns: target_is_in_subselection, indicating whether a target is part of the selected subset, and cluster, the cluster assignment for each target.
  - A list of the selected target names.
  The routine also handles cases where the specified dimension(s) generate a pivot table with only one target column.
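For orientation, the pivot → standardize → DTW → cluster → medoid flow can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not the routine's implementation: `pivot` is assumed to be the already-pivoted time series matrix (one column per target), the `dtw_distance` and `select_medoids` helpers are hypothetical names, and the elbow-method and percentile logic for choosing the cluster count is omitted.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def select_medoids(pivot: pd.DataFrame, num_clusters: int) -> list[str]:
    """Pick one representative (medoid) target per cluster of similar series."""
    # Standardize each target's series so clustering reflects shape, not scale.
    z = (pivot - pivot.mean()) / pivot.std(ddof=0)
    series = [z[col].to_numpy() for col in z.columns]
    n = len(series)

    # Pairwise DTW distances: n * (n - 1) / 2 computations -- the quadratic
    # cost called out in the method limits above.
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])

    # Hierarchical clustering on the condensed distance matrix.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=num_clusters, criterion="maxclust")

    # Medoid = cluster member with the lowest total DTW distance to the rest.
    medoids = []
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        medoids.append(pivot.columns[idx[dist[np.ix_(idx, idx)].sum(axis=1).argmin()]])
    return medoids
```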
- Inputs:
  - Source Data Definition (Required): The source data definition.
    - Name: source_data_definition
    - Type: Must be an instance of Time Series Source Data
    - Nested Model: Time Series Source Data
    - Validation Constraints:
      - This input may be subject to other validation constraints at runtime.
    - Connection (Required): The connection to the source data.
      - Name: data_connection
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
      - Connection (Required): The connection type to use to access the source data.
        - Name: tabular_connection
        - Type: Must be one of the following: SQL Server Connection, MetaFileSystem Connection, Partitioned MetaFileSystem Connection
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
        - SQL Server Connection:
          - Database Resource (Required): The name of the database resource to connect to.
            - Name: database_resource
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Database Name (Required): The name of the database to connect to.
            - Name: database_name
            - Detail: Note: If you don't see the database name that you are looking for in this list, it is recommended that you first move the data to be used into a database that is available within this list.
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Table Name (Required): The name of the table to use.
            - Name: table_name
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
        - MetaFileSystem Connection:
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - File Path (Required): The full file path to the file to ingest.
            - Name: file_path
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
        - Partitioned MetaFileSystem Connection:
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - File Type (Required): The type of files to read from the directory.
            - Name: file_type
            - Type: FileExtensions_
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Directory Path (Required): The full directory path containing partitioned tabular files.
            - Name: directory_path
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
      - Dimension Columns (Required): The columns to use as dimensions.
        - Name: dimension_columns
        - Type: list[str]
        - Validation Constraints:
          - The input must have a minimum length of 1.
          - This input may be subject to other validation constraints at runtime.
      - Date Column (Required): The column to use as the date.
        - Name: date_column
        - Detail: The date column must be in a DateTime-readable format.
        - Type: str
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Value Column (Required): The column to use as the value.
        - Name: value_column
        - Detail: The value column must be a numeric (int, float, double, decimal, etc.) column.
        - Type: str
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
    - Dimension Sub-Selections (Required): The dimension column(s) to group the time series.
      - Name: dimension_subselections
      - Type: list[str]
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Fill Method (Required): The method to use to fill in missing data when pivoting time series.
      - Name: fill_method
      - Type: FillMethod_
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Custom Fill Value (Required): The custom value to use when filling missing values in the time series pivot table.
      - Name: custom_fill_value
      - Type: int | float | NoneType
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
  - Number of Clusters (Optional): The number of clusters to generate.
    - Name: num_clusters
    - Type: Optional[int]
    - Validation Constraints:
      - The input must be greater than 0.
      - This input may be subject to other validation constraints at runtime.
  - Percentile (Optional): The percentile of clusters to retain for cluster assignment.
    - Name: percentile
    - Type: Optional[float]
    - Validation Constraints:
      - The input must be greater than 0.
      - The input must be less than or equal to 100.
      - This input may be subject to other validation constraints at runtime.
- Artifacts:
  - List of Target Names: The subselection of targets.
    - Qualified Key Annotation: list_of_target_names
    - Aggregate Artifact: False
    - In-Memory Json Accessible: True
    - File Annotations: artifacts_/@list_of_target_names/data_/list.json (a JSON list object stored in a JSON file)
  - Full Target Dataset: The original dataset with an additional column stating whether the target is included in the subselection sample.
    - Qualified Key Annotation: full_target_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@full_target_dataset/data_/data_<int>.parquet (a partitioned set of parquet files, each with no more than 1,000,000 rows)
  - Filtered Target Dataset: The original dataset filtered down to include only the selected targets.
    - Qualified Key Annotation: filtered_target_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@filtered_target_dataset/data_/data_<int>.parquet (a partitioned set of parquet files, each with no more than 1,000,000 rows)
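All three methods in this routine emit the same three artifacts, so a single loading sketch covers them. The layout below follows the file annotations above, but the artifact root is an assumption; where artifacts are actually materialized depends on the hosting platform.

```python
import json
from pathlib import Path

import pandas as pd

# Illustrative root only -- where artifacts are materialized depends on the platform.
root = Path("artifacts_")

# list.json holds the selected target names as a plain JSON list.
targets = json.loads((root / "@list_of_target_names" / "data_" / "list.json").read_text())

# The two datasets are partitioned parquet files (<= 1,000,000 rows each);
# pandas reads the whole directory in one call.
full_df = pd.read_parquet(root / "@full_target_dataset" / "data_")
filtered_df = pd.read_parquet(root / "@filtered_target_dataset" / "data_")
```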
2. Semi Random Sample (Method)
- Method: semi_random_sample
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method easily supports substantial volumes of data. It has been tested on a 40K-target dataset containing 29M rows and 6 columns, completing in just 5 minutes. There is minimal difference in compute and runtime between using the target count input parameter and the dimension sub-selections input parameter.
- Outputs Dynamic Artifacts: No
- Short Description: Performs semi-random sampling of targets from a dataset based on user-defined dimensions and values.
- Detailed Description: This method enables sub-sampling of targets either through dimension-based filtering or broad random selection. If the user specifies dimension/value pairs along with a percentage, the method filters the dataset on each pair and randomly samples the specified percentage of targets from the matching group(s). If no dimension filters are provided, the method randomly selects a fixed number of unique targets from the entire dataset. This approach is useful for testing or prototyping with a smaller, representative subset of targets, without relying on clustering or significance metrics.
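A minimal pandas sketch of the two modes described above, assuming a long-format DataFrame with one row per (target, date) observation; the function name, argument shapes, and `(dimension, value, percentage)` triples are illustrative, not the routine's API.

```python
import pandas as pd


def semi_random_sample(
    df: pd.DataFrame,
    target_col: str,
    dims: list[tuple[str, str, float]] | None = None,
    num_targets: int | None = None,
    seed: int = 0,
) -> list[str]:
    if dims:
        # Dimension-based mode: for each (dimension, value, percentage) triple,
        # filter to matching rows and sample that percentage of the group's
        # unique targets.
        selected: set[str] = set()
        for dim, value, pct in dims:
            group = df.loc[df[dim] == value, target_col].drop_duplicates()
            n = max(1, int(len(group) * pct / 100))
            selected.update(group.sample(n=n, random_state=seed))
        return sorted(selected)

    # Broad mode: randomly pick a fixed number of unique targets overall.
    assert num_targets is not None, "provide num_targets when no filters are given"
    targets = df[target_col].drop_duplicates()
    return sorted(targets.sample(n=min(num_targets, len(targets)), random_state=seed))
```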
- Inputs:
  - Source Data Definition (Required): The source data definition.
    - Name: source_data_definition
    - Type: Must be an instance of Time Series Source Data
    - Nested Model: Time Series Source Data
    - Validation Constraints:
      - This input may be subject to other validation constraints at runtime.
    - Connection (Required): The connection to the source data.
      - Name: data_connection
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
      - Connection (Required): The connection type to use to access the source data.
        - Name: tabular_connection
        - Type: Must be one of the following: SQL Server Connection, MetaFileSystem Connection, Partitioned MetaFileSystem Connection
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
        - SQL Server Connection:
          - Database Resource (Required): The name of the database resource to connect to.
            - Name: database_resource
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Database Name (Required): The name of the database to connect to.
            - Name: database_name
            - Detail: Note: If you don't see the database name that you are looking for in this list, it is recommended that you first move the data to be used into a database that is available within this list.
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Table Name (Required): The name of the table to use.
            - Name: table_name
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
        - MetaFileSystem Connection:
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - File Path (Required): The full file path to the file to ingest.
            - Name: file_path
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
        - Partitioned MetaFileSystem Connection:
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - File Type (Required): The type of files to read from the directory.
            - Name: file_type
            - Type: FileExtensions_
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Directory Path (Required): The full directory path containing partitioned tabular files.
            - Name: directory_path
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
      - Dimension Columns (Required): The columns to use as dimensions.
        - Name: dimension_columns
        - Type: list[str]
        - Validation Constraints:
          - The input must have a minimum length of 1.
          - This input may be subject to other validation constraints at runtime.
      - Date Column (Required): The column to use as the date.
        - Name: date_column
        - Detail: The date column must be in a DateTime-readable format.
        - Type: str
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Value Column (Required): The column to use as the value.
        - Name: value_column
        - Detail: The value column must be a numeric (int, float, double, decimal, etc.) column.
        - Type: str
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
    - Dimension Sub-Selections (Required): The dimensions to sub-select and randomly sample.
      - Name: dimensions_subselections
      - Type: list[DimensionSubselection]
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
  - Number of Targets (Optional): The number of targets to include in the random sample.
    - Name: num_targets
    - Type: Optional[int]
    - Validation Constraints:
      - The input must be greater than 0.
      - This input may be subject to other validation constraints at runtime.
- Artifacts:
  - List of Target Names: The subselection of targets.
    - Qualified Key Annotation: list_of_target_names
    - Aggregate Artifact: False
    - In-Memory Json Accessible: True
    - File Annotations: artifacts_/@list_of_target_names/data_/list.json (a JSON list object stored in a JSON file)
  - Full Target Dataset: The original dataset with an additional column stating whether the target is included in the subselection sample.
    - Qualified Key Annotation: full_target_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@full_target_dataset/data_/data_<int>.parquet (a partitioned set of parquet files, each with no more than 1,000,000 rows)
  - Filtered Target Dataset: The original dataset filtered down to include only the selected targets.
    - Qualified Key Annotation: filtered_target_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@filtered_target_dataset/data_/data_<int>.parquet (a partitioned set of parquet files, each with no more than 1,000,000 rows)
3. Significance Breakdown Sample (Method)
- Method: significance_breakdown_sample
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method easily supports substantial volumes of data. It has been tested on a 40K-target dataset containing 29M rows and 6 columns, completing in just 5 minutes. There is minimal difference in compute and runtime between using the global significance sampling input parameter and the dimension-based significance sub-selections input parameter. These limits are very similar to those of the Semi Random Sample method.
- Outputs Dynamic Artifacts: No
- Short Description: Selects targets based on the significance of their aggregated values.
- Detailed Description: This method supports two types of significance-based sub-sampling:
  1. Dimension-Based Significance Subselection: Users provide specific dimension names, corresponding values, and a significance percentage. The method filters the dataset by each (dimension, value) pair, aggregates the value column, and selects the top targets whose cumulative values fall within the specified significance threshold.
  2. Global Significance Sampling: Users provide only a significance percentage. The method aggregates the value column across all targets and selects the top targets whose cumulative values fall within the threshold, regardless of dimension.
  Both approaches aim to retain the most impactful targets based on their contribution to the overall value metric.
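The global variant reduces to a cumulative-share cutoff over per-target aggregates, as the hedged pandas sketch below shows; the dimension-based variant applies the same cutoff after first filtering on each (dimension, value) pair. Function and column names are illustrative, not the routine's API.

```python
import pandas as pd


def global_significance_sample(
    df: pd.DataFrame, target_col: str, value_col: str, significance_pct: float
) -> list[str]:
    # Aggregate the value column per target and rank targets by contribution.
    totals = df.groupby(target_col)[value_col].sum().sort_values(ascending=False)

    # Cumulative share of the overall total, in percent.
    cum_share = totals.cumsum() / totals.sum() * 100

    # Keep each target whose *preceding* cumulative share is still below the
    # threshold, so the retained set covers at least `significance_pct`.
    keep = cum_share.shift(fill_value=0.0) < significance_pct
    return totals.index[keep].tolist()
```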
- Inputs:
  - Source Data Definition (Required): The source data definition.
    - Name: source_data_definition
    - Type: Must be an instance of Time Series Source Data
    - Nested Model: Time Series Source Data
    - Validation Constraints:
      - This input may be subject to other validation constraints at runtime.
    - Connection (Required): The connection to the source data.
      - Name: data_connection
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
      - Connection (Required): The connection type to use to access the source data.
        - Name: tabular_connection
        - Type: Must be one of the following: SQL Server Connection, MetaFileSystem Connection, Partitioned MetaFileSystem Connection
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
        - SQL Server Connection:
          - Database Resource (Required): The name of the database resource to connect to.
            - Name: database_resource
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Database Name (Required): The name of the database to connect to.
            - Name: database_name
            - Detail: Note: If you don't see the database name that you are looking for in this list, it is recommended that you first move the data to be used into a database that is available within this list.
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Table Name (Required): The name of the table to use.
            - Name: table_name
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
        - MetaFileSystem Connection:
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - File Path (Required): The full file path to the file to ingest.
            - Name: file_path
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
        - Partitioned MetaFileSystem Connection:
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - File Type (Required): The type of files to read from the directory.
            - Name: file_type
            - Type: FileExtensions_
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Directory Path (Required): The full directory path containing partitioned tabular files.
            - Name: directory_path
            - Type: str
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
      - Dimension Columns (Required): The columns to use as dimensions.
        - Name: dimension_columns
        - Type: list[str]
        - Validation Constraints:
          - The input must have a minimum length of 1.
          - This input may be subject to other validation constraints at runtime.
      - Date Column (Required): The column to use as the date.
        - Name: date_column
        - Detail: The date column must be in a DateTime-readable format.
        - Type: str
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Value Column (Required): The column to use as the value.
        - Name: value_column
        - Detail: The value column must be a numeric (int, float, double, decimal, etc.) column.
        - Type: str
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
    - Significance Breakdown (Required): The significance breakdown to use.
      - Name: significance_breakdown
      - Type: Must be one of the following: Broad Significance Sample, Significance Subselection Sample
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
      - Broad Significance Sample:
        - Significance Percentage (Required): The significance percentage of the targets to include in the sub-selection.
          - Name: significance_percentage
          - Type: int | float
          - Validation Constraints:
            - The input must be greater than 0.
            - The input must be less than or equal to 100.
            - This input may be subject to other validation constraints at runtime.
      - Significance Subselection Sample:
        - Significance Dimension Sub-Selections (Required): The dimensions to sub-select on.
          - Name: significance_dimension_subselections
          - Type: list[SignificanceDimensionSubselection]
          - Validation Constraints:
            - This input may be subject to other validation constraints at runtime.
- Artifacts:
  - List of Target Names: The subselection of targets.
    - Qualified Key Annotation: list_of_target_names
    - Aggregate Artifact: False
    - In-Memory Json Accessible: True
    - File Annotations: artifacts_/@list_of_target_names/data_/list.json (a JSON list object stored in a JSON file)
  - Full Target Dataset: The original dataset with an additional column stating whether the target is included in the subselection sample.
    - Qualified Key Annotation: full_target_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@full_target_dataset/data_/data_<int>.parquet (a partitioned set of parquet files, each with no more than 1,000,000 rows)
  - Filtered Target Dataset: The original dataset filtered down to include only the selected targets.
    - Qualified Key Annotation: filtered_target_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@filtered_target_dataset/data_/data_<int>.parquet (a partitioned set of parquet files, each with no more than 1,000,000 rows)
Interface Definitions
No interface definitions found for this routine.
Developer Docs
Routine Typename: RPETargetSubsampler
| Method Name | Artifact Keys |
|---|---|
| dynamic_time_warping | list_of_target_names, full_target_dataset, filtered_target_dataset |
| semi_random_sample | list_of_target_names, full_target_dataset, filtered_target_dataset |
| significance_breakdown_sample | list_of_target_names, full_target_dataset, filtered_target_dataset |