TimeSeriesCleaning
Versions
v1.0.0
Basic Information
Class Name: TimeSeriesCleaning
Title: Time Series Cleaning
Version: 1.0.0
Author: Evan Rasmussen
Organization: OneStream
Creation Date: 2024-04-28
Default Routine Memory Capacity: 2.0 GB
Tags
Data Cleansing, Time Series, Anomaly, Data Preprocessing
Description
Short Description
Clean anomalies from time series data with this routine.
Long Description
This routine is use to clean certain datapoints within a time series dataset. Some of the options to clean the anomalous data include cleaning the values to zero, cleaning the values to the mean, cleaning them with a Kalman filter, or using a statistical Arima model to predict values. The dataframe returned from this routine includes two columns representing the original value. One column will be appended with '_cleaned' (representing the column with all new values along with original values that were not changed) and another with '_original' (representing the value column exactly as it previously was).
Use Cases
1. Clean Anomalous Data
A user recently completed a run of the Anomaly Arena Routine on their dataset and successfully identified several true anomalies. The user wishes to update these values such that they are no longer considered anomalous. The user may navigate to the cleanse page of ADR (anomaly detection solution) in OneStream to configure the cleaning definitions for the identified anomalies. Since the user wishes to replace these bad datapoints with more representative values, they decide to utilize the Arima cleaning method for every identified anomaly (or every row of the cleaning manifest table they have configured within the solution). This Arima method will learn from datapoints prior to a given anomaly (identified by the cleaning manifest table) and predict a new value. Upon completion of this routine run, the user then navigates to the visualize page of ADR to analyze the results from the cleaning. It's possible the user notices some of the predictions do not align with what they expected. In this case, the user may revisit the cleaning definition page to make adjustments to the cleaning manifest table. Potential solutions may include changing the start or end dates of values to clean, changing the cleaning method, or even removing the row entirely. The user then clicks the cleanse button to kick off a new run of the Time Series Cleaning Routine. When the results are acceptable, the user can take the cleansed dataset and utilize it in their downstream processes, assured that it is a more accurate representation of their metrics.
Routine Methods
1. Clean Anomalies (Method)
- Method:
clean_anomalies-
Type: Method
-
Memory Capacity: 2.0 GB
-
Allow In-Memory Execution: No
-
Read Only: No
-
Method Limits: This method depends on both the source dataset and the cleaning manifest, and its runtime is primarily determined by the size of each. With 10,000 targets, 7.3 million source rows, and a manifest of ~20,000 cleaning actions, the method completes in a couple hours, but it fails when scaled to 15,000 targets, 10.95 million source rows, and ~30,000 manifest actions. All tests were conducted with 100 GB of memory. The error occurs roughly 2 hours after the data is successfully partitioned into subsets and displays as a worker lost error. Simple cleaning methods—such as mean, zero, and interpolate—run quickly, whereas statistical methods like ARIMA and Kalman are far slower and become the dominant factor in overall execution time as dataset size increases.
-
Outputs Dynamic Artifacts: No
-
Short Description:
- A method to clean anomalies from a time series dataset.
-
Detailed Description:
- This method takes a source data connection to clean values on and an accompanying cleaning manifest table definition containing relevant information about the values to clean and cleaning mechanisms to use. These mechanisms can be completely mixed and matched as the user sees fit. The key is that each row of the cleaning manifest table contains a value or range of values to clean, identified by date ranges and dimension values.
-
Inputs:
- Required Input
- Source Data Connection: The connection information source data.
- Name:
source_data_definition - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Must be an instance of Time Series Source Data
- Nested Model: Time Series Source Data
- Required Input
- Connection: The connection to the source data.
- Name:
data_connection - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Must be an instance of Tabular Connection
- Nested Model: Tabular Connection
- Required Input
- Connection: The connection type to use to access the source data.
- Name:
tabular_connection - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Must be one of the following
- SQL Server Connection
- Required Input
- Database Resource: The name of the database resource to connect to.
- Name:
database_resource - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Database Name: The name of the database to connect to.
- Name:
database_name - Tooltip:
- Detail:
- Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: str
- Name:
- Table Name: The name of the table to use.
- Name:
table_name - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Database Resource: The name of the database resource to connect to.
- Required Input
- MetaFileSystem Connection
- Required Input
- Connection Key: The MetaFileSystem connection key.
- Name:
connection_key - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: MetaFileSystemConnectionKey
- Name:
- File Path: The full file path to the file to ingest.
- Name:
file_path - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Connection Key: The MetaFileSystem connection key.
- Required Input
- Partitioned MetaFileSystem Connection
- Required Input
- Connection Key: The MetaFileSystem connection key.
- Name:
connection_key - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: MetaFileSystemConnectionKey
- Name:
- File Type: The type of files to read from the directory.
- Name:
file_type - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: FileExtensions_
- Name:
- Directory Path: The full directory path containing partitioned tabular files.
- Name:
directory_path - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Connection Key: The MetaFileSystem connection key.
- Required Input
- SQL Server Connection
- Name:
- Connection: The connection type to use to access the source data.
- Required Input
- Name:
- Dimension Columns: The columns to use as dimensions.
- Name:
dimension_columns - Tooltip:
- Validation Constraints:
- The input must have a minimum length of 1.
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: list[str]
- Name:
- Date Column: The column to use as the date.
- Name:
date_column - Tooltip:
- Detail:
- The date column must in a DateTime readable format.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: str
- Name:
- Value Column: The column to use as the value.
- Name:
value_column - Tooltip:
- Detail:
- The value column must be a numeric (int, float, double, decimal, etc.) column.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: str
- Name:
- Connection: The connection to the source data.
- Required Input
- Name:
- Cleaning Manifest Table Connection: The connection to the table defining the data to clean and the cleaning methods to apply.
- Name:
cleaning_manifest_definition - Long Description: This table is expected to have the following columns: 'CleaningMethod', 'StartDate', 'EndDate', 'Dim1', 'Dim2', 'Dim3', 'Dim4', 'Dim5', 'Dim6', 'Dim7', and 'Dim8'.
- Tooltip:
- Detail:
- This table is expected to have the following columns: 'CleaningMethod', 'StartDate', 'EndDate', 'Dim1', 'Dim2', 'Dim3', 'Dim4', 'Dim5', 'Dim6', 'Dim7', and 'Dim8'.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: Must be an instance of Tabular Connection
- Nested Model: Tabular Connection
- Required Input
- Connection: The connection type to use to access the source data.
- Name:
tabular_connection - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Must be one of the following
- SQL Server Connection
- Required Input
- Database Resource: The name of the database resource to connect to.
- Name:
database_resource - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Database Name: The name of the database to connect to.
- Name:
database_name - Tooltip:
- Detail:
- Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: str
- Name:
- Table Name: The name of the table to use.
- Name:
table_name - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Database Resource: The name of the database resource to connect to.
- Required Input
- MetaFileSystem Connection
- Required Input
- Connection Key: The MetaFileSystem connection key.
- Name:
connection_key - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: MetaFileSystemConnectionKey
- Name:
- File Path: The full file path to the file to ingest.
- Name:
file_path - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Connection Key: The MetaFileSystem connection key.
- Required Input
- Partitioned MetaFileSystem Connection
- Required Input
- Connection Key: The MetaFileSystem connection key.
- Name:
connection_key - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: MetaFileSystemConnectionKey
- Name:
- File Type: The type of files to read from the directory.
- Name:
file_type - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: FileExtensions_
- Name:
- Directory Path: The full directory path containing partitioned tabular files.
- Name:
directory_path - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: str
- Name:
- Connection Key: The MetaFileSystem connection key.
- Required Input
- SQL Server Connection
- Name:
- Connection: The connection type to use to access the source data.
- Required Input
- Name:
- Dimension Column Mapping: A mapping of the 'Dim1'-'Dim8' columns in the cleaning manifest table to the source data dimension columns.
- Name:
dimension_column_mapping - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Must be an instance of Cleaning Dimension Column Mapping
- Nested Model: Cleaning Dimension Column Mapping
- Optional Input
- Dim1: The name of the column from the source data that corresponds to Dim1 of the manifest table.
- Name:
Dim1 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim2: The name of the column from the source data that corresponds to Dim2 of the manifest table.
- Name:
Dim2 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim3: The name of the column from the source data that corresponds to Dim3 of the manifest table.
- Name:
Dim3 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim4: The name of the column from the source data that corresponds to Dim4 of the manifest table.
- Name:
Dim4 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim5: The name of the column from the source data that corresponds to Dim5 of the manifest table.
- Name:
Dim5 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim6: The name of the column from the source data that corresponds to Dim6 of the manifest table.
- Name:
Dim6 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim7: The name of the column from the source data that corresponds to Dim7 of the manifest table.
- Name:
Dim7 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim8: The name of the column from the source data that corresponds to Dim8 of the manifest table.
- Name:
Dim8 - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[str]
- Name:
- Dim1: The name of the column from the source data that corresponds to Dim1 of the manifest table.
- Optional Input
- Name:
- Custom Cleaning Definitions: Custom cleaning definitions are not currently supported.
- Name:
custom_cleaning_definitions - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: list[str]
- Name:
- Data to Return: The data to return after cleaning.
- Name:
data_to_return - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Must be one of the following
- Date Range Specific Data
- Optional Input
- Start Date: The start date for the data to return (inclusive).
- Name:
start_date - Tooltip:
- Detail:
- If no start date is provided, the data returned will include all data from the beginning of the source data to the end date.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: Optional[datetime]
- Name:
- End Date: The end date for the data to return (inclusive).
- Name:
end_date - Tooltip:
- Detail:
- If no end date is provided, the data returned will include all data from the start date to the end of the source data.
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Detail:
- Type: Optional[datetime]
- Name:
- Start Date: The start date for the data to return (inclusive).
- Optional Input
- Cleaned Data Only
- Required Input
- Cleaned Data Only: Return only source data that had values cleaned.
- Name:
cleaned_data_only - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Literal
- Name:
- Cleaned Data Only: Return only source data that had values cleaned.
- Required Input
- Full Source and Cleaned Data
- Required Input
- Full Source and Cleaned Data: Return both the source and cleaned data.
- Name:
full_source_and_cleaned_data - Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Literal
- Name:
- Full Source and Cleaned Data: Return both the source and cleaned data.
- Required Input
- Date Range Specific Data
- Name:
- Source Data Connection: The connection information source data.
- Optional Input
- Incremental Cleaning: Whether to perform incremental cleaning. If False, previously cleaned values are ignored in downstream calculations.
- Name:
incremental_cleaning - Long Description: Example: If set to True and a value is cleaned to the mean, this cleaned value will be included in any mean calculations to clean values from later dates, rather than the original value.
- Tooltip:
- Validation Constraints:
- This input may be subject to other validation constraints at runtime.
- Validation Constraints:
- Type: Optional[bool]
- Name:
- Incremental Cleaning: Whether to perform incremental cleaning. If False, previously cleaned values are ignored in downstream calculations.
- Required Input
-
Artifacts:
-
Cleaned Dataframe: The dataframe with updated values based on the cleaning definitions.
- Qualified Key Annotation:
cleaned_dataframe - Aggregate Artifact:
False - In-Memory Json Accessible:
False - File Annotations:
artifacts_/@cleaned_dataframe/data_/data_<int>.parquet- A partitioned set of parquet files where each file will have no more than 1000000 rows.
- Qualified Key Annotation:
-
Cleaning Summarization Report: A PDF summarization report overlaying original and cleaned values.
- Qualified Key Annotation:
cleaning_report - Aggregate Artifact:
False - In-Memory Json Accessible:
False - File Annotations:
artifacts_/@cleaning_report/data_/document.pdf- A pdf variant of the html file. Please note the interactivity that may be found in the html is lost within the pdf variant.
artifacts_/@cleaning_report/data_/html_content.html- The html content.
- Qualified Key Annotation:
-
-
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: TimeSeriesCleaning
| Method Name | Artifact Keys |
|---|---|
clean_anomalies | cleaned_dataframe, cleaning_report |