KalmanFilterV2
Versions
v1.0.0
Basic Information
Class Name: KalmanFilterV2
Title: Kalman Filter V2
Version: 1.0.0
Author: Ben DeGrieck, Evan Rasmussen
Organization: OneStream
Creation Date: 2024-07-26
Default Routine Memory Capacity: 2 GB
Tags
Data Cleansing, Data Preprocessing, Time Series, Regression, Statistics, Optimization
Description
Short Description
A Kalman Filter for cleansing time series data.
Long Description
The Kalman Filter excels in cleansing time series data by predicting and correcting estimates based on noisy measurements. It iteratively updates predictions as new data arrives, balancing the predicted state against new measurements to filter out noise and refine accuracy. This makes it ideal for applications like financial time series, where it filters erratic data to reveal underlying trends. By modeling both the process and measurement noise, the Kalman Filter efficiently separates the signal from the noise, enhancing data analysis and decision-making accuracy in dynamic environments.
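The predict-and-correct loop described above can be sketched in a few lines. This is a minimal illustration that assumes the simplest possible state model (a single "level" that follows a random walk); the function name, noise defaults, and sample data are hypothetical and are not taken from the routine, which may use a richer model.

```python
def kalman_filter_local_level(observations, process_noise=1e-2, observation_noise=1.0):
    """Filter a 1-D series with a local-level (random walk plus noise) model.

    Illustrative sketch only: names and default values are assumptions.
    """
    estimate = observations[0]  # initial state guess: the first measurement
    variance = 1.0              # initial uncertainty about the state
    filtered = []
    for z in observations:
        # Predict: the level carries over, uncertainty grows by the process noise.
        variance += process_noise
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + observation_noise)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        filtered.append(estimate)
    return filtered


noisy = [10.2, 9.7, 10.5, 9.9, 10.1, 14.0, 10.0, 9.8]
smooth = kalman_filter_local_level(noisy)
for raw, est in zip(noisy, smooth):
    print(f"{raw:6.2f} -> {est:6.2f}")
```

A larger `observation_noise` relative to `process_noise` makes the filter trust its prediction more, so spikes such as the 14.0 reading are damped more strongly.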
Use Cases
1. Handling ERP System Blackout Periods
Enterprise Resource Planning (ERP) systems often experience scheduled blackout periods for maintenance, leading to gaps in data collection. Kalman filters can effectively clean this disrupted data by predicting the missing values during blackout periods. By leveraging the filter’s ability to estimate unknown variables from noisy measurements, businesses can maintain the continuity and accuracy of their time series data. This ensures that subsequent forecasting models are not skewed by these intentional data gaps, providing more reliable insights for decision-making and operations.
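One way to picture gap filling during a blackout period: when an observation is missing, the filter skips the correction step and carries its prediction forward. The sketch below assumes the same simple local-level model; the function name and sample values are hypothetical, and the routine itself may produce different values because a smoother can also use data after the gap.

```python
def fill_gaps_local_level(observations, process_noise=1e-2, observation_noise=1.0):
    """Fill None entries with the filter's running estimate (illustrative sketch)."""
    # Assumption: at least one observation is present to seed the state.
    estimate = next(v for v in observations if v is not None)
    variance = 1.0
    filled = []
    for z in observations:
        variance += process_noise  # predict step: uncertainty grows over time
        if z is not None:
            # Correct only when a measurement actually exists.
            gain = variance / (variance + observation_noise)
            estimate += gain * (z - estimate)
            variance *= 1.0 - gain
        # Observed points are kept as-is; gaps receive the predicted level.
        filled.append(z if z is not None else estimate)
    return filled


series = [100.0, 102.0, None, None, 101.0, 99.0]  # None marks the blackout period
filled = fill_gaps_local_level(series)
print(filled)
```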
2. Addressing COVID-19 Data Anomalies in Forecasting
The COVID-19 pandemic introduced significant anomalies in time series data, such as sudden drops in consumer demand or spikes in healthcare resource utilization. Kalman filters can identify and filter out these extreme values, which do not represent normal patterns. By combining historical data trends with real-time observations, the Kalman filter helps mitigate the impact of these outliers, allowing time series forecasting models to generate more accurate predictions. This is crucial for planning and resource allocation in uncertain times.
3. Cleaning Point-Based Anomalies with Kalman Filters
Time series data often contains point-based anomalies, such as sudden spikes or drops due to errors or unusual events. Kalman filters can detect and correct these anomalies by comparing observed values with expected values based on the underlying trend and seasonal patterns. By continuously updating its estimates, the Kalman filter can filter out these irregularities, ensuring the integrity of the data used for forecasting. This process enhances the accuracy of predictive models, making them more robust against unexpected fluctuations in the data.
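The comparison of observed and expected values can be illustrated with an innovation threshold: a point is flagged when it deviates from the filter's prediction by more than a few standard deviations. This is a common heuristic shown here under a local-level assumption; it is not necessarily how the routine detects anomalies, and the function name, threshold, and data are hypothetical.

```python
import math


def clean_point_anomalies(observations, process_noise=1e-2,
                          observation_noise=0.25, threshold=3.0):
    """Replace points that deviate strongly from the prediction (illustrative)."""
    estimate, variance = observations[0], 1.0
    cleaned, flagged = [], []
    for i, z in enumerate(observations):
        variance += process_noise            # predict step
        innovation = z - estimate            # observed minus expected
        innovation_std = math.sqrt(variance + observation_noise)
        if abs(innovation) > threshold * innovation_std:
            # Treat as an anomaly: keep the prediction and skip the correction.
            flagged.append(i)
            cleaned.append(estimate)
            continue
        gain = variance / (variance + observation_noise)
        estimate += gain * innovation
        variance *= 1.0 - gain
        cleaned.append(z)
    return cleaned, flagged


values = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.8]
cleaned, flagged = clean_point_anomalies(values)
print(flagged)   # indices treated as point anomalies
print(cleaned)
```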
Routine Methods
1. Init (Constructor)
Method: __init__
Type: Constructor
Memory Capacity: 2.0 GB
Allow In-Memory Execution: No
Read Only: No
Method Limits: This constructor method only initializes the routine instance and does not process data, so it has no performance limits.
Outputs Dynamic Artifacts: No
Short Description: Constructor method for the Kalman Filter V2 Routine.
Detailed Description: Define an instance of the Kalman Filter V2 Routine with the specified input parameters.
Inputs:
- Required Input
  - Configuration Method: Decide if the hyperparameters will be automatically optimized or manually inputted.
    - Name: configuration_method
    - Tooltip:
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: Must be one of the following
      - Auto Hyperparameter Configuration
        - Required Input
          - Auto Hyperparameters: Confirm the selection for letting the routine automatically determine the optimal hyperparameters by testing variations of hyperparameter configurations on subsets of the data.
            - Name: auto
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Literal
      - Manual Hyperparameter Configuration
        - Required Input
          - Component: Configure the component to be used in the KalmanSmoother. The options represent the patterns and dynamics present in the data.
            - Name: component
            - Tooltip:
              - Detail:
                - Each underscore-separated word in an option is its own piece of the component.
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: ComponentOptions_
        - Optional Input
          - Seasons: The period of the seasonal component of the data.
            - Name: n_seasons
            - Tooltip:
              - Detail:
                - The n_seasons parameter in the KalmanSmoother class is used to define the number of observations that complete one seasonal cycle. For example, for monthly data with a pattern that repeats every year, n_seasons may be set to 12 (12 months in a year). Similarly, for daily data that exhibits a weekly pattern, n_seasons would be 7.
              - Validation Constraints:
                - The input must be greater than or equal to 2.
                - The input must be less than or equal to 12.
                - This input may be subject to other validation constraints at runtime.
            - Type: Optional[int]
          - Long Seasons: The period of the long seasonal component of the data.
            - Name: n_longseasons
            - Tooltip:
              - Detail:
                - The n_longseasons parameter in the KalmanSmoother class is used to define the number of observations that complete one long-term seasonal cycle. For instance, with monthly data that exhibits a long-term pattern that repeats every five years, one may set n_longseasons to 60 (12 months in a year times 5 years). Similarly, with weekly data maintaining a pattern that repeats every two years, n_longseasons would be 104 (52 weeks in a year multiplied by 2 years).
              - Validation Constraints:
                - The input must be greater than or equal to 2.
                - This input may be subject to other validation constraints at runtime.
            - Type: Optional[int]
          - Observation Noise: The noise level generated by the data measurement.
            - Name: observation_noise
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Optional[float]
          - Component Noise Dictionary: Define the noise level for each component in the data.
            - Name: component_noise
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Must be an instance of Component Noise Configuration
            - Nested Model: Component Noise Configuration
              - Required Input
                - Level Noise: The noise value for the level component.
                  - Name: level
                  - Tooltip:
                    - Detail:
                      - Note that the 'level' component is the base level of the data and will always be included in the 'component' parameter.
                    - Validation Constraints:
                      - The input must be greater than 0.0.
                      - This input may be subject to other validation constraints at runtime.
                  - Type: float
                - Trend Noise: The noise value for the trend component.
                  - Name: trend
                  - Tooltip:
                    - Detail:
                      - If 'trend' is not included in the 'component' parameter from step two, this value will be ignored.
                    - Validation Constraints:
                      - The input must be greater than 0.0.
                      - This input may be subject to other validation constraints at runtime.
                  - Type: float
                - Season Noise: The noise value for the season component.
                  - Name: season
                  - Tooltip:
                    - Detail:
                      - If 'season' is not included in the 'component' parameter from step two, this value will be ignored.
                    - Validation Constraints:
                      - The input must be greater than 0.0.
                      - This input may be subject to other validation constraints at runtime.
                  - Type: float
                - Longseason Noise: The noise value for the longseason component.
                  - Name: longseason
                  - Tooltip:
                    - Detail:
                      - If 'longseason' is not included in the 'component' parameter from step two, this value will be ignored.
                    - Validation Constraints:
                      - The input must be greater than 0.0.
                      - This input may be subject to other validation constraints at runtime.
                  - Type: float
Artifacts: No artifacts are returned by this method
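The documented constraints for a manual hyperparameter configuration can be checked before submitting a job. The sketch below mirrors the input names listed for the constructor (component, n_seasons, n_longseasons, observation_noise, component_noise), but the dataclasses, the validation helper, and the example component value "level_trend_season" are hypothetical stand-ins, not the routine's actual classes or a confirmed option.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentNoise:
    # Field names follow the documented Component Noise Configuration inputs.
    level: float
    trend: float
    season: float
    longseason: float


@dataclass
class ManualConfig:
    component: str  # assumed example value below; valid options are not listed here
    component_noise: ComponentNoise
    n_seasons: Optional[int] = None
    n_longseasons: Optional[int] = None
    observation_noise: Optional[float] = None


def validate_manual_config(config: ManualConfig) -> list:
    """Return a list of violations of the documented validation constraints."""
    errors = []
    if config.n_seasons is not None and not 2 <= config.n_seasons <= 12:
        errors.append("n_seasons must be between 2 and 12")
    if config.n_longseasons is not None and config.n_longseasons < 2:
        errors.append("n_longseasons must be at least 2")
    for name in ("level", "trend", "season", "longseason"):
        if getattr(config.component_noise, name) <= 0.0:
            errors.append(f"component_noise.{name} must be greater than 0.0")
    return errors


# Monthly data: a yearly cycle (12) and a five-year long cycle (12 * 5 = 60).
config = ManualConfig(
    component="level_trend_season",
    component_noise=ComponentNoise(level=0.1, trend=0.1, season=0.1, longseason=0.1),
    n_seasons=12,
    n_longseasons=12 * 5,
)
print(validate_manual_config(config))
```

The routine may apply further constraints at runtime, so passing this check does not guarantee that the configuration is accepted.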
2. Fit (Method)
Method: fit
Type: Method
Memory Capacity: 2.0 GB
Allow In-Memory Execution: No
Read Only: No
Method Limits: Memory usage scales with dataset size and number of targets. For a dataset with 500 targets and ~500K rows, this method is expected to complete in around 21 minutes with 10 GB of memory allocated. For datasets with 7K-10K targets and ~1M rows, this method may time out or require 40 GB or more of memory allocated.
Outputs Dynamic Artifacts: No
Short Description: Fit the Kalman Filter model to the input data.
Detailed Description: If the user has enabled auto-configuration, this method searches for and saves the optimal hyperparameters for each target in the dataset. Otherwise, if the user has manually configured the hyperparameters, the method stores the static hyperparameters provided.
Inputs:
- Required Input
  - Source Data Definition: The time series source data definition.
    - Name: source_data_definition
    - Tooltip:
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: Must be an instance of Time Series Source Data
    - Nested Model: Time Series Source Data
      - Required Input
        - Connection: The connection to the source data.
          - Name: data_connection
          - Tooltip:
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Type: Must be an instance of Tabular Connection
          - Nested Model: Tabular Connection
            - Required Input
              - Connection: The connection type to use to access the source data.
                - Name: tabular_connection
                - Tooltip:
                  - Validation Constraints:
                    - This input may be subject to other validation constraints at runtime.
                - Type: Must be one of the following
                  - SQL Server Connection
                    - Required Input
                      - Database Resource: The name of the database resource to connect to.
                        - Name: database_resource
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                      - Database Name: The name of the database to connect to.
                        - Name: database_name
                        - Tooltip:
                          - Detail:
                            - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                      - Table Name: The name of the table to use.
                        - Name: table_name
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                  - MetaFileSystem Connection
                    - Required Input
                      - Connection Key: The MetaFileSystem connection key.
                        - Name: connection_key
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: MetaFileSystemConnectionKey
                      - File Path: The full file path to the file to ingest.
                        - Name: file_path
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                  - Partitioned MetaFileSystem Connection
                    - Required Input
                      - Connection Key: The MetaFileSystem connection key.
                        - Name: connection_key
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: MetaFileSystemConnectionKey
                      - File Type: The type of files to read from the directory.
                        - Name: file_type
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: FileExtensions_
                      - Directory Path: The full directory path containing partitioned tabular files.
                        - Name: directory_path
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
        - Dimension Columns: The columns to use as dimensions.
          - Name: dimension_columns
          - Tooltip:
            - Validation Constraints:
              - The input must have a minimum length of 1.
              - This input may be subject to other validation constraints at runtime.
          - Type: list[str]
        - Date Column: The column to use as the date.
          - Name: date_column
          - Tooltip:
            - Detail:
              - The date column must be in a DateTime readable format.
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Type: str
        - Value Column: The column to use as the value.
          - Name: value_column
          - Tooltip:
            - Detail:
              - The value column must be a numeric (int, float, double, decimal, etc.) column.
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Type: str
Artifacts: No artifacts are returned by this method
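Before calling fit, it can help to confirm that the source table satisfies the documented column requirements: at least one dimension column, a parseable date column, and a numeric value column. The sketch below is a hypothetical pre-flight check on rows represented as dictionaries; the helper name and sample rows are illustrative, and it assumes ISO 8601 dates, which is narrower than the "DateTime readable format" the routine accepts.

```python
from datetime import datetime
from numbers import Number


def preflight_check(rows, dimension_columns, date_column, value_column):
    """Return a list of problems found in the rows (illustrative helper)."""
    problems = []
    if len(dimension_columns) < 1:
        problems.append("dimension_columns needs at least one column")
    for i, row in enumerate(rows):
        try:
            # Assumption: dates are ISO 8601 strings such as 2024-01-31.
            datetime.fromisoformat(str(row[date_column]))
        except ValueError:
            problems.append(f"row {i}: unreadable date {row[date_column]!r}")
        value = row[value_column]
        # None is allowed here because missing values are what the filter cleans.
        if value is not None and (isinstance(value, bool) or not isinstance(value, Number)):
            problems.append(f"row {i}: non-numeric value {value!r}")
    return problems


rows = [
    {"store": "A", "date": "2024-01-31", "sales": 120.5},
    {"store": "A", "date": "2024-02-29", "sales": None},
    {"store": "B", "date": "31/01/2024", "sales": "n/a"},
]
problems = preflight_check(rows, ["store"], "date", "sales")
print(problems)
```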
3. Predict (Method)
Method: predict
Type: Method
Memory Capacity: 2.0 GB
Allow In-Memory Execution: No
Read Only: No
Method Limits: Memory usage scales with dataset size and number of targets. For a dataset with 500 targets, this method is expected to complete in around 18 minutes with 10 GB of memory allocated. For datasets with 10K targets, this method may encounter memory errors with 10 GB or 15 GB of memory allocated.
Outputs Dynamic Artifacts: No
Short Description: Cleanse user-specified data with a Kalman Filter.
Detailed Description: Configure how values within the dataset should be cleaned by defining the clean type and dimension filters. Users may choose to clean any values, in which case the Kalman Filter smooths all values (missing or not) within the specified date range and dimension filters. Alternatively, users may choose to clean only missing values, in which case the Kalman Filter smooths only values that are absent from the dataset or present as nulls. Note that in this case only missing values within the specified date range and dimension filters are added and cleaned. The output dataframe includes a column containing the original values, named with the value column name appended with 'Original', and a separate column containing the cleaned values, named with the value column name appended with 'Kalman Filtered'. The Plotly graph shows the original and cleaned values across the full range of the data.
Inputs:
- Required Input
  - Source Data Definition: The time series source data definition.
    - Name: source_data_definition
    - Tooltip:
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: Must be an instance of Time Series Source Data
    - Nested Model: Time Series Source Data
      - Required Input
        - Connection: The connection to the source data.
          - Name: data_connection
          - Tooltip:
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Type: Must be an instance of Tabular Connection
          - Nested Model: Tabular Connection
            - Required Input
              - Connection: The connection type to use to access the source data.
                - Name: tabular_connection
                - Tooltip:
                  - Validation Constraints:
                    - This input may be subject to other validation constraints at runtime.
                - Type: Must be one of the following
                  - SQL Server Connection
                    - Required Input
                      - Database Resource: The name of the database resource to connect to.
                        - Name: database_resource
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                      - Database Name: The name of the database to connect to.
                        - Name: database_name
                        - Tooltip:
                          - Detail:
                            - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                      - Table Name: The name of the table to use.
                        - Name: table_name
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                  - MetaFileSystem Connection
                    - Required Input
                      - Connection Key: The MetaFileSystem connection key.
                        - Name: connection_key
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: MetaFileSystemConnectionKey
                      - File Path: The full file path to the file to ingest.
                        - Name: file_path
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
                  - Partitioned MetaFileSystem Connection
                    - Required Input
                      - Connection Key: The MetaFileSystem connection key.
                        - Name: connection_key
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: MetaFileSystemConnectionKey
                      - File Type: The type of files to read from the directory.
                        - Name: file_type
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: FileExtensions_
                      - Directory Path: The full directory path containing partitioned tabular files.
                        - Name: directory_path
                        - Tooltip:
                          - Validation Constraints:
                            - This input may be subject to other validation constraints at runtime.
                        - Type: str
        - Dimension Columns: The columns to use as dimensions.
          - Name: dimension_columns
          - Tooltip:
            - Validation Constraints:
              - The input must have a minimum length of 1.
              - This input may be subject to other validation constraints at runtime.
          - Type: list[str]
        - Date Column: The column to use as the date.
          - Name: date_column
          - Tooltip:
            - Detail:
              - The date column must be in a DateTime readable format.
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Type: str
        - Value Column: The column to use as the value.
          - Name: value_column
          - Tooltip:
            - Detail:
              - The value column must be a numeric (int, float, double, decimal, etc.) column.
            - Validation Constraints:
              - This input may be subject to other validation constraints at runtime.
          - Type: str
  - Acceptable Cleaning Values: 'Any' cleans any values (missing or not) within the input time range; 'Missing' cleans only missing values within the time range.
    - Name: clean_type
    - Tooltip:
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: KalmanFilterAllowedCleanValues_
  - Dimension Filters: The dimensions within the data to apply the filter to.
    - Name: dimension_filters
    - Tooltip:
      - Detail:
        - Leave as None if you want to apply the filter to all of the data.
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: list[DimensionFilterParameters]
- Optional Input
  - Start Date: The start date of the data to filter.
    - Name: start_date
    - Tooltip:
      - Detail:
        - The Start Date should be earlier than the End Date if used. If unused, the earliest date found in the dataset will be used.
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: Optional[datetime]
  - End Date: The end date of the data to filter.
    - Name: end_date
    - Tooltip:
      - Detail:
        - The End Date should be later than the Start Date if used. If unused, the latest date found in the dataset will be used.
      - Validation Constraints:
        - This input may be subject to other validation constraints at runtime.
    - Type: Optional[datetime]
Artifacts:
- Cleansed Data: The Kalman cleansed data including the original values and the Kalman filtered values.
  - Qualified Key Annotation: cleansed_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@cleansed_data/data_/data_<int>.parquet
      - A partitioned set of parquet files where each file will have no more than 1000000 rows.
- New vs Old Plot: The plot overlaying the summed original values and the summed Kalman filtered values over time.
  - Qualified Key Annotation: new_v_old_plot
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@new_v_old_plot/data_/plotly.pkl
      - A python plotly figure stored in a python pickle file. Note, this is a binary file type and is not readable in .NET.
    - artifacts_/@new_v_old_plot/data_/plotly.html
      - An interactive html representation of the plotly figure.
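The difference between the two clean types described for the predict method can be sketched as a selection rule over original and smoothed values. The helper below is a hypothetical illustration: the function name and data are made up, the strings "Any" and "Missing" are taken from the input description, and the smoothed values are supplied directly rather than computed by a filter.

```python
from datetime import date


def apply_clean_type(dates, original, smoothed, clean_type,
                     start_date=None, end_date=None):
    """Pick cleaned values per the 'Any' / 'Missing' semantics (illustrative)."""
    if clean_type not in ("Any", "Missing"):
        raise ValueError("clean_type must be 'Any' or 'Missing'")
    # Unused bounds default to the earliest and latest dates in the data.
    start = start_date or min(dates)
    end = end_date or max(dates)
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    cleaned = []
    for d, orig, smooth in zip(dates, original, smoothed):
        in_range = start <= d <= end
        if in_range and (clean_type == "Any" or orig is None):
            cleaned.append(smooth)   # replaced by the filtered value
        else:
            cleaned.append(orig)     # left untouched (may still be missing)
    return cleaned


dates = [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)]
original = [10.0, None, 30.0, None]
smoothed = [11.0, 19.0, 29.0, 41.0]

missing_only = apply_clean_type(dates, original, smoothed, "Missing",
                                end_date=date(2024, 3, 1))
any_values = apply_clean_type(dates, original, smoothed, "Any",
                              start_date=date(2024, 2, 1))
print(missing_only)
print(any_values)
```

In the "Missing" call, the April gap stays empty because it falls outside the requested date range, which matches the note that only missing values inside the range are added and cleaned.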
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: KalmanFilterV2
| Method Name | Artifact Keys |
|---|---|
| `__init__` | N/A |
| `fit` | N/A |
| `predict` | cleansed_data, new_v_old_plot |