KalmanFilter
Versions
v1.0.0
Basic Information
Class Name: KalmanFilter
Title: Kalman Filter
Version: 1.0.0
Author: Luke Heberling, Drew Shea
Organization: OneStream
Creation Date: 2024-06-18
Default Routine Memory Capacity: 2 GB
Tags
Data Cleansing, Data Preprocessing, Time Series, Regression, Statistics
Description
Short Description
A Kalman Filter for cleansing time series data.
Long Description
The Kalman Filter excels in cleansing time series data by predicting and correcting estimates based on noisy measurements. It iteratively updates predictions as new data arrives, balancing the predicted state against new measurements to filter out noise and refine accuracy. This makes it ideal for applications like financial time series, where it smooths erratic data to reveal underlying trends. By modeling both the process and measurement noise, the Kalman Filter efficiently separates the signal from the noise, enhancing data analysis and decision-making accuracy in dynamic environments.
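The predict/correct cycle described above can be sketched in its simplest form: a scalar random-walk state with Gaussian noise. This is an illustrative sketch only, not this routine's implementation; the variances `q` and `r` are made-up values for demonstration.

```python
import numpy as np

def kalman_filter_1d(z, q=1e-3, r=0.1):
    """Scalar Kalman filter for a random-walk state.

    z : 1-D array of noisy measurements
    q : process-noise variance (how fast the true level can drift)
    r : measurement-noise variance
    Returns the filtered state estimates.
    """
    x, p = z[0], 1.0            # initial state estimate and its variance
    out = np.empty(len(z))
    out[0] = x
    for t in range(1, len(z)):
        # Predict: the state is assumed to persist, but uncertainty grows by q.
        p = p + q
        # Correct: blend prediction and measurement via the Kalman gain.
        k = p / (p + r)
        x = x + k * (z[t] - x)
        p = (1 - k) * p
        out[t] = x
    return out
```

With a small `q` relative to `r`, the filter trusts its prediction more than any single measurement, so erratic observations are heavily smoothed; raising `q` makes the filter track fast changes more closely at the cost of passing through more noise.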
Use Cases
1. Handling ERP System Blackout Periods
Enterprise Resource Planning (ERP) systems often experience scheduled blackout periods for maintenance, leading to gaps in data collection. Kalman filters can effectively clean this disrupted data by predicting the missing values during blackout periods. By leveraging the filter’s ability to estimate unknown variables from noisy measurements, businesses can maintain the continuity and accuracy of their time series data. This ensures that subsequent forecasting models are not skewed by these intentional data gaps, providing more reliable insights for decision-making and operations.
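As a sketch of this idea (again, not the routine's actual implementation), a forward-only filter can carry its prediction through a blackout gap: when a measurement is missing, the correction step is simply skipped and the prediction stands in for the observation. The variances are illustrative assumptions; note that the routine's smoother can also use data from after the gap, which this forward-only sketch cannot.

```python
import numpy as np

def kalman_impute(z, q=1e-2, r=0.05):
    """Fill NaN gaps (e.g., a blackout window) in a series.

    Observed values are kept as-is; during a gap there is no
    measurement, so the correction step is skipped and the
    prediction alone carries the estimate forward.
    """
    z = np.asarray(z, dtype=float)
    x, p = np.nanmean(z), 1.0
    filled = z.copy()
    for t in range(len(z)):
        p = p + q                       # predict
        if np.isnan(z[t]):
            filled[t] = x               # no measurement: use the prediction
        else:
            k = p / (p + r)             # correct against the observation
            x = x + k * (z[t] - x)
            p = (1 - k) * p
            filled[t] = z[t]
    return filled
```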
2. Addressing COVID-19 Data Anomalies in Forecasting
The COVID-19 pandemic introduced significant anomalies in time series data, such as sudden drops in consumer demand or spikes in healthcare resource utilization. Kalman filters can be employed to smooth out these anomalies, identifying and filtering out extreme values that do not represent normal patterns. By integrating historical data trends and real-time observations, the Kalman filter helps in mitigating the impact of these outliers, allowing time series forecasting models to generate more accurate predictions. This is crucial for planning and resource allocation in uncertain times.
3. Cleaning Point-Based Anomalies with Kalman Filters
Time series data often contains point-based anomalies, such as sudden spikes or drops due to errors or unusual events. Kalman filters can detect and correct these anomalies by comparing observed values with expected values based on the underlying trend and seasonal patterns. By continuously updating its estimates, the Kalman filter can smooth out these irregularities, ensuring the integrity of the data used for forecasting. This process enhances the accuracy of predictive models, making them more robust against unexpected fluctuations in the data.
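One common way to realize this comparison of observed versus expected values is innovation gating: if the gap between a measurement and the filter's prediction exceeds a few standard deviations of the predicted innovation variance, the point is treated as an anomaly and replaced by the model's estimate. The sketch below illustrates the technique under assumed variances and a 3-sigma threshold; it is not this routine's implementation.

```python
import numpy as np

def kalman_clean(z, q=1e-3, r=0.1, n_sigma=3.0):
    """Replace point anomalies whose innovation (measurement minus
    prediction) exceeds n_sigma standard deviations of the
    predicted innovation variance."""
    z = np.asarray(z, dtype=float)
    x, p = z[0], 1.0
    cleaned = z.copy()
    for t in range(1, len(z)):
        p = p + q                       # predict
        s = p + r                       # innovation variance
        innov = z[t] - x
        if abs(innov) > n_sigma * np.sqrt(s):
            cleaned[t] = x              # anomaly: keep the model's estimate
        else:
            k = p / s                   # normal point: standard update
            x = x + k * innov
            p = (1 - k) * p
    return cleaned
```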
Routine Methods
1. Kalman Filter (Method)
- Method Name: kalman_filter
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: With 100 GB of memory, this method completes in a couple of hours on data containing 25,000 targets and 7.4 million rows, and in several hours when scaled to 27,500 targets and 20.1 million rows. Runtime grows significantly with both the number of targets and the total row volume, as larger datasets drive higher computational and memory demands.
- Outputs Dynamic Artifacts: No
- Short Description: Filter the provided input data.
- Detailed Description: Run a Kalman smoother against the provided input data to impute missing values from the data.
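A Kalman smoother differs from a plain filter in that it makes a backward pass after filtering, so each estimate uses observations from both before and after it; this is what lets missing values be interpolated rather than merely extrapolated. The sketch below shows a scalar Rauch-Tung-Striebel smoother for a random-walk state; it is a simplified illustration with assumed variances, not this routine's implementation.

```python
import numpy as np

def rts_smooth(z, q=1e-2, r=0.1):
    """Forward Kalman filter plus backward Rauch-Tung-Striebel pass
    for a scalar random-walk state. NaNs in z are treated as missing
    measurements and come out imputed by the smoothed state."""
    n = len(z)
    xf = np.empty(n); pf = np.empty(n)   # filtered (posterior) mean/variance
    xp = np.empty(n); pp = np.empty(n)   # predicted (prior) mean/variance
    x, p = 0.0, 1e3                      # vague initialization
    for t in range(n):
        xp[t], pp[t] = x, p + q          # predict
        x, p = xp[t], pp[t]
        if not np.isnan(z[t]):           # correct only when observed
            k = p / (p + r)
            x = x + k * (z[t] - x)
            p = (1 - k) * p
        xf[t], pf[t] = x, p
    xs = xf.copy()                       # backward smoothing pass
    for t in range(n - 2, -1, -1):
        g = pf[t] / pp[t + 1]            # smoother gain
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
    return xs
```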
Inputs:
Note: Every input below may be subject to additional validation constraints at runtime.
- Required Input
  - Source Data Definition: The source data definition.
    - Name: source_data_definition
    - Type: Must be an instance of Time Series Source Data
    - Nested Model: Time Series Source Data
      - Connection: The connection to the source data. (Required)
        - Name: data_connection
        - Type: Must be an instance of Tabular Connection
        - Nested Model: Tabular Connection
          - Connection: The connection type to use to access the source data. (Required)
            - Name: tabular_connection
            - Type: Must be one of the following
            - SQL Server Connection (all fields required)
              - Database Resource: The name of the database resource to connect to.
                - Name: database_resource
                - Type: str
              - Database Name: The name of the database to connect to.
                - Name: database_name
                - Type: str
                - Note: If you don't see the database name that you are looking for in this list, it is recommended that you first move the data into a database that is available within this list.
              - Table Name: The name of the table to use.
                - Name: table_name
                - Type: str
            - MetaFileSystem Connection (all fields required)
              - Connection Key: The MetaFileSystem connection key.
                - Name: connection_key
                - Type: MetaFileSystemConnectionKey
              - File Path: The full file path to the file to ingest.
                - Name: file_path
                - Type: str
            - Partitioned MetaFileSystem Connection (all fields required)
              - Connection Key: The MetaFileSystem connection key.
                - Name: connection_key
                - Type: MetaFileSystemConnectionKey
              - File Type: The type of files to read from the directory.
                - Name: file_type
                - Type: FileExtensions_
              - Directory Path: The full directory path containing partitioned tabular files.
                - Name: directory_path
                - Type: str
      - Dimension Columns: The columns to use as dimensions. (Required)
        - Name: dimension_columns
        - Type: list[str]
        - Constraint: The input must have a minimum length of 1.
      - Date Column: The column to use as the date. (Required)
        - Name: date_column
        - Type: str
        - Detail: The date column must be in a DateTime-readable format.
      - Value Column: The column to use as the value. (Required)
        - Name: value_column
        - Type: str
        - Detail: The value column must be a numeric (int, float, double, decimal, etc.) column.
- Required Input
  - Acceptable Cleaning Values: The types of values allowed to be cleaned in the date range.
    - Name: clean_type
    - Type: KalmanFilterAllowedCleanValues
- Optional Input
  - Start Date: The start date of the data to filter.
    - Name: start_date
    - Type: Optional[datetime]
    - Detail: If provided, the start date should be earlier than 'end_date'. If left as 'None', the earliest date found in the dataset will be used.
  - End Date: The end date of the data to filter.
    - Name: end_date
    - Type: Optional[datetime]
    - Detail: If provided, the end date should be later than 'start_date'. If left as 'None', the latest date found in the dataset will be used.
  - Number of Seasons: The number of seasons in the data.
    - Name: n_seasons
    - Type: Optional[int]
    - Constraint: The input must be greater than or equal to 2.
    - Detail: The n_seasons parameter defines the number of observations that complete one seasonal cycle, which matters when the time series exhibits seasonal variation; essentially, n_seasons tells the smoother how often the pattern repeats over time. For example, monthly data with a pattern that repeats every year would use n_seasons = 12, while daily data with a weekly pattern would use 5-7.
Artifacts:
1. Cleansed Data: The cleansed data.
- Qualified Key Annotation: cleansed_data
- Aggregate Artifact: False
- In-Memory Json Accessible: False
- File Annotations:
  - artifacts_/@cleansed_data/data_/data_<int>.parquet - A partitioned set of parquet files, each containing no more than 1,000,000 rows.
2. New vs Old Plot: The new vs old plot.
- Qualified Key Annotation: new_v_old_plot
- Aggregate Artifact: False
- In-Memory Json Accessible: False
- File Annotations:
  - artifacts_/@new_v_old_plot/data_/plotly.pkl - A Python plotly figure stored in a Python pickle file. Note: this is a binary file type and is not readable in .NET.
  - artifacts_/@new_v_old_plot/data_/plotly.html - An interactive HTML representation of the plotly figure.
Interface Definitions
No interface definitions found for this routine.
Developer Docs
Routine Typename: KalmanFilter
| Method Name | Artifact Keys |
|---|---|
| kalman_filter | cleansed_data, new_v_old_plot |