KalmanFilter

Versions

v1.0.0

Basic Information

Class Name: KalmanFilter

Title: Kalman Filter

Version: 1.0.0

Author: Luke Heberling, Drew Shea

Organization: OneStream

Creation Date: 2024-06-18

Default Routine Memory Capacity: 2 GB

Tags

Data Cleansing, Data Preprocessing, Time Series, Regression, Statistics

Description

Short Description

A Kalman Filter for cleansing time series data.

Long Description

The Kalman Filter excels in cleansing time series data by predicting and correcting estimates based on noisy measurements. It iteratively updates predictions as new data arrives, balancing the predicted state against new measurements to filter out noise and refine accuracy. This makes it ideal for applications like financial time series, where it smooths erratic data to reveal underlying trends. By modeling both the process and measurement noise, the Kalman Filter efficiently separates the signal from the noise, enhancing data analysis and decision-making accuracy in dynamic environments.
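
To make the predict/correct cycle concrete, here is a minimal sketch of a one-dimensional Kalman filter over a local-level (random-walk) model. The state-space model and noise settings are illustrative assumptions only and do not reflect this routine's internal implementation.

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-3, measurement_var=1e-1):
    """One-dimensional Kalman filter over a local-level (random-walk) model.

    Illustrative only: the state is a single level that is carried forward at
    each step (predict) and then blended with the noisy measurement (correct).
    """
    x, p = measurements[0], 1.0          # initial state estimate and variance
    estimates = []
    for z in measurements:
        # Predict: the level carries forward, uncertainty grows by the process noise.
        p = p + process_var
        # Correct: blend prediction and measurement using the Kalman gain.
        k = p / (p + measurement_var)    # Kalman gain
        x = x + k * (z - x)              # corrected state estimate
        p = (1 - k) * p                  # corrected variance
        estimates.append(x)
    return np.array(estimates)

noisy = np.sin(np.linspace(0, 6, 100)) + np.random.normal(0, 0.3, 100)
smoothed = kalman_1d(noisy)
```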

Use Cases

1. Handling ERP System Blackout Periods

Enterprise Resource Planning (ERP) systems often experience scheduled blackout periods for maintenance, leading to gaps in data collection. Kalman filters can effectively clean this disrupted data by predicting the missing values during blackout periods. By leveraging the filter’s ability to estimate unknown variables from noisy measurements, businesses can maintain the continuity and accuracy of their time series data. This ensures that subsequent forecasting models are not skewed by these intentional data gaps, providing more reliable insights for decision-making and operations.
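
The routine's internal model is not documented here, but the general idea of bridging a blackout window can be sketched with statsmodels' state-space tooling, whose Kalman filter/smoother treats NaN observations as missing and fills them from the smoothed state. The series, dates, and local-level specification below are assumptions made for illustration, not this routine's behavior.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Daily revenue series with a simulated ERP blackout window (values set to NaN).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
revenue = pd.Series(100 + np.cumsum(rng.normal(0, 2, 120)), index=dates)
revenue.loc["2024-03-01":"2024-03-07"] = np.nan   # maintenance blackout

# A local-level state-space model: the Kalman filter skips the correction step
# for NaN observations, so the smoothed level bridges the blackout gap.
model = sm.tsa.UnobservedComponents(revenue, level="local level")
result = model.fit(disp=False)
imputed = revenue.fillna(pd.Series(result.smoothed_state[0], index=dates))
```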

2. Addressing COVID-19 Data Anomalies in Forecasting

The COVID-19 pandemic introduced significant anomalies in time series data, such as sudden drops in consumer demand or spikes in healthcare resource utilization. Kalman filters can be employed to smooth out these anomalies, identifying and filtering out extreme values that do not represent normal patterns. By integrating historical data trends and real-time observations, the Kalman filter helps in mitigating the impact of these outliers, allowing time series forecasting models to generate more accurate predictions. This is crucial for planning and resource allocation in uncertain times.

3. Cleaning Point-Based Anomalies with Kalman Filters

Time series data often contains point-based anomalies, such as sudden spikes or drops due to errors or unusual events. Kalman filters can detect and correct these anomalies by comparing observed values with expected values based on the underlying trend and seasonal patterns. By continuously updating its estimates, the Kalman filter can smooth out these irregularities, ensuring the integrity of the data used for forecasting. This process enhances the accuracy of predictive models, making them more robust against unexpected fluctuations in the data.
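
As a rough illustration of this compare-and-correct idea, the sketch below extends the minimal filter shown earlier: it flags observations whose innovation (observed minus predicted value) is large relative to the predicted uncertainty and carries the prediction across the spike. The threshold and noise settings are assumptions, not this routine's documented behavior.

```python
import numpy as np

def clean_point_anomalies(values, process_var=1e-2, measurement_var=1.0, z_thresh=3.0):
    """Flag and correct point anomalies with a 1-D local-level Kalman filter.

    An observation is treated as anomalous when its innovation exceeds
    z_thresh standard deviations of the innovation variance; anomalous points
    are skipped in the correction step, so the estimate bridges the spike.
    """
    x, p = values[0], 1.0
    cleaned = []
    for z in values:
        p_pred = p + process_var              # predicted variance after the time step
        s = p_pred + measurement_var          # innovation (residual) variance
        innovation = z - x
        if abs(innovation) > z_thresh * np.sqrt(s):
            x, p = x, p_pred                  # treat as missing: keep the prediction
        else:
            k = p_pred / s                    # Kalman gain
            x = x + k * innovation
            p = (1 - k) * p_pred
        cleaned.append(x)
    return np.array(cleaned)

series = np.r_[np.random.normal(10, 0.5, 50), [40.0], np.random.normal(10, 0.5, 49)]
cleaned = clean_point_anomalies(series)
```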

Routine Methods

1. Kalman Filter (Method)
  • Method: kalman_filter
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: With 100 GB of memory, this method completes in a couple of hours when run on data containing 25,000 targets and 7.4 million rows. When scaled to 27,500 targets and 20.1 million rows, the method completes in several hours. Runtime grows significantly with both the number of targets and the total row volume, with larger datasets driving higher computational and memory demands.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Filter the provided input data.
    • Detailed Description:

      • Run a Kalman smoother against the provided input data to impute missing values.
    • Inputs:

      • Required Input
        • Source Data Definition: The source data definition.
          • Name: source_data_definition
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Time Series Source Data
          • Nested Model: Time Series Source Data
            • Required Input
              • Connection: The connection to the source data.
                • Name: data_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be an instance of Tabular Connection
                • Nested Model: Tabular Connection
                  • Required Input
                    • Connection: The connection type to use to access the source data.
                      • Name: tabular_connection
                      • Tooltip:
                        • Validation Constraints:
                          • This input may be subject to other validation constraints at runtime.
                      • Type: Must be one of the following
                        • SQL Server Connection
                          • Required Input
                            • Database Resource: The name of the database resource to connect to.
                              • Name: database_resource
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Database Name: The name of the database to connect to.
                              • Name: database_name
                              • Tooltip:
                                • Detail:
                                  • Note: If you don’t see the database name that you are looking for in this list, first move the data you intend to use into a database that is available in this list.
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Table Name: The name of the table to use.
                              • Name: table_name
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Path: The full file path to the file to ingest.
                              • Name: file_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • Partitioned MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Type: The type of files to read from the directory.
                              • Name: file_type
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: FileExtensions_
                            • Directory Path: The full directory path containing partitioned tabular files.
                              • Name: directory_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
              • Dimension Columns: The columns to use as dimensions.
                • Name: dimension_columns
                • Tooltip:
                  • Validation Constraints:
                    • The input must have a minimum length of 1.
                    • This input may be subject to other validation constraints at runtime.
                • Type: list[str]
              • Date Column: The column to use as the date.
                • Name: date_column
                • Tooltip:
                  • Detail:
                    • The date column must be in a DateTime-readable format.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: str
              • Value Column: The column to use as the value.
                • Name: value_column
                • Tooltip:
                  • Detail:
                    • The value column must be a numeric (int, float, double, decimal, etc.) column.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: str
        • Acceptable Cleaning Values: The types of values allowed to be cleaned in the date range.
          • Name: clean_type
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: KalmanFilterAllowedCleanValues
      • Optional Input
        • Start Date: The start date of the data to filter.
          • Name: start_date
          • Tooltip:
            • Detail:
              • If provided, the start date should be earlier than 'end_date'. If left as 'None', the earliest date found in the dataset will be used.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Optional[datetime]
        • End Date: The end date of the data to filter.
          • Name: end_date
          • Tooltip:
            • Detail:
              • If provided, the end date should be later than 'start_date'. If left as 'None', the latest date found in the dataset will be used.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Optional[datetime]
        • Number of Seasons: The number of seasons in the data.
          • Name: n_seasons
          • Tooltip:
            • Detail:
              • The n_seasons parameter in the KalmanSmoother class defines the number of observations that complete one seasonal cycle, which is crucial when your time series data exhibits seasonal variation. Essentially, n_seasons tells the smoother how often the pattern repeats over time. For example, monthly data with a pattern that repeats every year calls for n_seasons of 12 (12 months in a year), while daily data with a weekly pattern calls for 7 (or 5 if only business days are recorded).
            • Validation Constraints:
              • The input must be greater than or equal to 2.
              • This input may be subject to other validation constraints at runtime.
          • Type: Optional[int]
    • Artifacts:

      • Cleansed Data: The cleansed data.

        • Qualified Key Annotation: cleansed_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cleansed_data/data_/data_<int>.parquet
            • A partitioned set of Parquet files where each file will contain no more than 1,000,000 rows.
      • New vs Old Plot: The new vs old plot.

        • Qualified Key Annotation: new_v_old_plot
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@new_v_old_plot/data_/plotly.pkl
            • A Python Plotly figure stored in a Python pickle file. Note that this is a binary file type and is not readable in .NET.
          • artifacts_/@new_v_old_plot/data_/plotly.html
            • An interactive HTML representation of the Plotly figure.
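
Downstream, the artifacts listed above can be read back with standard Python tooling. The sketch below follows the file-annotation patterns shown here; the artifact root location is an assumption and should be replaced with wherever the routine's output is actually materialized.

```python
import pickle
from pathlib import Path

import pandas as pd

# Assumed artifact root for illustration; substitute the actual output location.
artifact_root = Path("artifacts_")

# Cleansed data: a partitioned set of parquet files (<= 1,000,000 rows each).
# pandas (with pyarrow installed) can read the whole partition directory at once.
cleansed = pd.read_parquet(artifact_root / "@cleansed_data" / "data_")

# New vs old plot: a pickled Plotly figure plus a standalone interactive HTML file.
with open(artifact_root / "@new_v_old_plot" / "data_" / "plotly.pkl", "rb") as fh:
    figure = pickle.load(fh)
figure.show()  # rendering requires plotly installed in the reading environment
```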

Interface Definitions

No interface definitions found for this routine.

Developer Docs

Routine Typename: KalmanFilter

Method Name: kalman_filter
Artifact Keys: cleansed_data, new_v_old_plot
