Skip to main content

NumericDataFillRoutine

Versions

v1.0.0

Basic Information

Class Name: NumericDataFillRoutine

Title: Numeric Data Fill

Version: 1.0.0

Author: Evan Rasmussen

Organization: OneStream

Creation Date: 2024-04-05

Default Routine Memory Capacity: 2.0 GB

Tags

Data Transformation, Interpretability, Data Preprocessing, Data Cleansing

Description

Short Description

A routine to fill missing values in a dataset.

Long Description

This routine fills the missing values in a column or columns with the specified fill strategy. The fill strategy can be one of the following: fill with zero, fill with mean, fill with median, fill with mode, fill with minimum value, fill with maximum value, fill with a custom value, fill forward, or fill backward. The forward fill strategy will fill the missing values with the last non-null value for the matching dimensions. Important to note that these values will not just be the most recent value in the value column, but the most recent value for the target dimensions. The same can be said for the backward fill strategy. Because of this, it is possible that the final filled values may still contain null values. The custom fill value can only be specified when the fill with custom value strategy is selected. This routine is meant to be used only on numeric columns.

Use Cases

1. Preprocess Data

Data preprocessing is a critical step in the data analysis and machine learning pipeline, serving to ensure that datasets are primed for optimal performance. A common challenge encountered during this phase is the presence of missing values, which can significantly impede the effectiveness of analytical models and algorithms. To address this issue, data preprocessing includes techniques for accurately imputing or filling missing values, thereby creating a more complete and meaningful dataset. This routine provides a range of solutions for filling null values including zero, mean, median, mode, min, max, custom value, forward fill, and backward fill. For example, one may have a dataset containing values missing from a 3 week period when a new ERP system was implemented. Using the numeric data fill routine, one may elect to fill these missing values with the mean value of the column, or with the value from the most recent entry using a forward fill. By leveraging these strategies, analysts can ensure that their datasets are complete and prepared for further analysis and modeling.

Routine Methods

1. Numeric Data Fill (Method)
  • Method: numeric_data_fill
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: This method has been tested with three datasets of different sizes. The first dataset with 10k targets and 1.1M rows completed in 29 seconds with 2GB of memory. The second dataset with 50k targets and 7.5M rows completed in roughly 4 minutes with 10GB of memory. The third dataset with 40k targets and 29.2M rows completed in about 5 hours with 10 GB of memory. About 15% of the rows in selected column for each dataset were to be filled.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • This routine method is used to fill null values in a dataset with the specified fill strategy.
    • Detailed Description:

      • The fill strategy can be one of the following: fill with zero, fill with mean, fill with median, fill with mode, fill with min, fill with max, fill with custom value, fill forward, or fill backward. The custom value can only be specified when the Fill with custom value strategy is selected.
    • Inputs:

      • Required Input
        • Source Data Definition: The source data definition.
          • Name: source_data_definition
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Time Series Source Data
          • Nested Model: Time Series Source Data
            • Required Input
              • Connection: The connection to the source data.
                • Name: data_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be an instance of Tabular Connection
                • Nested Model: Tabular Connection
                  • Required Input
                    • Connection: The connection type to use to access the source data.
                      • Name: tabular_connection
                      • Tooltip:
                        • Validation Constraints:
                          • This input may be subject to other validation constraints at runtime.
                      • Type: Must be one of the following
                        • SQL Server Connection
                          • Required Input
                            • Database Resource: The name of the database resource to connect to.
                              • Name: database_resource
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Database Name: The name of the database to connect to.
                              • Name: database_name
                              • Tooltip:
                                • Detail:
                                  • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Table Name: The name of the table to use.
                              • Name: table_name
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Path: The full file path to the file to ingest.
                              • Name: file_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • Partitioned MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Type: The type of files to read from the directory.
                              • Name: file_type
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: FileExtensions_
                            • Directory Path: The full directory path containing partitioned tabular files.
                              • Name: directory_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
              • Dimension Columns: The columns to use as dimensions.
                • Name: dimension_columns
                • Tooltip:
                  • Validation Constraints:
                    • The input must have a minimum length of 1.
                    • This input may be subject to other validation constraints at runtime.
                • Type: list[str]
              • Date Column: The column to use as the date.
                • Name: date_column
                • Tooltip:
                  • Detail:
                    • The date column must in a DateTime readable format.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: str
              • Value Column: The column to use as the value.
                • Name: value_column
                • Tooltip:
                  • Detail:
                    • The value column must be a numeric (int, float, double, decimal, etc.) column.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: str
        • Data Fill Definition: List of columns and fill strategies.
          • Name: data_fill_definition
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Data Fill Definition
          • Nested Model: Data Fill Definition
            • Required Input
              • Set Start Date: The starting date range strategy to use when filling in the data.
                • Name: start_date
                • Tooltip:
                  • Detail:
                    • Global: fill all missing data in the dataset. Local: fill missing data within the target's date range. Date: fill missing data between two dates.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • Local Fill Range
                    • Required Input
                      • Local Fill Strategy: Fill missing data within the target's date range.
                        • Name: fill_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: Literal
                  • Global Fill Range
                    • Required Input
                      • Global Fill Strategy: Fill missing values between the minimum and maximum dates of the dataset.
                        • Name: fill_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: Literal
                  • Date Fill Range
                    • Required Input
                      • Start Date: The start date of the range of data to fill.
                        • Name: date
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: datetime
              • Set End Date: The end date range strategy to use when filling in the data.
                • Name: end_date
                • Tooltip:
                  • Detail:
                    • Global: fill all missing data in the dataset. Local: fill missing data within the target's date range. Date: fill missing data between two dates.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • Local Fill Range
                    • Required Input
                      • Local Fill Strategy: Fill missing data within the target's date range.
                        • Name: fill_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: Literal
                  • Global Fill Range
                    • Required Input
                      • Global Fill Strategy: Fill missing values between the minimum and maximum dates of the dataset.
                        • Name: fill_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: Literal
                  • Date Fill Range
                    • Required Input
                      • Start Date: The start date of the range of data to fill.
                        • Name: date
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: datetime
              • Fill Method: The method to use to fill in the missing data.
                • Name: fill_strategy
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: FillMethod_
              • Custom Fill Value: The custom value to use when filling in the data.
                • Name: custom_fill_value
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: int | float | NoneType
              • Fill Frequency: The frequency to use when filling in the data. If Auto is selected, the frequency will be determined automatically.
                • Name: fill_freq
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: FillFrequency_
    • Artifacts:

      • Filled Data Table: The data after filling in the missing values.
        • Qualified Key Annotation: filled_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@filled_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: NumericDataFillRoutine

Method NameArtifact Keys
numeric_data_fillfilled_data

Was this page helpful?