
FeatureEngineeringAnalysis

Basic Information

Class Name: FeatureEngineeringAnalysis

Title: Feature Engineering Analysis

Version: 1.0.0

Author: Kendall Haddigan

Organization: OneStream

Creation Date: 2025-09-23

Default Routine Memory Capacity: 2.0 GB

Tags

Data Preprocessing, Data Transformation

Description

Short Description

Comprehensive feature engineering and data preprocessing routine.

Long Description

This routine provides end-to-end feature engineering capabilities including null value handling, categorical encoding, numerical normalization, and transformation tracking. It supports various preprocessing strategies and maintains transformation metadata for reproducible data pipelines.

Use Cases

1. Data Preprocessing Pipeline

Use this routine to clean and prepare raw data for machine learning models and analytics workflows. It fills null values with statistical methods (mean, median, or mode for numerical features; the mode for categorical features), encodes categorical variables, normalizes numerical features, and automatically detects feature types so that each column receives an appropriate transformation. All transformations are tracked, so training and testing datasets can share identical preprocessing steps, supporting reliable, reproducible model development and deployment workflows.
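
The fill strategies named here are standard statistical operations. As a minimal pandas sketch (illustrative only, not the routine's implementation), the difference between numerical and categorical filling looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, None, 61000.0, 48000.0],   # numerical feature
    "segment": ["retail", "retail", None, "b2b"],  # categorical feature
})

# Numerical columns: fill with a statistic computed from the observed values.
df["income"] = df["income"].fillna(df["income"].mean())   # or .median()

# Categorical columns: fill with the most frequent value (the mode).
df["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])
```
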

2. Feature Engineering Automation

Automate feature engineering for data science workflows. The routine analyzes feature characteristics, selects an appropriate transformation strategy per feature type, and applies preprocessing techniques including robust scaling, min-max normalization, and categorical encoding. It maintains detailed transformation metadata and history, so pipelines can be reproduced across datasets, multiple transformations can be applied in sequence with their order preserved for reversibility, and data lineage is maintained throughout, making it well suited to production machine learning systems.
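
One way to picture the transformation metadata this use case relies on is a per-step log. The schema below is a hypothetical illustration (the routine's actual metadata format is not documented on this page) in which each entry stores enough to replay or invert its step:

```python
# Hypothetical transformation log: each entry records enough to replay a step
# on new data or to invert it later. This schema is an assumption for
# illustration, not the routine's actual metadata format.
transformation_history = [
    {"step": 1, "op": "fill_null", "column": "income", "method": "median",
     "value": 52000.0},
    {"step": 2, "op": "encode", "column": "segment", "method": "label",
     "mapping": {"b2b": 0, "retail": 1}},
    {"step": 3, "op": "normalize", "column": "income", "method": "standard",
     "mean": 53666.7, "std": 5437.9},
]

# Replaying entries in order reproduces the pipeline on a new dataset;
# walking them in reverse (as inverse_transform does) undoes each step
# using its stored parameters.
for entry in reversed(transformation_history):
    print(f"undo step {entry['step']}: {entry['op']} on {entry['column']}")
```
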

Routine Methods

1. Init (Constructor)
  • Method: __init__
    • Type: Constructor

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows and 4 feature columns, this method is expected to complete with approximately 5 GB of memory allocated; for 5M rows and 3 feature columns, approximately 12 GB; for 10M rows and 3 feature columns, approximately 25 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Initialize the feature engineering routine with actual data.
    • Detailed Description:

      • Analyzes and stores metadata about the data that must hold true over the routine lifecycle.
    • Inputs:

      • Required Input
        • Data Source Selection: Data source to transform.
          • Name: datasource
          • Tooltip:
            • Detail:
              • Select the data source for feature engineering
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If the database name you are looking for does not appear in this list, first move the data into a database that is available in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Dimension Selection: Dimension columns for row tracking.
          • Name: dimensions
          • Tooltip:
            • Detail:
              • Select columns that uniquely identify rows
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Feature Selection: Feature columns to transform.
          • Name: feature_columns
          • Tooltip:
            • Detail:
              • Select the columns to apply transformations to
            • Validation Constraints:
              • The input must have a minimum length of 1.
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
    • Artifacts: No artifacts are returned by this method

2. Encode (Method)
  • Method: encode
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows, this method is expected to complete in around 10 seconds with 2 GB of memory allocated. For 5M rows, it is expected to complete in around 8 seconds with 5 GB of memory allocated. For 10M rows, memory requirements are expected to scale proportionally, to approximately 10 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Encode categorical features using the specified encoding method.
    • Detailed Description:

      • This method transforms categorical features into numerical representations that machine learning algorithms can consume. It supports several encoding strategies, including label encoding, one-hot encoding, and target encoding. The method automatically identifies categorical features and applies the chosen encoding while preserving the ability to reverse the transformation later. A short illustrative sketch follows this method's entry.
    • Inputs:

      • Required Input
        • Encoding Method: Encoding method to apply to categorical features.
          • Name: method
          • Tooltip:
            • Detail:
              • Select encoding technique for categorical features
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts:

      • Encoded Data: DataFrame with categorical features encoded using the specified method.
        • Qualified Key Annotation: transformed_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@transformed_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
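
As a rough illustration of the encoding strategies this method names, here is a minimal pandas sketch of label encoding with a stored, invertible mapping; one-hot encoding is noted in a comment. This sketches the techniques, not the routine's internals:

```python
import pandas as pd

df = pd.DataFrame({"segment": ["retail", "b2b", "retail", "gov"]})

# Label encoding: map each category to a stable integer code, and keep the
# mapping so the transformation can be reversed later.
categories = sorted(df["segment"].dropna().unique())
mapping = {cat: code for code, cat in enumerate(categories)}
df["segment"] = df["segment"].map(mapping)

# One-hot encoding is the other common strategy (one 0/1 column per category):
#   pd.get_dummies(df, columns=["segment"])

# Reversal uses the inverted mapping:
inverse = {code: cat for cat, code in mapping.items()}
df["segment"] = df["segment"].map(inverse)
```
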
3. Fill Null (Method)
  • Method: fill_null
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows, this method is expected to complete in around 8 seconds with 2 GB of memory allocated. For 5M rows, around 11 seconds with 5 GB. For 10M rows, around 14 seconds with 10 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Fill null values in the dataset using the specified statistical method.
    • Detailed Description:

      • This method processes the transformed dataset to identify and fill null values across all feature columns using the specified filling method. It supports several strategies, including mean, median, mode, zero, and none (no filling). The method automatically detects whether features are categorical or numerical and applies an appropriate filling strategy. All transformations are tracked in the feature column metadata to maintain a complete history of applied operations for reproducibility and potential reversal. A short illustrative sketch follows this method's entry.
    • Inputs:

      • Required Input
        • Fill Null Method: Method to fill null values.
          • Name: method
          • Tooltip:
            • Detail:
              • Select how to handle missing values in the dataset
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts:

      • Transformed Data: DataFrame with null values filled using the specified method.
        • Qualified Key Annotation: transformed_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@transformed_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
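
The automatic categorical-versus-numerical detection described above can be pictured as a dtype check. A minimal sketch, assuming pandas semantics (non-numeric dtypes are treated as categorical) and the method names from this page's input list; this mirrors the described behavior, not the routine's implementation:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_nulls(df: pd.DataFrame, method: str = "median") -> pd.DataFrame:
    """Illustrative only: fill numerical columns with the requested statistic
    and categorical columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            if method == "mean":
                fill = out[col].mean()
            elif method == "median":
                fill = out[col].median()
            elif method == "mode":
                fill = out[col].mode().iloc[0]
            elif method == "zero":
                fill = 0
            else:                                # "none": leave nulls untouched
                continue
        else:
            fill = out[col].mode().iloc[0]       # categorical mode
        out[col] = out[col].fillna(fill)
    return out
```
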
4. Inverse Transform (Method)
  • Method: inverse_transform
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows and 4 feature columns, this method is expected to complete in around 8 seconds with 5 GB of memory allocated. For 5M rows and 3 feature columns, around 8 seconds with 15 GB. For 10M rows and 3 feature columns, around 4 to 22 seconds with 20 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Reverse all applied transformations to return data to its original form.
    • Detailed Description:

      • This method undoes all feature engineering transformations that have been applied to the transformed dataset, returning it to its original state. It processes the transformation history in reverse order, applying inverse operations for normalization, encoding, and null filling. This capability is essential for interpreting model results in terms of the original data features and values. A short illustrative sketch follows this method's entry.
    • Inputs:

      • Required Input
        • Transformed Data Source: Transformed dataset to reverse transformations on.
          • Name: transformed_datasource
          • Tooltip:
            • Detail:
              • Select the transformed data source to reverse transformations
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If the database name you are looking for does not appear in this list, first move the data into a database that is available in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
    • Artifacts:

      • Inverse Transformed Data: DataFrame with transformations reversed to original form.
        • Qualified Key Annotation: original_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@original_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
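
A minimal sketch of the reverse-order replay this method describes, reusing the hypothetical history schema from the earlier sketch (not the routine's actual metadata):

```python
import pandas as pd

df = pd.DataFrame({"income": [-0.31, 1.35, -1.04], "segment": [1, 0, 1]})

# Hypothetical history, listed in the order the transformations were
# originally applied (same illustrative schema as the earlier sketch).
history = [
    {"op": "encode", "column": "segment", "mapping": {"b2b": 0, "retail": 1}},
    {"op": "normalize", "column": "income", "mean": 53666.7, "std": 5437.9},
]

for entry in reversed(history):          # the last transformation is undone first
    col = entry["column"]
    if entry["op"] == "normalize":       # x_orig = x_scaled * std + mean
        df[col] = df[col] * entry["std"] + entry["mean"]
    elif entry["op"] == "encode":        # integer code -> original category
        inverse = {v: k for k, v in entry["mapping"].items()}
        df[col] = df[col].map(inverse)
```
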
5. Normalize (Method)
  • Method: normalize
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows, this method is expected to complete in around 5 seconds with 2 GB of memory allocated. For 5M rows, around 15 seconds with 5 GB. For 10M rows, around 1 second with 10 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Normalize numerical features using the specified normalization method.
    • Detailed Description:

      • This method scales numerical features so they have similar ranges and distributions, which improves the performance and stability of machine learning algorithms. It supports several normalization techniques, including standard scaling, min-max scaling, robust scaling, and max-abs scaling. The method automatically identifies numerical features and applies the chosen normalization while maintaining transformation metadata for reversibility. A short illustrative sketch follows this method's entry.
    • Inputs:

      • Required Input
        • Normalization Method: Normalization method to apply.
          • Name: method
          • Tooltip:
            • Detail:
              • Select normalization technique for numerical features
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts:

      • Normalized Data: DataFrame with features normalized using the specified method.
        • Qualified Key Annotation: transformed_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@transformed_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
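
The four techniques named above have standard definitions. A short NumPy sketch shows each on a single column, including why robust scaling is preferred when outliers are present; the data values are illustrative:

```python
import numpy as np

x = np.array([12.0, 18.0, 25.0, 31.0, 400.0])    # note the outlier

standard = (x - x.mean()) / x.std()              # zero mean, unit variance
min_max = (x - x.min()) / (x.max() - x.min())    # rescaled to [0, 1]
max_abs = x / np.abs(x).max()                    # rescaled to [-1, 1]

# Robust scaling uses the median and interquartile range, so the outlier
# has far less influence on the scaled values.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)
```
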
6. Transform (Method)
  • Method: transform
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: This method has been tested with various dataset sizes, in each case with 3 feature columns configured in the constructor and 3 distinct transformations completed per feature column. For a dataset with 2M rows, it completed in approximately 1 minute with 10 GB of memory allocated; for 5M rows, approximately 1 minute with 20 GB; for 10M rows, approximately 3 minutes with 25 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Apply all configured transformations to new data using learned transformation parameters.
    • Detailed Description:

      • This method applies the previously learned transformations to new data so that it is processed consistently with the original training data. It validates that the new data has the same structure and feature types as the original data before applying any transformations. A short illustrative sketch follows this method's entry.
    • Inputs:

      • Required Input
        • New Data Source: New dataset to apply transformations to.
          • Name: new_datasource
          • Tooltip:
            • Detail:
              • Select the new data source to transform using learned transformations
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If the database name you are looking for does not appear in this list, first move the data into a database that is available in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
    • Artifacts:

      • Fully Transformed Data: DataFrame with all transformations applied in sequence.
        • Qualified Key Annotation: transformed_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@transformed_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
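
"Learned transformation parameters" means statistics fitted on the original data are reused unchanged on new data. A minimal sketch with an explicit column-compatibility check; the function name and history schema are illustrative assumptions, not the routine's API:

```python
import pandas as pd

def apply_learned(new_df: pd.DataFrame, history: list) -> pd.DataFrame:
    """Illustrative only: replay stored steps on new data, failing fast when
    expected columns are missing (a simple stand-in for the compatibility
    validation this method describes)."""
    expected = {entry["column"] for entry in history}
    missing = expected - set(new_df.columns)
    if missing:
        raise ValueError(f"new data is missing expected columns: {missing}")

    out = new_df.copy()
    for entry in history:                    # original order of operations
        col = entry["column"]
        if entry["op"] == "fill_null":       # reuse the fitted fill value
            out[col] = out[col].fillna(entry["value"])
        elif entry["op"] == "normalize":     # reuse the fitted mean/std
            out[col] = (out[col] - entry["mean"]) / entry["std"]
        elif entry["op"] == "encode":        # reuse the fitted mapping
            out[col] = out[col].map(entry["mapping"])
    return out
```
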
7. Weight (Method)
  • Method: weight
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows, this method is expected to complete in around 5 seconds with 2 GB of memory allocated. For 5M rows, around 12 seconds with 5 GB. For 10M rows, around 17 seconds with 10 GB.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Apply feature weighting to adjust the importance of multiple features.
    • Detailed Description:

      • This method processes a list of feature weights collected via the Add/Continue workflow pattern. Each weight multiplies its feature's values, and the weights are stored as metadata for downstream algorithms (such as clustering) to use when emphasizing or de-emphasizing features. A short illustrative sketch follows this method's entry.
    • Inputs:

      • Required Input
        • Configure Feature Weighting: Configure weights for multiple features.
          • Name: feature_weighting
          • Tooltip:
            • Detail:
              • Add feature weights to adjust importance in analysis.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[FeatureEngineeringWeighting]
    • Artifacts:

      • Weighted Data: Dataset with weighted features

        • Qualified Key Annotation: transformed_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@transformed_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • Feature Weights: Dictionary of feature names and their weights

        • Qualified Key Annotation: feature_weights
        • Aggregate Artifact: False
        • In-Memory Json Accessible: True
        • File Annotations:
          • artifacts_/@feature_weights/data_/data.json
            • Stored json data. The schema is not known until runtime.
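
A minimal sketch of the weighting idea, mirroring the two artifacts above (a weighted dataset plus a JSON-serializable weights dictionary); the column names and weight values are illustrative:

```python
import json
import pandas as pd

df = pd.DataFrame({"income": [0.2, 0.8, 0.5], "tenure": [0.1, 0.9, 0.4]})

# Illustrative weights: values > 1 emphasize a feature, values < 1
# de-emphasize it in downstream distance-based algorithms.
feature_weights = {"income": 2.0, "tenure": 0.5}

for col, w in feature_weights.items():
    df[col] = df[col] * w        # scale the feature column in place

# Downstream consumers (e.g. clustering) can read the weights back from a
# JSON artifact like the one this method emits.
print(json.dumps(feature_weights))
```
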

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: FeatureEngineeringAnalysis

Method Name        Artifact Keys
__init__           N/A
encode             transformed_data
fill_null          transformed_data
inverse_transform  original_data
normalize          transformed_data
transform          transformed_data
weight             transformed_data, feature_weights
