EnsembleSHAPFeatureImpact

Versions

v0.1.0

Basic Information

Class Name: EnsembleSHAPFeatureImpact

Title: Ensemble SHAP Feature Impact

Version: 0.1.0

Author: Tyler Donovan, Spencer Lustilla

Organization: OneStream

Creation Date: 2025-05-26

Default Routine Memory Capacity: 2 GB

Tags

Classification, Regression, Supervised, Optimization, Dimensionality Reduction, Ensemble, Feature Selection, Data Analysis, Interpretability, Data Visualization

Description

Short Description

A routine that uses feature importance rank ensembling to determine the most important features.

Long Description

This Routine trains an ensemble of models on the provided dataset and uses feature importance rank ensembling (FIRE) to determine the most important features in the dataset. Reducing the feature set can lower the computational power and time spent on training and prediction for new data, and removing noisy or redundant features can increase model accuracy. The Routine also calculates SHapley Additive exPlanations (SHAP) values to assist with feature selection and to give insight into how features' values drive the output variable.
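For intuition on what a SHAP value represents, a purely linear model has a closed form: each feature's contribution for a sample is its weight times the feature's deviation from the dataset mean, and the contributions sum to the difference between the sample's prediction and the average prediction. The sketch below is an illustrative toy with invented weights and data, not the routine's internal code:

```python
# Toy illustration of SHAP values (not the routine's implementation).
# For a linear model f(x) = b + sum_i(w_i * x_i) with independent features,
# the exact SHAP value of feature i for sample x is w_i * (x_i - mean_i),
# where mean_i is the feature's average over the background dataset.

def linear_shap(weights, sample, background):
    """Exact SHAP values for a linear model with independent features."""
    n_features = len(weights)
    means = [sum(row[i] for row in background) / len(background)
             for i in range(n_features)]
    return [w * (x - m) for w, x, m in zip(weights, sample, means)]

weights = [2.0, -1.0, 0.5]          # hypothetical model coefficients
background = [[1.0, 2.0, 0.0],      # hypothetical training data
              [3.0, 0.0, 4.0]]      # (feature means are [2.0, 1.0, 2.0])
sample = [3.0, 2.0, 2.0]

phi = linear_shap(weights, sample, background)
# phi is [2.0, -1.0, 0.0]; positive values push the prediction up,
# negative values push it down, and sum(phi) == f(sample) - f(mean).
```

Tree ensembles do not admit this simple closed form, which is why SHAP implementations use model-specific algorithms, but the additive "contribution per feature" interpretation is the same.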

Use Cases

1. Feature Selection

When working with predictive modeling and high-dimensional data analysis, the sheer volume of features can overwhelm AI/ML algorithms. Irrelevant, noisy, or redundant variables not only inflate training time but also obscure the true drivers of your target outcome, leading to overfitting and reduced interpretability. The EnsembleSHAPFeatureImpact (ESFI) routine addresses these issues by deploying an ensemble of models, such as tree-based and linear models, to gather diverse feature impacts across the top-performing models. By aggregating the feature impact metrics across the models, the algorithm produces a consensus ranking that is less biased than any single approach. Feature reduction accelerates model training, improves generalization on unseen data, and sharpens insight into which inputs truly drive the prediction. For data scientists and analysts working with complex, multi-source datasets where feature definitions may vary or carry redundant signals, this preprocessing step ensures that only the highest-impact variables flow into downstream models. The result is a feature set that can enhance predictive accuracy, simplify model interpretation, and reduce the computation power and time required.
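The consensus-ranking idea can be sketched in a few lines. This is a deliberate simplification of rank ensembling, not the routine's actual FIRE implementation; the model names and importance scores below are invented:

```python
# Simplified rank-ensembling sketch (illustrative only, not routine source).
# Each model reports per-feature importance scores; scores are converted to
# ranks (1 = most important) and the ranks are averaged across models.

def ensemble_rank(importances_by_model):
    """Return features ordered by mean rank across all models."""
    features = list(next(iter(importances_by_model.values())))
    rank_sums = {f: 0 for f in features}
    for scores in importances_by_model.values():
        # Sort features by descending importance to assign ranks per model.
        ordered = sorted(features, key=lambda f: -scores[f])
        for rank, feature in enumerate(ordered, start=1):
            rank_sums[feature] += rank
    n_models = len(importances_by_model)
    mean_ranks = {f: rank_sums[f] / n_models for f in features}
    # Lower mean rank = more consistently important across models.
    return sorted(features, key=lambda f: mean_ranks[f])

importances = {                      # hypothetical per-model scores
    "gradient_boosting": {"age": 0.6, "income": 0.3, "region": 0.1},
    "linear_model":      {"age": 0.2, "income": 0.7, "region": 0.1},
    "random_forest":     {"age": 0.5, "income": 0.4, "region": 0.1},
}

consensus = ensemble_rank(importances)
# "region" ranks last in every model, so it lands last in the consensus,
# even though "age" and "income" trade places between individual models.
```

Averaging ranks instead of raw scores is what makes the consensus robust: importance scores from different model families are not on a comparable scale, but their rankings are.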

Routine Methods

1. Init (Constructor)
  • Method: __init__
    • Type: Constructor

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: N/A

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • The constructor method for Ensemble SHAP Feature Impact Routine.
    • Detailed Description:

      • This method initializes the data connection and starts a model ensemble project by analyzing the provided data.
    • Inputs:

      • Required Input
        • Source Connection: The connection to source data.
          • Name: data_connection
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Target Column: The target column to train models on.
          • Name: target_col
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Target Column Type: The type of target output column (Categorical or Numerical).
          • Name: target_type
          • Tooltip:
            • Detail:
• Auto will attempt to classify the data; however, an explicit selection ensures the target type is correct.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts: No artifacts are returned by this method

2. Get Feature Impact (Method)
  • Method: get_feature_impact
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: This method was tested with a dataset of 100K rows, 10 numerical columns, and 3 categorical columns; the run completed in about 18 minutes with 100 GB of memory.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Select the most important features from the dataset.
    • Detailed Description:

      • Generate a tabular dataset containing the selected features and their impacts along with a static visual and an interactive web dashboard to conduct additional analysis with.
    • Inputs:

      • Required Input
        • Model Test Subset Size: The proportion of data to split for testing the models. (0.0, 0.5].
          • Name: test_size
          • Tooltip:
            • Detail:
• Higher test proportions risk undertraining the models. (Default is 0.2)
            • Validation Constraints:
              • The input must be greater than 0.0.
              • The input must be less than or equal to 0.5.
              • This input may be subject to other validation constraints at runtime.
          • Type: float
        • Number of Models for FIRE: The number of models to use for FIRE feature selection. [1-7].
          • Name: n_models
          • Tooltip:
            • Detail:
              • Too few models can introduce model feature bias; too many can potentially use inaccurate models. (Default is 5)
            • Validation Constraints:
              • The input must be greater than or equal to 1.
              • The input must be less than or equal to 7.
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Feature Selection Threshold: The percentage of cumulative feature impact to determine important features. (0, 1].
          • Name: threshold
          • Tooltip:
            • Detail:
• Lower values are more restrictive and select fewer features, but risk leaving out important features. (Default is 0.95)
            • Validation Constraints:
              • The input must be greater than 0.0.
              • The input must be less than or equal to 1.0.
              • This input may be subject to other validation constraints at runtime.
          • Type: float
    • Artifacts:

  • Selected Feature Subset: A filtered dataset containing only the features with the highest impact, as determined by ensembled SHAP impact. It can be reused for downstream model training or diagnostics. Columns include: ['feature_1', ..., 'feature_n'].

        • Qualified Key Annotation: selected_feature_subset
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@selected_feature_subset/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • Ensemble Feature Impact Graph: A confidence interval plot showing the normalized feature importance value across the best models.

        • Qualified Key Annotation: feature_importance_graph
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@feature_importance_graph/data_/plotly.pkl
• A Python Plotly figure stored in a Python pickle file. Note: this is a binary file type and is not readable in .NET.
          • artifacts_/@feature_importance_graph/data_/plotly.html
            • An interactive html representation of the plotly figure.
  • Feature Impact Web Dashboard: A web dashboard that provides the ability to explore how features drive the target variable.

        • Qualified Key Annotation: web_dashboard
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@web_dashboard/data_/data.appref
• A JSON file of data relating to the web app.
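To illustrate how the Feature Selection Threshold input behaves, the sketch below keeps the top-ranked features until their cumulative normalized impact reaches the threshold. This is an assumed simplification using hypothetical impact scores; the routine's exact cutoff logic may differ:

```python
# Illustrative cumulative-impact cutoff (assumed behavior, not routine code).

def select_by_cumulative_impact(impacts, threshold=0.95):
    """Keep top features until their normalized impacts sum to `threshold`."""
    total = sum(impacts.values())
    ranked = sorted(impacts.items(), key=lambda kv: -kv[1])
    selected, cumulative = [], 0.0
    for feature, impact in ranked:
        selected.append(feature)
        cumulative += impact / total
        # Small tolerance guards against floating-point rounding at the cutoff.
        if cumulative >= threshold - 1e-9:
            break
    return selected

impacts = {"f1": 50.0, "f2": 30.0, "f3": 15.0, "f4": 5.0}  # hypothetical
print(select_by_cumulative_impact(impacts, threshold=0.95))
# f1 + f2 + f3 cover 95% of the total impact, so f4 is dropped.
```

This is why lower thresholds are more restrictive: a threshold of 0.8 would stop after f1 and f2, while 1.0 keeps every feature.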

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: EnsembleSHAPFeatureImpact

Method Name        | Artifact Keys
__init__           | N/A
get_feature_impact | selected_feature_subset, feature_importance_graph, web_dashboard