EnsembleSHAPFeatureImpact

Versions

v0.1.0

Basic Information

Class Name: EnsembleSHAPFeatureImpact

Title: Ensemble SHAP Feature Impact

Version: 0.1.0

Author: Tyler Donovan, Spencer Lustilla

Organization: OneStream

Creation Date: 2025-05-26

Default Routine Memory Capacity: 2 GB

Tags

Classification, Regression, Supervised, Optimization, Dimensionality Reduction, Ensemble, Feature Selection, Data Analysis, Interpretability, Data Visualization

Description

Short Description

A routine that uses feature importance rank ensembling to determine the most important features.

Long Description

This Routine trains an ensemble of models on the provided dataset and uses feature importance rank ensembling (FIRE) to determine the most important features in the dataset. Reducing the feature set can lower the computational power and time spent on training and prediction for new data, and removing noisy or redundant features can increase model accuracy. The Routine also calculates SHapley Additive exPlanations (SHAP) values to assist with feature selection and to give insight into how features' values drive the output variable.
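For intuition on what a SHAP value represents, a purely linear model has a closed form: each feature's contribution for a sample is its weight times the feature's deviation from the dataset mean, and the contributions sum to the difference between the sample's prediction and the average prediction. The sketch below is an illustrative toy with invented weights and data, not the routine's internal code:

```python
# Toy illustration of SHAP values (not the routine's implementation).
# For a linear model f(x) = b + sum_i(w_i * x_i) with independent features,
# the exact SHAP value of feature i for sample x is w_i * (x_i - mean_i),
# where mean_i is the feature's average over the background dataset.

def linear_shap(weights, sample, background):
    """Exact SHAP values for a linear model with independent features."""
    n_features = len(weights)
    means = [sum(row[i] for row in background) / len(background)
             for i in range(n_features)]
    return [w * (x - m) for w, x, m in zip(weights, sample, means)]

weights = [2.0, -1.0, 0.5]          # hypothetical model coefficients
background = [[1.0, 2.0, 0.0],      # hypothetical training data
              [3.0, 0.0, 4.0]]      # (feature means are [2.0, 1.0, 2.0])
sample = [3.0, 2.0, 2.0]

phi = linear_shap(weights, sample, background)
# phi is [2.0, -1.0, 0.0]; positive values push the prediction up,
# negative values push it down, and sum(phi) == f(sample) - f(mean).
```

Tree ensembles do not admit this simple closed form, which is why SHAP implementations use model-specific algorithms, but the additive "contribution per feature" interpretation is the same.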

Use Cases

1. Feature Selection

When working with predictive modeling and high-dimensional data analysis, the sheer volume of features can overwhelm AI/ML algorithms. Irrelevant, noisy, or redundant variables not only inflate training time but also obscure the true drivers of your target outcome, leading to overfitting and reduced interpretability. The EnsembleSHAPFeatureImpact (ESFI) routine addresses these issues by deploying an ensemble of models, such as tree-based and linear models, to gather diverse feature impacts across the top-performing models. By aggregating the feature impact metrics across the models, the algorithm produces a consensus ranking that is less biased than any single approach. Feature reduction accelerates model training, improves generalization on unseen data, and sharpens insight into which inputs truly drive the prediction. For data scientists and analysts working with complex, multi-source datasets where feature definitions may vary or carry redundant signals, this preprocessing step ensures that only the highest-impact variables flow into downstream models. The result is a feature set that can enhance predictive accuracy, simplify model interpretation, and reduce the computation power and time required.
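The consensus-ranking idea can be sketched in a few lines. This is a deliberate simplification of rank ensembling, not the routine's actual FIRE implementation; the model names and importance scores below are invented:

```python
# Simplified rank-ensembling sketch (illustrative only, not routine source).
# Each model reports per-feature importance scores; scores are converted to
# ranks (1 = most important) and the ranks are averaged across models.

def ensemble_rank(importances_by_model):
    """Return features ordered by mean rank across all models."""
    features = list(next(iter(importances_by_model.values())))
    rank_sums = {f: 0 for f in features}
    for scores in importances_by_model.values():
        # Sort features by descending importance to assign ranks per model.
        ordered = sorted(features, key=lambda f: -scores[f])
        for rank, feature in enumerate(ordered, start=1):
            rank_sums[feature] += rank
    n_models = len(importances_by_model)
    mean_ranks = {f: rank_sums[f] / n_models for f in features}
    # Lower mean rank = more consistently important across models.
    return sorted(features, key=lambda f: mean_ranks[f])

importances = {                      # hypothetical per-model scores
    "gradient_boosting": {"age": 0.6, "income": 0.3, "region": 0.1},
    "linear_model":      {"age": 0.2, "income": 0.7, "region": 0.1},
    "random_forest":     {"age": 0.5, "income": 0.4, "region": 0.1},
}

consensus = ensemble_rank(importances)
# "region" ranks last in every model, so it lands last in the consensus,
# even though "age" and "income" trade places between individual models.
```

Averaging ranks instead of raw scores is what makes the consensus robust: importance scores from different model families are not on a comparable scale, but their rankings are.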

Routine Methods

1. Init (Constructor)
  • Method: __init__
    • Type: Constructor

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: N/A

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • The constructor method for Ensemble SHAP Feature Impact Routine.
    • Detailed Description:

      • This method initializes the data connection and starts a model ensemble project by analyzing the provided data.
    • Inputs:

      • Required Input
        • Source Connection: The connection to source data.
          • Name: data_connection
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Target Column: The target column to train models on.
          • Name: target_col
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Target Column Type: The type of target output column (Categorical or Numerical).
          • Name: target_type
          • Tooltip:
            • Detail:
• Auto will attempt to classify the data; however, an explicit selection ensures the target type is correct.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts: No artifacts are returned by this method

2. Get Feature Impact (Method)
  • Method: get_feature_impact
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: This method was tested with a dataset of 100K rows, 10 numerical columns, and 3 categorical columns; the run completed in about 18 minutes with 100 GB of memory.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Select the most important features from the dataset.
    • Detailed Description:

      • Generate a tabular dataset containing the selected features and their impacts along with a static visual and an interactive web dashboard to conduct additional analysis with.
    • Inputs:

      • Required Input
        • Model Test Subset Size: The proportion of data to split for testing the models. (0.0, 0.5].
          • Name: test_size
          • Tooltip:
            • Detail:
• Higher test proportions risk undertraining the models. (Default is 0.2)
            • Validation Constraints:
              • The input must be greater than 0.0.
              • The input must be less than or equal to 0.5.
              • This input may be subject to other validation constraints at runtime.
          • Type: float
        • Number of Models for FIRE: The number of models to use for FIRE feature selection. [1-7].
          • Name: n_models
          • Tooltip:
            • Detail:
              • Too few models can introduce model feature bias; too many can potentially use inaccurate models. (Default is 5)
            • Validation Constraints:
              • The input must be greater than or equal to 1.
              • The input must be less than or equal to 7.
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Feature Selection Threshold: The percentage of cumulative feature impact to determine important features. (0, 1].
          • Name: threshold
          • Tooltip:
            • Detail:
• Lower values are more restrictive and select fewer features, but risk leaving out important features. (Default is 0.95)
            • Validation Constraints:
              • The input must be greater than 0.0.
              • The input must be less than or equal to 1.0.
              • This input may be subject to other validation constraints at runtime.
          • Type: float
    • Artifacts:

  • Selected Feature Subset: A filtered dataset containing only the features with the highest impact, as determined by ensembled SHAP impact. It can be reused for downstream model training or diagnostics. Columns include: ['feature_1', ..., 'feature_n'].

        • Qualified Key Annotation: selected_feature_subset
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@selected_feature_subset/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • Ensemble Feature Impact Graph: A confidence interval plot showing the normalized feature importance value across the best models.

        • Qualified Key Annotation: feature_importance_graph
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@feature_importance_graph/data_/plotly.pkl
• A Python Plotly figure stored in a Python pickle file. Note: this is a binary file type and is not readable in .NET.
          • artifacts_/@feature_importance_graph/data_/plotly.html
            • An interactive html representation of the plotly figure.
  • Feature Impact Web Dashboard: A web dashboard that provides the ability to explore how features drive the target variable.

        • Qualified Key Annotation: web_dashboard
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@web_dashboard/data_/data.appref
• A JSON file of data relating to the web app.
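To illustrate how the Feature Selection Threshold input behaves, the sketch below keeps the top-ranked features until their cumulative normalized impact reaches the threshold. This is an assumed simplification using hypothetical impact scores; the routine's exact cutoff logic may differ:

```python
# Illustrative cumulative-impact cutoff (assumed behavior, not routine code).

def select_by_cumulative_impact(impacts, threshold=0.95):
    """Keep top features until their normalized impacts sum to `threshold`."""
    total = sum(impacts.values())
    ranked = sorted(impacts.items(), key=lambda kv: -kv[1])
    selected, cumulative = [], 0.0
    for feature, impact in ranked:
        selected.append(feature)
        cumulative += impact / total
        # Small tolerance guards against floating-point rounding at the cutoff.
        if cumulative >= threshold - 1e-9:
            break
    return selected

impacts = {"f1": 50.0, "f2": 30.0, "f3": 15.0, "f4": 5.0}  # hypothetical
print(select_by_cumulative_impact(impacts, threshold=0.95))
# f1 + f2 + f3 cover 95% of the total impact, so f4 is dropped.
```

This is why lower thresholds are more restrictive: a threshold of 0.8 would stop after f1 and f2, while 1.0 keeps every feature.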

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: EnsembleSHAPFeatureImpact

Method Name        | Artifact Keys
__init__           | N/A
get_feature_impact | selected_feature_subset, feature_importance_graph, web_dashboard