
CausalAnalysisRoutine

Versions

v0.1.0

Basic Information

Class Name: CausalAnalysisRoutine

Title: Causal Forest Estimator

Version: 0.1.0

Author: Danny Vega, Sam Hastie, Eric Zhang

Organization: OneStream

Creation Date: 2025-06-27

Default Routine Memory Capacity: 2.0 GB

Tags

Data Analysis, Data Visualization, Metrics

Description

Short Description

A causal estimator in EconML for revealing how treatment effects vary across individuals.

Long Description

Causal inference tools go beyond correlation to uncover true cause-and-effect relationships in data. CausalForestDML uses double machine learning (DML) to separate the effect of the treatment from other factors, and builds a forest of decision trees that learns how the treatment effects differ across subgroups. This helps target interventions more effectively by revealing which subgroups are impacted most and why.
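For orientation, the snippet below is a minimal sketch of fitting EconML's CausalForestDML directly. The synthetic data, variable names, and nuisance-model choices are illustrative assumptions, not the routine's internal implementation.

```python
# Minimal sketch of CausalForestDML (EconML); data and model choices are illustrative.
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))          # features driving effect heterogeneity (CATE dimensions)
W = rng.normal(size=(n, 2))          # controls / confounders
T = rng.binomial(1, 0.5, size=n)     # binary treatment
Y = (1.0 + X[:, 0]) * T + W @ np.array([0.5, -0.3]) + rng.normal(size=n)  # outcome

est = CausalForestDML(
    model_y=RandomForestRegressor(min_samples_leaf=10),   # outcome (nuisance) model
    model_t=RandomForestClassifier(min_samples_leaf=10),  # treatment (nuisance) model
    discrete_treatment=True,
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=10,
    cv=5,
    random_state=0,
)
est.fit(Y, T, X=X, W=W)

cate = est.effect(X)   # per-row conditional average treatment effects
ate = est.ate(X)       # average treatment effect across the sample
```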

Use Cases

1. Causal Relationship Analysis

Businesses often want to know not just if something works but where, when, and for whom it works best. A policy, promotion, or event might boost outcomes in some places while having little or no impact elsewhere. Consider a retailer analyzing the impact of holidays on weekly sales. Defining the treatment as whether a week includes a holiday, the outcome as weekly sales, and the covariate as the store lets the user examine the causal relationship between holidays and weekly sales, and how it differs across a network of stores. CausalForestDML allows the user to estimate conditional average treatment effects (CATEs), the impact on each individual store, as well as the average treatment effect (ATE), the average impact on sales during a holiday week across all stores.
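A hedged sketch of this use case, assuming a hypothetical dataset with is_holiday_week, weekly_sales, and store_id columns (these names are illustrative and not part of the routine):

```python
# Sketch of the holiday / weekly-sales use case; dataset and column names are assumptions.
import pandas as pd
from econml.dml import CausalForestDML

df = pd.read_parquet("weekly_sales.parquet")           # hypothetical dataset
X = pd.get_dummies(df["store_id"], prefix="store")     # store identity drives heterogeneity
T = df["is_holiday_week"]                              # 1 if the week contains a holiday
Y = df["weekly_sales"]

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)

df["holiday_cate"] = est.effect(X)                     # per-row (per store-week) holiday effect
ate = est.ate(X)                                       # average effect across all stores
per_store = df.groupby("store_id")["holiday_cate"].mean()
```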

2. Change in Treatment Value

In many business scenarios, interventions like discounts, promotions, or feature rollouts can affect different customer or product segments in varying ways. Understanding these heterogeneous treatment effects is essential for effective targeting and resource allocation. CausalForestDML can be used to model the impact of a change in treatment level. A user may seek to understand how changing a discount level from 10% (initial treatment value, T0) to 20% (new treatment value, T1) affects the number of units sold across a range of product categories. The model will estimate the causal effect of increasing the discount, allowing the retailer to identify which products respond strongly to deeper discounts and which do not.
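A hedged sketch of this use case, assuming a hypothetical dataset with discount_rate, units_sold, and product_category columns (these names are illustrative and not part of the routine):

```python
# Sketch of the discount use case: effect of moving the discount from T0=0.10 to T1=0.20.
import pandas as pd
from econml.dml import CausalForestDML

df = pd.read_parquet("product_weeks.parquet")              # hypothetical dataset
X = pd.get_dummies(df["product_category"])                 # heterogeneity by product category
T = df["discount_rate"]                                    # continuous treatment, e.g. 0.10, 0.20
Y = df["units_sold"]

est = CausalForestDML(random_state=0)                      # continuous treatment (default)
est.fit(Y, T, X=X)

# Estimated change in units sold when the discount moves from 10% to 20%,
# evaluated for each row in X.
df["uplift_10_to_20"] = est.effect(X, T0=0.10, T1=0.20)
by_category = df.groupby("product_category")["uplift_10_to_20"].mean()
```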

Routine Methods

1. Causal Forest DML Estimator (Method)
  • Method: causal_forest_dml_estimator
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: Yes

    • Method Limits: This method was tested on a dataset with 100K rows, 10 numerical columns, and 3 categorical columns; it completed in about 28 minutes with 100 GB of memory.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Estimate average and conditional average treatment effects (ATE and CATE) using a causal forest with double machine learning.
    • Detailed Description:

      • This method fits an EconML CausalForestDML estimator to a tabular dataset. Double machine learning models first predict the outcome and the treatment from the supplied features and controls; a causal forest is then trained on the residuals to learn how treatment effects vary across the feature space. The method produces the average treatment effect (ATE), per-row conditional average treatment effects (CATEs), HTML reports with confidence intervals for both, and a web dashboard for exploring the results.
    • Inputs:

      • Required Input
        • Data Connection: Connection to the dataset.
          • Name: data_connection
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name you are looking for in this list, first move the data into a database that is available in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Treatment Column: The column(s) to be used as treatment variables.
          • Name: treatment_columns
          • Long Description: The variable(s) whose causal effect on the outcome you want to estimate.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Treatment Types: The data type(s) of each respective treatment variable.
          • Name: treatment_types
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Initial Treatment (T0): The initial value of treatment applied to the outcome. This will be used to model the change in treatment from T0 to T1.
          • Name: treatment_initial
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • New Treatment (T1): The new value of treatment applied to the outcome. This will be used to model the change in treatment from T0 to T1.
          • Name: treatment_new
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Outcomes Column: The column(s) to be used as outcome.
          • Name: outcome_columns
          • Long Description: The dependent variable or target; the result affected by the treatment.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Outcome Types: The data type(s) of each respective outcome variable.
          • Name: outcome_types
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Features Column: The column(s) to be used as features (affect both T and Y).
          • Name: feature_columns
          • Long Description: The features used to model heterogeneity in treatment effects. These are the variables that determine how the treatment effect varies across individuals. These are the dimensions along which we learn conditional average treatment effects (CATE).
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Feature Types: The data type(s) of each respective feature variable.
          • Name: feature_types
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Control Column: The column(s) to be used as controls (variables with a direct effect on both the treatment decision and the observed outcome). Controls are optional and not required to be input by the user.
          • Name: constant_columns
          • Long Description: Observed covariates that are confounders, variables affecting treatment and outcome, used to control for selection bias in the estimation of causal effects.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Model Y (Outcome Model): The machine learning model used to predict the outcome variable from features and confounders.
          • Name: model_y
          • Tooltip:
            • Detail:
              • Model for predicting the outcome variable from covariates. This removes the predictable part of the outcome that's explained by covariates, isolating the treatment effect. Options: 'auto' (fast, balanced), 'automl' (extensive search, very slow), 'linear models' (fast for continuous, very slow for categorical), 'linear models with polynomial features' (captures interactions), 'random forest' (robust all-around), 'gradient boosted forest' (high accuracy), 'neural net' (complex patterns). Effect: More complex models → better outcome prediction but longer training and overfitting risk. Poor outcome models → biased causal results. A hedged sketch of how these model and forest inputs may map onto the underlying estimator appears after the Artifacts list below.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Model T (Treatment Model): The machine learning model used to predict treatment assignment from features and confounders.
          • Name: model_t
          • Tooltip:
            • Detail:
              • Model for predicting who receives treatment based on covariates. This estimates propensity scores to account for selection bias in treatment assignment. Options: 'auto' (fast, balanced), 'automl' (extensive search, very slow), 'linear models' (fast for continuous, very slow for categorical), 'linear models with polynomial features' (captures interactions), 'random forest' (robust all-around), 'gradient boosted forest' (high accuracy), 'neural net' (complex patterns). Effect: More complex models → better treatment prediction but longer training and overfitting risk. Poor treatment models → biased causal results.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Maximum Tree Depth: Controls the maximum depth of individual trees in the causal forest.
          • Name: max_depth
          • Tooltip:
            • Detail:
              • Maximum depth each tree can grow. Range: 3-15 (default: 10). Increasing → captures complex interactions but risks overfitting, especially with small data. Decreasing → more stable, generalizable results but may miss important patterns. Adjust based on sample size and complexity needs.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Minimum Samples per Leaf: The minimum number of samples required at each leaf node of the trees.
          • Name: min_samples_leaf
          • Tooltip:
            • Detail:
              • Minimum samples required at each tree leaf. Range: 5-50 (small data: 20-50, large data: 5-20, default: 10). Increasing → smoother, more stable estimates but less granular effects. Decreasing → detects finer heterogeneity but higher overfitting risk. Scale with your dataset size.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Number of Trees: The total number of trees to build in the causal forest ensemble.
          • Name: n_estimators
          • Tooltip:
            • Detail:
              • Number of trees in the forest ensemble. Must be a multiple of 4. Range: 52-1000+ (quick: 52-100, standard: 100-500, precise: 500-1000+, default: 100). Increasing → better performance and stability but longer training with diminishing returns. Decreasing → faster training but less stable estimates. Balance based on accuracy vs. speed needs.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Cross-Validation Folds: Number of folds used for cross-validation during model selection and hyperparameter tuning.
          • Name: cv
          • Tooltip:
            • Detail:
              • Number of cross-validation folds for model selection. Range: 3-10 (small data <500: 3-fold, medium 500-5000: 5-fold, large >5000: 5-10 fold, default: 5). Increasing → more reliable model selection but significantly longer training. Decreasing → faster training but less robust hyperparameter choices. Choose based on dataset size and time constraints.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: int
    • Artifacts:

      • ATE Confidence Intervals: An HTML report presenting the graphs and data for the Average Treatment Effect (ATE) for the Causal Forest routine.

        • Qualified Key Annotation: cfd_report_ate
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cfd_report_ate/data_/html_content.html
            • The html content.
      • CATE Confidence Intervals: An HTML report presenting the graphs and data for the Conditional Average Treatment Effect (CATE) for the Causal Forest routine.

        • Qualified Key Annotation: cfd_report_cate
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cfd_report_cate/data_/html_content.html
            • The html content.
      • ATE Data: A dataframe containing Average Treatment Effect (ATE) results for all treatment-outcome combinations.

        • Qualified Key Annotation: cfd_ate_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cfd_ate_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • CATE Data: A dataframe containing all Conditional Average Treatment Effect (CATE) results for all treatments, outcomes, and covariates.

        • Qualified Key Annotation: cfd_cate_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cfd_cate_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
      • CA Web Dashboard: A web dashboard with outcome, treatment, and covariate graph outputs.

        • Qualified Key Annotation: cfd_web_app
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cfd_web_app/data_/data.appref
            • A JSON file of data relating to the web app.
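
For orientation, the sketch below assumes the routine's model and forest inputs map onto the corresponding CausalForestDML constructor arguments. The actual wiring inside the routine, including how the 'auto' and named model options are translated, is not documented here, so treat the mapping as an assumption.

```python
# Hedged sketch, assuming the routine's tunable inputs mirror CausalForestDML arguments.
# The MODEL_CHOICES translation of the named model options is an assumption, not the
# routine's documented internals.
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor

MODEL_CHOICES = {
    "auto": "auto",                             # let EconML pick a default nuisance model
    "random forest": RandomForestRegressor(),   # illustrative translation of one option
}

def build_estimator(model_y="auto", model_t="auto",
                    max_depth=10, min_samples_leaf=10,
                    n_estimators=100, cv=5):
    """Assemble a CausalForestDML mirroring the routine's tunable inputs."""
    return CausalForestDML(
        model_y=MODEL_CHOICES.get(model_y, "auto"),  # Model Y (Outcome Model) input
        model_t=MODEL_CHOICES.get(model_t, "auto"),  # Model T (Treatment Model) input
        max_depth=max_depth,                         # Maximum Tree Depth input
        min_samples_leaf=min_samples_leaf,           # Minimum Samples per Leaf input
        n_estimators=n_estimators,                   # Number of Trees input
        cv=cv,                                       # Cross-Validation Folds input
        random_state=0,
    )
```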

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: CausalAnalysisRoutine

Method Name: causal_forest_dml_estimator
Artifact Keys: cfd_report_ate, cfd_report_cate, cfd_ate_data, cfd_cate_data, cfd_web_app
