CausalAnalysisRoutine
Versions
v0.1.0
Basic Information
Class Name: CausalAnalysisRoutine
Title: Causal Forest Estimator
Version: 0.1.0
Author: Danny Vega, Sam Hastie, Eric Zhang
Organization: OneStream
Creation Date: 2025-06-27
Default Routine Memory Capacity: 2.0 GB
Tags
Data Analysis, Data Visualization, Metrics
Description
Short Description
A causal estimator in EconML for revealing how treatment effects vary across individuals.
Long Description
Causal inference tools go beyond correlation to uncover true cause-and-effect relationships in data. CausalForestDML uses double machine learning (DML) to separate the effect of the treatment from other factors, and builds a forest of decision trees that learns how the treatment effects differ across subgroups. This helps target interventions more effectively by revealing which subgroups are impacted most and why.
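To make the double machine learning idea concrete, the sketch below shows the partialling-out step on simulated data: the confounder's influence is regressed out of both the outcome and the treatment, and the effect is estimated from the residuals. It is a simplified illustration rather than the routine's implementation. It assumes plain linear fits with no cross-fitting, a single confounder, and a constant effect, and every name and number in it is hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Part of y that a linear fit on x cannot explain."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Simulated data (hypothetical): w confounds both the treatment t and the outcome y.
random.seed(0)
n = 5000
true_effect = 2.0
w = [random.gauss(0, 1) for _ in range(n)]
t = [0.8 * wi + random.gauss(0, 1) for wi in w]
y = [true_effect * ti + 3.0 * wi + random.gauss(0, 1) for ti, wi in zip(t, w)]

# Naive approach: regress the outcome on the treatment and ignore the confounder.
_, naive_estimate = fit_line(t, y)

# Partialling out: remove what w explains from both y and t,
# then regress the outcome residuals on the treatment residuals.
y_res = residuals(w, y)
t_res = residuals(w, t)
dml_estimate = sum(tr * yr for tr, yr in zip(t_res, y_res)) / sum(
    tr * tr for tr in t_res
)

print(f"naive estimate:           {naive_estimate:.2f}")
print(f"partialled-out estimate:  {dml_estimate:.2f}")
print(f"true effect (simulated):  {true_effect:.2f}")
```

The naive estimate absorbs part of the confounder's influence, while the residual-on-residual estimate lands close to the simulated effect. A causal forest replaces the single final slope with a forest that lets the effect vary with the features.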
Use Cases
1. Causal Relationship Analysis
Businesses often want to know not just whether something works but where, when, and for whom it works best. A policy, promotion, or event might boost outcomes in some places while having little or no impact elsewhere. Consider a retailer analyzing the impact of holidays on weekly sales. The treatment is whether a week includes a holiday, the outcome is weekly sales, and the store is the covariate, so the user can see the causal relationship between holidays and weekly sales and how it differs across a network of stores. CausalForestDML lets the user estimate conditional average treatment effects (CATEs), here the impact on each individual store, as well as the average treatment effect (ATE), the average impact of a holiday week on sales across all stores.
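The relationship between the two quantities in the holiday example can be sketched in a few lines: the ATE is the average of the per-store CATEs. The store names and effect sizes below are hypothetical placeholders, not results from the routine, and the sketch assumes every store carries equal weight.

```python
from statistics import mean

# Hypothetical CATE estimates: change in weekly sales attributable to a holiday week.
cate_by_store = {
    "store_a": 1200.0,
    "store_b": 150.0,
    "store_c": -50.0,
    "store_d": 700.0,
}

# ATE: the average effect across all stores (equal weights assumed here).
ate = mean(cate_by_store.values())

# Rank stores from most to least affected to see where holidays matter most.
ranked = sorted(cate_by_store.items(), key=lambda item: item[1], reverse=True)

print(f"ATE across stores: {ate:.1f}")
for store, effect in ranked:
    print(f"{store}: {effect:+.1f}")
```

An ATE of 500 would hide the fact that one store gains far more than that and another gains nothing, which is the information the CATEs add.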
2. Change in Treatment Value
In many business scenarios, interventions like discounts, promotions, or feature rollouts can affect different customer or product segments in varying ways. Understanding these heterogeneous treatment effects is essential for effective targeting and resource allocation. CausalForestDML can be used to model the impact of a change in treatment level. A user may seek to understand how changing a discount level from 10% (initial treatment value, T0) to 20% (new treatment value, T1) affects the number of units sold across a range of product categories. The model will estimate the causal effect of increasing the discount, allowing the retailer to identify which products respond strongly to deeper discounts and which do not.
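As a rough sketch of how a change from T0 to T1 turns into an effect, assume the outcome responds linearly to the treatment level, which is the structural assumption DML-style estimators typically make. The effect of the change is then the per-unit effect for a segment multiplied by (T1 - T0). The categories and per-unit effects below are hypothetical.

```python
def effect_of_change(per_unit_effect, t0, t1):
    """Effect of moving the treatment from t0 to t1, assuming linearity in the treatment."""
    return per_unit_effect * (t1 - t0)


# Hypothetical per-unit effects: extra units sold per percentage point of discount.
per_unit_effect_by_category = {
    "electronics": 4.0,
    "apparel": 2.5,
    "grocery": 0.5,
}

t0, t1 = 10.0, 20.0  # discount moves from 10% to 20%

effects = {
    category: effect_of_change(theta, t0, t1)
    for category, theta in per_unit_effect_by_category.items()
}

for category, extra_units in effects.items():
    print(f"{category}: {extra_units:+.1f} units for a {t0:.0f}% -> {t1:.0f}% discount")
```

Under these placeholder values, electronics responds strongly to the deeper discount and grocery barely moves, which is the kind of contrast the routine's CATE output is meant to surface.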
Routine Methods
1. Causal Forest Dml Estimator (Method)
- Method: causal_forest_dml_estimator
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: In testing, a dataset with 100K rows, 10 numerical columns, and 3 categorical columns completed in about 28 minutes with 100 GB of memory.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Estimate average and conditional average treatment effects with CausalForestDML and deliver them as HTML reports, data tables, and a web dashboard.
- Detailed Description:
  - This method fits a CausalForestDML model to a tabular dataset using the selected treatment, outcome, feature, and control columns. It produces HTML reports with confidence intervals for the average treatment effect (ATE) and the conditional average treatment effects (CATEs), the underlying ATE and CATE data, and a web dashboard for exploring the results.
- Inputs:
  - Required Input
    - Data Connection: Connection to the dataset.
      - Name: data_connection
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        - Required Input
          - Connection: The connection type to use to access the source data.
            - Name: tabular_connection
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Must be one of the following
              - SQL Server Connection
                - Required Input
                  - Database Resource: The name of the database resource to connect to.
                    - Name: database_resource
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Database Name: The name of the database to connect to.
                    - Name: database_name
                    - Tooltip:
                      - Detail:
                        - Note: If the database you need does not appear in this list, first move the data into a database that does.
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Table Name: The name of the table to use.
                    - Name: table_name
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
              - MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Path: The full file path to the file to ingest.
                    - Name: file_path
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
              - Partitioned MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Type: The type of files to read from the directory.
                    - Name: file_type
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: FileExtensions_
                  - Directory Path: The full directory path containing partitioned tabular files.
                    - Name: directory_path
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
    - Treatment Column: The column(s) to be used as treatment variables.
      - Name: treatment_columns
      - Long Description: The variable(s) whose causal effect on the outcome you want to estimate.
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Treatment Types: The data type(s) of each respective treatment variable.
      - Name: treatment_types
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Initial Treatment (T0): The initial value of treatment applied to the outcome. This will be used to model the change in treatment from T0 to T1.
      - Name: treatment_initial
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - New Treatment (T1): The new value of treatment applied to the outcome. This will be used to model the change in treatment from T0 to T1.
      - Name: treatment_new
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Outcomes Column: The column(s) to be used as outcome.
      - Name: outcome_columns
      - Long Description: The dependent variable or target; the result affected by the treatment.
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Outcome Types: The data type(s) of each respective outcome variable.
      - Name: outcome_types
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Features Column: The column(s) to be used as features (affect both T and Y).
      - Name: feature_columns
      - Long Description: The features used to model heterogeneity in treatment effects. These variables determine how the treatment effect varies across individuals and are the dimensions along which conditional average treatment effects (CATEs) are learned.
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Feature Types: The data type(s) of each respective feature variable.
      - Name: feature_types
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Control Column: The column(s) to be used as constants (variables that directly affect both the treatment decision and the observed outcome). Constants are optional.
      - Name: constant_columns
      - Long Description: Observed confounders (covariates that affect both treatment and outcome), used to control for selection bias when estimating causal effects.
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Model Y (Outcome Model): The machine learning model used to predict the outcome variable from features and confounders.
      - Name: model_y
      - Tooltip:
        - Detail:
          - Model for predicting the outcome variable from covariates. This removes the predictable part of the outcome that's explained by covariates, isolating the treatment effect. Options: 'auto' (fast, balanced), 'automl' (extensive search, very slow), 'linear models' (fast for continuous, very slow for categorical), 'linear models with polynomial features' (captures interactions), 'random forest' (robust all-around), 'gradient boosted forest' (high accuracy), 'neural net' (complex patterns). Effect: More complex models → better outcome prediction but longer training and overfitting risk. Poor outcome models → biased causal results.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: str
    - Model T (Treatment Model): The machine learning model used to predict treatment assignment from features and confounders.
      - Name: model_t
      - Tooltip:
        - Detail:
          - Model for predicting who receives treatment based on covariates. This estimates propensity scores to account for selection bias in treatment assignment. Options: 'auto' (fast, balanced), 'automl' (extensive search, very slow), 'linear models' (fast for continuous, very slow for categorical), 'linear models with polynomial features' (captures interactions), 'random forest' (robust all-around), 'gradient boosted forest' (high accuracy), 'neural net' (complex patterns). Effect: More complex models → better treatment prediction but longer training and overfitting risk. Poor treatment models → biased causal results.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: str
    - Maximum Tree Depth: Controls the maximum depth of individual trees in the causal forest.
      - Name: max_depth
      - Tooltip:
        - Detail:
          - Maximum depth each tree can grow. Range: 3-15 (default: 10). Increasing → captures complex interactions but risks overfitting, especially with small data. Decreasing → more stable, generalizable results but may miss important patterns. Adjust based on sample size and complexity needs.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: int
    - Minimum Samples per Leaf: The minimum number of samples required at each leaf node of the trees.
      - Name: min_samples_leaf
      - Tooltip:
        - Detail:
          - Minimum samples required at each tree leaf. Range: 5-50 (small data: 20-50, large data: 5-20, default: 10). Increasing → smoother, more stable estimates but less granular effects. Decreasing → detects finer heterogeneity but higher overfitting risk. Scale with your dataset size.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: int
    - Number of Trees: The total number of trees to build in the causal forest ensemble.
      - Name: n_estimators
      - Tooltip:
        - Detail:
          - Number of trees in the forest ensemble. Must be a multiple of 4. Range: 52-1000+ (quick: 52-100, standard: 100-500, precise: 500-1000+, default: 100). Increasing → better performance and stability but longer training with diminishing returns. Decreasing → faster training but less stable estimates. Balance based on accuracy vs. speed needs.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: int
    - Cross-Validation Folds: Number of folds used for cross-validation during model selection and hyperparameter tuning.
      - Name: cv
      - Tooltip:
        - Detail:
          - Number of cross-validation folds for model selection. Range: 3-10 (small data <500: 3-fold, medium 500-5000: 5-fold, large >5000: 5-10 fold, default: 5). Increasing → more reliable model selection but significantly longer training. Decreasing → faster training but less robust hyperparameter choices. Choose based on dataset size and time constraints.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: int
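The constraints stated in the input descriptions can be checked before a run with a small helper like the one below. This is a hypothetical pre-flight sketch, not part of the routine: the function name, the dictionary layout, and the example type strings are assumptions, the ranges are the suggested ones from the tooltips rather than confirmed hard limits, and the routine may apply other validation at runtime.

```python
# Suggested (low, high, default) values quoted from the tooltips in this section.
SUGGESTED_RANGES = {
    "max_depth": (3, 15, 10),
    "min_samples_leaf": (5, 50, 10),
    "cv": (3, 10, 5),
}

# Each *_types list describes the matching *_columns list, position by position.
PAIRED_LISTS = [
    ("treatment_columns", "treatment_types"),
    ("outcome_columns", "outcome_types"),
    ("feature_columns", "feature_types"),
]


def check_inputs(params):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for columns_key, types_key in PAIRED_LISTS:
        if len(params.get(columns_key, [])) != len(params.get(types_key, [])):
            problems.append(f"{columns_key} and {types_key} must have the same length")
    n_estimators = params.get("n_estimators", 100)
    if n_estimators % 4 != 0:
        problems.append(f"n_estimators={n_estimators} is not a multiple of 4")
    for name, (low, high, default) in SUGGESTED_RANGES.items():
        value = params.get(name, default)
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside the suggested range {low}-{high}")
    return problems


# Example values for the discount scenario. The column names and the "numeric"
# type string are placeholders; accepted type strings are not listed here.
example_params = {
    "treatment_columns": ["discount_pct"],
    "treatment_types": ["numeric"],
    "treatment_initial": ["10"],
    "treatment_new": ["20"],
    "outcome_columns": ["units_sold"],
    "outcome_types": ["numeric"],
    "feature_columns": ["product_category"],
    "feature_types": ["categorical"],
    "model_y": "auto",
    "model_t": "auto",
    "max_depth": 10,
    "min_samples_leaf": 10,
    "n_estimators": 100,
    "cv": 5,
}

print(check_inputs(example_params))
print(check_inputs(dict(example_params, n_estimators=50, max_depth=30)))
```

Returning a list of messages instead of raising on the first failure lets a caller see every issue at once before starting a long run.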
- Artifacts:
  - ATE Confidence Intervals: An HTML report presenting the graphs and data for the Average Treatment Effect (ATE) for the Causal Forest routine.
    - Qualified Key Annotation: cfd_report_ate
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cfd_report_ate/data_/html_content.html
      - The HTML content.
  - CATE Confidence Intervals: An HTML report presenting the graphs and data for the Conditional Average Treatment Effect (CATE) for the Causal Forest routine.
    - Qualified Key Annotation: cfd_report_cate
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cfd_report_cate/data_/html_content.html
      - The HTML content.
  - ATE Data: A dataframe containing Average Treatment Effect (ATE) results for all treatment-outcome combinations.
    - Qualified Key Annotation: cfd_ate_data
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cfd_ate_data/data_/data_<int>.parquet
      - A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - CATE Data: A dataframe containing all Conditional Average Treatment Effect (CATE) results for all treatments, outcomes, and covariates.
    - Qualified Key Annotation: cfd_cate_data
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cfd_cate_data/data_/data_<int>.parquet
      - A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - CA Web Dashboard: A web dashboard with graphs of outcomes, treatments, and covariates.
    - Qualified Key Annotation: cfd_web_app
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations: artifacts_/@cfd_web_app/data_/data.appref
      - A JSON file containing data for the web app.
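The file annotations above follow one path pattern, `artifacts_/@<key>/data_/<file>`, and the data artifacts are split into `data_<int>.parquet` partitions. The sketch below builds such paths and orders partition files numerically. The helper names are hypothetical, and it assumes the `<int>` part is a plain non-negative integer; how the platform numbers or exposes the files is not specified here.

```python
import re
from pathlib import PurePosixPath

PARTITION_PATTERN = re.compile(r"data_(\d+)\.parquet")


def artifact_file(key, filename):
    """Build the documented relative path for a file inside an artifact."""
    return PurePosixPath("artifacts_") / f"@{key}" / "data_" / filename


def partition_index(path):
    """Return the integer in data_<int>.parquet, or None for any other file name."""
    match = PARTITION_PATTERN.fullmatch(PurePosixPath(path).name)
    return int(match.group(1)) if match else None


def sort_partitions(paths):
    """Keep only partition files and order them by their integer index."""
    indexed = [(partition_index(p), str(p)) for p in paths]
    return [p for index, p in sorted(item for item in indexed if item[0] is not None)]


report_path = artifact_file("cfd_report_ate", "html_content.html")
print(report_path)

# Hypothetical listing of the CATE data artifact directory.
listing = [
    artifact_file("cfd_cate_data", "data_10.parquet"),
    artifact_file("cfd_cate_data", "data_2.parquet"),
    artifact_file("cfd_cate_data", "notes.txt"),
]
print(sort_partitions(listing))
```

Sorting by the parsed integer rather than by string keeps `data_2` ahead of `data_10`, which matters when partitions are concatenated in order.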
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: CausalAnalysisRoutine
| Method Name | Artifact Keys |
|---|---|
| causal_forest_dml_estimator | cfd_report_ate, cfd_report_cate, cfd_ate_data, cfd_cate_data, cfd_web_app |