XGBoostRegressionRoutine
Versions
v1.0.0
Basic Information
Class Name: XGBoostRegressionRoutine
Title: XGBoost Regression
Version: 1.0.0
Author: Drew Shea
Organization: OneStream
Creation Date: 2021-09-09
Default Routine Memory Capacity: 2.0 GB
Tags
Model, Regression, Supervised, ML
Description
Short Description
Predicts values using an XGBoost Regression model.
Long Description
This routine mimics the functionality of an XGBoost regression model. First, the constructor sets the hyperparameters. Next, the fit method receives the training data and fits the regression model so that it can predict on new data. Finally, the predict method applies the fitted model to new data and returns a dataframe with two columns: the dates and the predicted values for those dates. The routine predicts only one target value, so it may not be the best choice if the data has dimensionality.
Use Cases
1. Predicting Housing Prices
Suppose the goal is to predict the price of a house from time-based features such as current interest and inflation rates, the number of houses sold in the area recently, and the current average income of the area. Given past values of those features alongside the price of the house on the corresponding dates, XGBoost can be used as a regression model for the price. The features and past prices are used to fit the model; once the model is fitted, it can predict the price of the house on new data containing the same features.
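As a rough illustration of how the housing use case could map onto the fit inputs, the sketch below arranges the documented input names into a nested Python dictionary. Only the key names (`data_connection`, `tabular_connection`, `database_resource`, `database_name`, `table_name`, `dimension_columns`, `date_column`, `value_column`) come from this page. Every value is a hypothetical placeholder, and representing the inputs as a plain dictionary is itself an assumption rather than the routine's actual calling convention.

```python
# Hypothetical arrangement of the fit inputs for the housing-price use case.
# Key names mirror the documented input names; all values are made up.
source_data_definition = {
    "data_connection": {
        "tabular_connection": {
            "database_resource": "example_resource",  # hypothetical
            "database_name": "example_db",            # hypothetical
            "table_name": "house_prices",             # hypothetical
        }
    },
    # One model is trained per unique combination of these columns.
    "dimension_columns": ["house_id"],  # hypothetical column name
    "date_column": "sale_date",         # hypothetical column name
    "value_column": "price",            # hypothetical numeric target column
}

# The documented constraint: at least one dimension column is required.
has_dimension = len(source_data_definition["dimension_columns"]) >= 1
```

Features such as interest rates or average income would be supplied separately through the feature data definitions, whose structure is not described on this page.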
Routine Methods
1. Init (Constructor)
- Method: __init__
- Type: Constructor
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: N/A
- Outputs Dynamic Artifacts: No
- Short Description:
  - Constructor for the XGBoostRegressionRoutine.
- Detailed Description:
  - Constructs the XGBoostRegressionRoutine with the input parameters given by the user.
- Inputs:
  - Required Input
    - Number of Estimators: The number of estimators to use in the XGBoost model.
      - Name: num_estimators
      - Tooltip:
        - Detail:
          - Recommended to be between 50 and 500.
        - Validation Constraints:
          - The input must be greater than 0.
          - The input must be less than or equal to 5000.
          - This input may be subject to other validation constraints at runtime.
      - Type: int
    - Learning Rate: The learning rate of the XGBoost model. Must be between 0 and 0.4.
      - Name: learning_rate
      - Tooltip:
        - Validation Constraints:
          - The input must be greater than or equal to 0.
          - The input must be less than or equal to 0.4.
          - This input may be subject to other validation constraints at runtime.
      - Type: float
    - Gamma: The gamma of the XGBoost model. Higher gamma values yield more conservative algorithms.
      - Name: gamma
      - Tooltip:
        - Detail:
          - Recommended to be between 0 and 5.
        - Validation Constraints:
          - The input must be greater than or equal to 0.
          - The input must be less than or equal to 10.
          - This input may be subject to other validation constraints at runtime.
      - Type: int
    - Max Depth: The maximum depth of the XGBoost model. Increasing this value will make the model more likely to overfit.
      - Name: max_depth
      - Tooltip:
        - Detail:
          - Recommended to be between 1 and 5.
        - Validation Constraints:
          - The input must be greater than or equal to 0.
          - The input must be less than or equal to 10.
          - This input may be subject to other validation constraints at runtime.
      - Type: int
- Artifacts: No artifacts are returned by this method
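The validation constraints listed for the constructor inputs can be checked up front, before a routine is constructed. The following is a minimal sketch and not the routine's own implementation: the class `XGBoostHyperparameters` and its method `to_xgboost_kwargs` are hypothetical names introduced here, and the mapping of `num_estimators` onto XGBoost's `n_estimators` argument is an assumption about how the routine forwards its inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XGBoostHyperparameters:
    """Hypothetical container mirroring the four documented constructor inputs."""

    num_estimators: int
    learning_rate: float
    gamma: int
    max_depth: int

    def __post_init__(self) -> None:
        # Bounds copied from the documented validation constraints.
        if not 0 < self.num_estimators <= 5000:
            raise ValueError("num_estimators must be > 0 and <= 5000")
        if not 0 <= self.learning_rate <= 0.4:
            raise ValueError("learning_rate must be >= 0 and <= 0.4")
        if not 0 <= self.gamma <= 10:
            raise ValueError("gamma must be >= 0 and <= 10")
        if not 0 <= self.max_depth <= 10:
            raise ValueError("max_depth must be >= 0 and <= 10")

    def to_xgboost_kwargs(self) -> dict:
        # Assumption: these are the keyword names an XGBoost regressor
        # would receive; the routine's real pass-through is not documented.
        return {
            "n_estimators": self.num_estimators,
            "learning_rate": self.learning_rate,
            "gamma": self.gamma,
            "max_depth": self.max_depth,
        }


# Values chosen inside the documented recommended ranges.
params = XGBoostHyperparameters(
    num_estimators=200, learning_rate=0.1, gamma=1, max_depth=4
)
kwargs = params.to_xgboost_kwargs()
```

The routine may apply further checks at runtime, as each input notes, so passing these bounds is necessary but possibly not sufficient.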
2. Fit (Method)
- Method: fit
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method has been tested with a dataset with 10k targets and 1.1M rows, and completed in 13 minutes with 5 GB of memory. Additionally, a dataset with 40k targets and 29.2M rows completed in 38 minutes with 20 GB of memory. Training was completed over 3 months of daily data.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Trains XGBoost regression models on the provided data.
- Detailed Description:
  - This method trains one model per unique group based on the given dimension columns. It adds derived date features (day, month, year) and handles both numeric and categorical inputs. Validation ensures that models are trained only when meaningful features are available.
- Inputs:
  - Required Input
    - Source Data Definition: The source data definition to use.
      - Name: source_data_definition
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Time Series Source Data
      - Nested Model: Time Series Source Data
        - Required Input
          - Connection: The connection to the source data.
            - Name: data_connection
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Must be an instance of Tabular Connection
            - Nested Model: Tabular Connection
              - Required Input
                - Connection: The connection type to use to access the source data.
                  - Name: tabular_connection
                  - Tooltip:
                    - Validation Constraints:
                      - This input may be subject to other validation constraints at runtime.
                  - Type: Must be one of the following
                    - SQL Server Connection
                      - Required Input
                        - Database Resource: The name of the database resource to connect to.
                          - Name: database_resource
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: str
                        - Database Name: The name of the database to connect to.
                          - Name: database_name
                          - Tooltip:
                            - Detail:
                              - Note: If the database name you are looking for is not in this list, it is recommended that you first move the data into a database that is available in this list.
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: str
                        - Table Name: The name of the table to use.
                          - Name: table_name
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: str
                    - MetaFileSystem Connection
                      - Required Input
                        - Connection Key: The MetaFileSystem connection key.
                          - Name: connection_key
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: MetaFileSystemConnectionKey
                        - File Path: The full file path to the file to ingest.
                          - Name: file_path
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: str
                    - Partitioned MetaFileSystem Connection
                      - Required Input
                        - Connection Key: The MetaFileSystem connection key.
                          - Name: connection_key
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: MetaFileSystemConnectionKey
                        - File Type: The type of files to read from the directory.
                          - Name: file_type
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: FileExtensions_
                        - Directory Path: The full directory path containing partitioned tabular files.
                          - Name: directory_path
                          - Tooltip:
                            - Validation Constraints:
                              - This input may be subject to other validation constraints at runtime.
                          - Type: str
          - Dimension Columns: The columns to use as dimensions.
            - Name: dimension_columns
            - Tooltip:
              - Validation Constraints:
                - The input must have a minimum length of 1.
                - This input may be subject to other validation constraints at runtime.
            - Type: list[str]
          - Date Column: The column to use as the date.
            - Name: date_column
            - Tooltip:
              - Detail:
                - The date column must be in a DateTime-readable format.
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: str
          - Value Column: The column to use as the value.
            - Name: value_column
            - Tooltip:
              - Detail:
                - The value column must be a numeric (int, float, double, decimal, etc.) column.
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: str
    - Feature Data Definition: The feature data definition to use.
      - Name: feature_data_definitions
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[TimeSeriesTableDefinition]
  - Optional Input
    - Date Range: The date range to fit the model on.
      - Name: time_range
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Start and End Date
      - Nested Model: Start and End Date
        - Required Input
          - Start Date: The inclusive start of the date range (MM/DD/YYYY).
            - Name: start_date
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: datetime
          - End Date: The inclusive end of the date range (MM/DD/YYYY).
            - Name: end_date
            - Tooltip:
              - Detail:
                - Note, the Seasonal ARIMA Anomaly Detector Routine treats the end date as exclusive.
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: datetime
- Artifacts: No artifacts are returned by this method
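The preparation steps described for fit (restricting rows to an inclusive date range, deriving day, month, and year features, and splitting the data into one group per unique combination of dimension values) can be sketched with the standard library alone. This is an illustrative assumption about the preprocessing, not the routine's code: the function `prepare_groups`, the sample rows, and the column names `store`, `date`, and `value` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def prepare_groups(rows, dimension_columns, date_column,
                   start_date=None, end_date=None):
    """Filter to an inclusive date range, add date features, group by dimensions."""
    groups = defaultdict(list)
    for row in rows:
        date = row[date_column]
        # Both bounds are inclusive, as stated for the Date Range input.
        if start_date is not None and date < start_date:
            continue
        if end_date is not None and date > end_date:
            continue
        # Derived date features named in the method description.
        enriched = dict(row, day=date.day, month=date.month, year=date.year)
        # One group (and therefore one model) per unique dimension combination.
        key = tuple(row[column] for column in dimension_columns)
        groups[key].append(enriched)
    return dict(groups)


# Hypothetical sample data with a single dimension column.
rows = [
    {"store": "A", "date": datetime(2024, 1, 31), "value": 10.0},
    {"store": "A", "date": datetime(2024, 2, 1), "value": 12.0},
    {"store": "B", "date": datetime(2024, 2, 1), "value": 7.5},
    {"store": "B", "date": datetime(2024, 3, 1), "value": 8.0},
]
groups = prepare_groups(
    rows, ["store"], "date",
    start_date=datetime(2024, 1, 31), end_date=datetime(2024, 2, 1),
)
```

In this sample, the row dated 2024-03-01 falls outside the range and is dropped, while rows dated exactly on the start and end dates are kept. A regression model would then be fitted separately on each group's rows.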
3. Predict (Method)
- Method: predict
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method has been tested with a dataset with 10k targets and 1.1M rows, and completed in 5 minutes with 5 GB of memory. Additionally, a dataset with 40k targets and 29.2M rows completed in 53 minutes with 20 GB of memory. Prediction was completed for one month of daily data.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Makes predictions using the trained XGBoost regression model.
- Detailed Description:
  - This method takes new input data and produces predicted values based on the model previously fit using the fit() method. The input data must match the format used during training, including the same dimensions and time-based features. The method returns a DataFrame containing the date and prediction columns.
- Inputs:
  - Required Input
    - Source Connection: The connection type to use to access the source data.
      - Name: data_connection
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        - Required Input
          - Connection: The connection type to use to access the source data.
            - Name: tabular_connection
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Must be one of the following
              - SQL Server Connection
                - Required Input
                  - Database Resource: The name of the database resource to connect to.
                    - Name: database_resource
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Database Name: The name of the database to connect to.
                    - Name: database_name
                    - Tooltip:
                      - Detail:
                        - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Table Name: The name of the table to use.
                    - Name: table_name
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
              - MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Path: The full file path to the file to ingest.
                    - Name: file_path
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
              - Partitioned MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Type: The type of files to read from the directory.
                    - Name: file_type
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: FileExtensions_
                  - Directory Path: The full directory path containing partitioned tabular files.
                    - Name: directory_path
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
    - Feature Columns: The columns to use as dimensions.
      - Name: dimension_columns
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Date Column: The column to use as the date.
      - Name: date_column
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: str
    - Feature Data Definition: The feature data definition to use.
      - Name: feature_data_definitions
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: list[TimeSeriesTableDefinition]
  - Optional Input
    - Date Range: The date range to fit the model on.
      - Name: time_range
      - Tooltip:
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Start and End Date
      - Nested Model: Start and End Date
        - Required Input
          - Start Date: The inclusive start of the date range (MM/DD/YYYY).
            - Name: start_date
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: datetime
          - End Date: The inclusive end of the date range (MM/DD/YYYY).
            - Name: end_date
            - Tooltip:
              - Detail:
                - Note, the Seasonal ARIMA Anomaly Detector Routine treats the end date as exclusive.
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: datetime
- Artifacts:
  - Predictions: A dataframe of dates and the predicted values over those dates.
    - Qualified Key Annotation: predictions
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@predictions/data_/data_<int>.parquet
        - A partitioned set of parquet files where each file will have no more than 1000000 rows.
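The predictions artifact is described as a partitioned set of parquet files named data_<int>.parquet, each holding no more than 1000000 rows. The sketch below shows how a row count could be split into such partitions. It is only an illustration: the function `partition_row_ranges` is hypothetical, and starting the file index at 0 is an assumption, since the page does not say where numbering begins.

```python
def partition_row_ranges(total_rows, max_rows_per_file=1_000_000):
    """Return (file_name, start, stop) triples covering total_rows rows.

    Each half-open range [start, stop) holds at most max_rows_per_file rows.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    parts = []
    for index, start in enumerate(range(0, total_rows, max_rows_per_file)):
        stop = min(start + max_rows_per_file, total_rows)
        # Assumption: file numbering starts at 0.
        parts.append((f"data_{index}.parquet", start, stop))
    return parts


# A hypothetical prediction set of 2.5 million rows needs three files.
parts = partition_row_ranges(2_500_000)
```

Under these assumptions, a consumer reading the artifact would need to concatenate every file in the directory to recover the full set of predictions.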
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: XGBoostRegressionRoutine
| Method Name | Artifact Keys |
|---|---|
| `__init__` | N/A |
| `fit` | N/A |
| `predict` | predictions |