XGBoostRegressionRoutine

Versions

v1.0.0

Basic Information

Class Name: XGBoostRegressionRoutine

Title: XGBoost Regression

Version: 1.0.0

Author: Drew Shea

Organization: OneStream

Creation Date: 2021-09-09

Default Routine Memory Capacity: 2.0 GB

Tags

Model, Regression, Supervised, ML

Description

Short Description

Predicts values using an XGBoost Regression model.

Long Description

This routine provides the functionality of an XGBoost regression model. Hyperparameters are specified in the constructor. The fit method then takes the training data and fits the regression model so that it can predict on new data. Finally, the predict method applies the fitted model to new data and returns a dataframe with two columns: the dates and the predicted values for those dates. The routine predicts only one target value, so it may not be the best choice if the data has dimensionality.

Use Cases

1. Predicting Housing Prices

This routine can predict the price of a house from time-based features such as current interest and inflation rates, the number of houses recently sold in the area, and the area's current average income. Given past values of those features alongside the price of the house on the corresponding dates, XGBoost can be used as a regression model to predict the price from the features. The features and past prices are used to fit the model; once the model is fitted, the predict method can estimate the price of the house for new data containing the same features.

Routine Methods

1. Init (Constructor)
  • Method: __init__
    • Type: Constructor

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: N/A

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Constructor for the XGBoostRegressionRoutine.
    • Detailed Description:

      • Constructs the XGBoostRegression Routine with the input parameters given by the user.
    • Inputs:

      • Required Input
        • Number of Estimators: The number of estimators to use in the XGBoost model.
          • Name: num_estimators
          • Tooltip:
            • Detail:
              • Recommended to be between 50 and 500.
            • Validation Constraints:
              • The input must be greater than 0.
              • The input must be less than or equal to 5000.
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Learning Rate: The learning rate of the XGBoost model. Must be between 0 and 0.4.
          • Name: learning_rate
          • Tooltip:
            • Validation Constraints:
              • The input must be greater than or equal to 0.
              • The input must be less than or equal to 0.4.
              • This input may be subject to other validation constraints at runtime.
          • Type: float
        • Gamma: The gamma of the XGBoost model. Higher gamma values yield more conservative algorithms.
          • Name: gamma
          • Tooltip:
            • Detail:
              • Recommended to be between 0 and 5.
            • Validation Constraints:
              • The input must be greater than or equal to 0.
              • The input must be less than or equal to 10.
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Max Depth: The maximum depth of the XGBoost model. Increasing this value will make the model more likely to overfit.
          • Name: max_depth
          • Tooltip:
            • Detail:
              • Recommended to be between 1 and 5.
            • Validation Constraints:
              • The input must be greater than or equal to 0.
              • The input must be less than or equal to 10.
              • This input may be subject to other validation constraints at runtime.
          • Type: int
    • Artifacts: No artifacts are returned by this method
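
The validation constraints listed above can be sketched as a small pre-check. This is only an illustration: the `XGBoostHyperparameters` class is a hypothetical helper, not part of the routine, and the routine may apply further constraints at runtime that are not shown here. Only the four input names and their documented bounds come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XGBoostHyperparameters:
    """Hypothetical container mirroring the four documented constructor inputs."""

    num_estimators: int
    learning_rate: float
    gamma: int
    max_depth: int

    def __post_init__(self) -> None:
        # Bounds copied from the Validation Constraints documented above.
        if not 0 < self.num_estimators <= 5000:
            raise ValueError("num_estimators must be > 0 and <= 5000")
        if not 0 <= self.learning_rate <= 0.4:
            raise ValueError("learning_rate must be >= 0 and <= 0.4")
        if not 0 <= self.gamma <= 10:
            raise ValueError("gamma must be >= 0 and <= 10")
        if not 0 <= self.max_depth <= 10:
            raise ValueError("max_depth must be >= 0 and <= 10")


# Example values chosen inside the recommended ranges (50-500, 0-5, 1-5).
params = XGBoostHyperparameters(
    num_estimators=200, learning_rate=0.1, gamma=1, max_depth=3
)
print(params)
```

Checking the bounds before submitting a job can surface an out-of-range value early, for example a learning rate of 0.5, which exceeds the documented maximum of 0.4.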

2. Fit (Method)
  • Method: fit
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: This method was tested on a dataset with 10k targets and 1.1M rows, completing in 13 minutes with 5 GB of memory. A dataset with 40k targets and 29.2M rows completed in 38 minutes with 20 GB of memory. Training covered 3 months of daily data.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Trains XGBoost regression models on the provided data.
    • Detailed Description:

      • This method trains one model per unique group based on the given dimension columns. It adds derived date features (day, month, year) and handles both numeric and categorical inputs. Robust validation is performed to ensure that models are only trained when meaningful features are available.
    • Inputs:

      • Required Input
        • Source Data Definition: The source data definition to use.
          • Name: source_data_definition
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Time Series Source Data
          • Nested Model: Time Series Source Data
            • Required Input
              • Connection: The connection to the source data.
                • Name: data_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be an instance of Tabular Connection
                • Nested Model: Tabular Connection
                  • Required Input
                    • Connection: The connection type to use to access the source data.
                      • Name: tabular_connection
                      • Tooltip:
                        • Validation Constraints:
                          • This input may be subject to other validation constraints at runtime.
                      • Type: Must be one of the following
                        • SQL Server Connection
                          • Required Input
                            • Database Resource: The name of the database resource to connect to.
                              • Name: database_resource
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Database Name: The name of the database to connect to.
                              • Name: database_name
                              • Tooltip:
                                • Detail:
                                  • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                            • Table Name: The name of the table to use.
                              • Name: table_name
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Path: The full file path to the file to ingest.
                              • Name: file_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
                        • Partitioned MetaFileSystem Connection
                          • Required Input
                            • Connection Key: The MetaFileSystem connection key.
                              • Name: connection_key
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: MetaFileSystemConnectionKey
                            • File Type: The type of files to read from the directory.
                              • Name: file_type
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: FileExtensions_
                            • Directory Path: The full directory path containing partitioned tabular files.
                              • Name: directory_path
                              • Tooltip:
                                • Validation Constraints:
                                  • This input may be subject to other validation constraints at runtime.
                              • Type: str
              • Dimension Columns: The columns to use as dimensions.
                • Name: dimension_columns
                • Tooltip:
                  • Validation Constraints:
                    • The input must have a minimum length of 1.
                    • This input may be subject to other validation constraints at runtime.
                • Type: list[str]
              • Date Column: The column to use as the date.
                • Name: date_column
                • Tooltip:
                  • Detail:
                    • The date column must be in a DateTime-readable format.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: str
              • Value Column: The column to use as the value.
                • Name: value_column
                • Tooltip:
                  • Detail:
                    • The value column must be a numeric (int, float, double, decimal, etc.) column.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: str
        • Feature Data Definition: The feature data definition to use.
          • Name: feature_data_definitions
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[TimeSeriesTableDefinition]
      • Optional Input
        • Date Range: The date range to fit the model on.
          • Name: time_range
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Start and End Date
          • Nested Model: Start and End Date
            • Required Input
              • Start Date: The inclusive start of the date range (MM/DD/YYYY).
                • Name: start_date
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: datetime
              • End Date: The inclusive end of the date range (MM/DD/YYYY).
                • Name: end_date
                • Tooltip:
                  • Detail:
                    • Note, the Seasonal ARIMA Anomaly Detector Routine treats the end date as exclusive.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: datetime
    • Artifacts: No artifacts are returned by this method
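
The grouping and date-feature steps described for fit can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the routine's implementation: the helper names (`add_date_features`, `group_rows`), the sample rows, and the column names `store`, `date`, and `sales` are all hypothetical. It shows one group per unique combination of dimension values, an inclusive start and end date filter, and the derived day, month, and year features.

```python
from collections import defaultdict
from datetime import date


def add_date_features(row, date_column):
    """Return a copy of the row with day, month, and year derived from its date."""
    d = row[date_column]
    return {**row, "day": d.day, "month": d.month, "year": d.year}


def group_rows(rows, dimension_columns, date_column, start_date=None, end_date=None):
    """Group rows by their dimension values, keeping only dates inside the range.

    Both bounds are treated as inclusive, as documented for Start and End Date.
    """
    groups = defaultdict(list)
    for row in rows:
        d = row[date_column]
        if start_date is not None and d < start_date:
            continue
        if end_date is not None and d > end_date:
            continue
        key = tuple(row[column] for column in dimension_columns)
        groups[key].append(add_date_features(row, date_column))
    return dict(groups)


# Hypothetical sample data: two stores, so two groups (one model each) would result.
rows = [
    {"store": "A", "date": date(2021, 9, 1), "sales": 10.0},
    {"store": "A", "date": date(2021, 9, 2), "sales": 12.0},
    {"store": "B", "date": date(2021, 9, 1), "sales": 7.5},
    {"store": "B", "date": date(2021, 10, 1), "sales": 8.0},
]

groups = group_rows(
    rows, ["store"], "date", start_date=date(2021, 9, 1), end_date=date(2021, 9, 30)
)
for key, members in groups.items():
    print(key, len(members))
```

In this sketch each entry in `groups` stands for the training set of one model; the October row for store B is dropped because it falls after the inclusive end date.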

3. Predict (Method)
  • Method: predict
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: This method was tested on a dataset with 10k targets and 1.1M rows, completing in 5 minutes with 5 GB of memory. A dataset with 40k targets and 29.2M rows completed in 53 minutes with 20 GB of memory. Prediction covered one month of daily data.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Makes predictions using the trained XGBoost regression model.
    • Detailed Description:

      • This method takes new input data and produces predicted values based on the model previously fit using the fit() method. The input data must match the format used during training, including the same dimensions and time-based features. The method returns a DataFrame containing the date and prediction columns.
    • Inputs:

      • Required Input
        • Source Connection: The connection type to use to access the source data.
          • Name: data_connection
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Feature Columns: The columns to use as dimensions.
          • Name: dimension_columns
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Date Column: The column to use as the date.
          • Name: date_column
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Feature Data Definition: The feature data definition to use.
          • Name: feature_data_definitions
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[TimeSeriesTableDefinition]
      • Optional Input
        • Date Range: The date range to predict on.
          • Name: time_range
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Start and End Date
          • Nested Model: Start and End Date
            • Required Input
              • Start Date: The inclusive start of the date range (MM/DD/YYYY).
                • Name: start_date
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: datetime
              • End Date: The inclusive end of the date range (MM/DD/YYYY).
                • Name: end_date
                • Tooltip:
                  • Detail:
                    • Note, the Seasonal ARIMA Anomaly Detector Routine treats the end date as exclusive.
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: datetime
    • Artifacts:

      • Predictions: A dataframe of dates and the predicted values over those dates.
        • Qualified Key Annotation: predictions
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@predictions/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1000000 rows.
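
The partitioning rule above (no more than 1000000 rows per file) can be illustrated with a short calculation. This is a sketch under assumptions: the `partition_plan` helper is hypothetical, and numbering the files from 0 is a guess, since this page only gives the pattern `data_<int>.parquet`.

```python
MAX_ROWS_PER_FILE = 1_000_000  # documented upper bound on rows per parquet file


def partition_plan(total_rows, max_rows=MAX_ROWS_PER_FILE):
    """Return (file_name, start_row, stop_row) tuples covering all rows.

    stop_row is exclusive. Starting the file index at 0 is an assumption.
    """
    plan = []
    for index, start in enumerate(range(0, total_rows, max_rows)):
        stop = min(start + max_rows, total_rows)
        plan.append((f"data_{index}.parquet", start, stop))
    return plan


# Hypothetical prediction output of 2.5 million rows.
for file_name, start, stop in partition_plan(2_500_000):
    print(file_name, stop - start)
```

A consumer of the predictions artifact should therefore expect to read every file in the directory rather than a single parquet file once the output exceeds one million rows.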

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: XGBoostRegressionRoutine

Method Name    Artifact Keys
__init__       N/A
fit            N/A
predict        predictions
