
LinearRegressionRoutine

Versions

v1.0.0

Basic Information

Class Name: LinearRegressionRoutine

Title: Linear Regression

Version: 1.0.0

Author: Graham Eger

Organization: OneStream

Creation Date: 2024-09-18

Default Routine Memory Capacity: 2 GB

Tags

Linear Models, Regression, Statistics

Description

Short Description

A routine to perform linear regression on a dataset

Long Description

This routine mimics the functionality of the PyCaret library's linear regression experiment. It sets up a linear regression experiment and fits a linear regression model to the data. The routine can then be used to make predictions on new data. This routine predicts only a single target value, so it may not be the best choice when the use case requires predicting several target values at once.

Use Cases

1. Approximating the Price of Diamonds and Similar Goods

No two diamonds are identical. To price goods accurately within a liquid and competitive market such as the diamond market, many features of the stone must be considered. This routine can be used to create a model for the price of a good based on data from recent sales or human expert opinion. Once the model is fitted, it can be used to predict the price of new, unpriced goods from their feature set. This routine could also help to interpret which features have the most impact on the price of a good.

2. Predicting the Usage of Insurance Claims

Insurance companies often have large datasets of claims, and they may want to predict the number of claims they will receive in the future. This routine can be used to create a model for the number of claims based on the features of the claims. Following model fit, the linear regression model can then be used to predict the number of claims in the future based on the feature set of the new unprocessed claims. This routine could be useful for helping to interpret which features have the most impact on the number of claims.

Routine Methods

1. Init (Constructor)
  • Method: __init__
    • Type: Constructor

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: There are no limits to the constructor method. This method simply saves the input parameters to be utilized in subsequent runs of the fit and predict methods.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • The constructor for the Linear Regression Routine. This method initializes the routine.
    • Detailed Description:

      • It simply sets up the routine with the parameters.
    • Inputs:

      • Required Input
        • Remove Outliers: Remove outliers from the data? When enabled, the outlying 5% of the dataset will be removed.
          • Name: remove_outliers
          • Long Description: When enabled, the outlying 5% of the dataset will be removed. Consider enabling this option only if the dataset contains outliers which are present in the data used to fit the model, but are not expected in the actual targets of the model. Oftentimes, understanding outliers is more important than removing them. Use caution when enabling this option.
          • Tooltip:
            • Detail:
              • Remove outliers from the data using an Isolation Forest algorithm
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: bool
        • Imputation Type: The type of imputation to use for missing values. Simple uses the mean value, while iterative uses a linear model to predict the missing value.
          • Name: imputation_type
          • Long Description: The options of 'mean' and 'iterative' imputation are available. Mean imputation replaces missing values with the average of the feature, while iterative imputation uses an on-the-fly generated linear model to predict the missing value for a given dimension.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Fold Count: The number of folds to use for cross validation. Must be smaller than the number of data points in the set.
          • Name: fold_count
          • Long Description: When fitting a linear regression model, the data is split into a number of folds. The model is trained on all but one of the folds and then tested on the remaining fold. This process is repeated for each fold, and the per-fold results are averaged to estimate how well the model performs. A higher number of folds generally gives a more reliable estimate, but also takes longer to train.
          • Tooltip:
            • Detail:
              • A number between 5 and 10 is recommended to reduce error in the returned model, while still maintaining a reasonable training time. The number of records must exceed this value to be valid.
            • Validation Constraints:
              • The input must be greater than or equal to 2.
              • The input must be less than or equal to 25.
              • This input may be subject to other validation constraints at runtime.
          • Type: int
        • Deterministic: Will the model be deterministic and use a fixed seed for random number generation or will it be non-deterministic?
          • Name: deterministic
          • Long Description: Deterministic models will use a fixed seed for random number generation, while non-deterministic models will use a random seed for random number generation. Turning determinism on can be useful when trying to reproduce results, but can also lead to overfitting.
          • Tooltip:
            • Detail:
              • Deterministic models will use a fixed seed for random number generation, while non-deterministic models will use a random seed for random number generation.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: bool
        • Models to Run: The models to run for the linear regression routine.
          • Name: models_to_run
          • Long Description: The models to run for the linear regression routine. If "all" is selected, all models will be run.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
    • Artifacts: No artifacts are returned by this method
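
The constructor inputs and their documented constraints can be summarized in a short sketch. The `LinearRegressionParams` class, its default values, and the `validate` helper are hypothetical names introduced here for illustration; they are not part of the routine's API. The accepted `imputation_type` strings are assumed from the long description above, and the platform may apply further validation at runtime.

```python
from dataclasses import dataclass, field


@dataclass
class LinearRegressionParams:
    """Hypothetical container mirroring the documented constructor inputs."""
    remove_outliers: bool = False          # default is an assumption
    imputation_type: str = "mean"          # assumed options: 'mean' or 'iterative'
    fold_count: int = 5                    # default is an assumption
    deterministic: bool = True             # default is an assumption
    models_to_run: list[str] = field(default_factory=lambda: ["all"])


def validate(params: LinearRegressionParams, record_count: int) -> list[str]:
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    # Documented constraint: 2 <= fold_count <= 25.
    if not 2 <= params.fold_count <= 25:
        problems.append("fold_count must be between 2 and 25")
    # Documented constraint: fewer folds than data points.
    if params.fold_count >= record_count:
        problems.append("fold_count must be smaller than the number of records")
    if params.imputation_type not in ("mean", "iterative"):
        problems.append("imputation_type must be 'mean' or 'iterative'")
    return problems


print(validate(LinearRegressionParams(fold_count=10), record_count=1000))
print(validate(LinearRegressionParams(fold_count=30), record_count=20))
```

Checking inputs like this before starting a run can catch an invalid fold count early, before any time is spent on fitting.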

2. Fit (Method)
  • Method: fit
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Fitting models is often more computationally expensive than running predictions, as is the case for this Routine. When tested with a dataset containing 1M rows, 10 feature columns, and one value column, with 80 GB of memory allocated, this method completes in about 3 hours. When provided with just 10K rows, 10 feature columns, and one value column, it completes in roughly 10 minutes.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Fit the linear regression model to the data.
    • Detailed Description:

      • Fit the linear regression model to the dataset defined in the parameters and save the results to an artifact. Following the call to fit, the routine should be in the state where it can be used to make predictions.
    • Inputs:

      • Required Input
        • Source Connection: The source data definition for the fit portion of the linear regression routine.
          • Name: source_data_definition
          • Long Description: The source data definition for the fit portion of the linear regression routine. This can be provided using any data source that contains the columns needed to fit the model.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Value Column: The target value column name for the linear regression routine.
          • Name: value_column
          • Long Description: The target value column name for the linear regression routine. This column should be the column that the model will predict.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts:

      • Report: Report on the Fit of the Linear Regression Model. This report contains a variety of statistics on the fit of the model. The report contains a summary of the N folds used during the cross-validation process. The report also contains 3 plots, a Feature importance plot, a Residuals plot, and a Prediction Error plot.
        • Qualified Key Annotation: report
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@report/data_/html_content.html
            • The html content.
3. Predict (Method)
  • Method: predict
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: No

    • Method Limits: Predictions complete much more quickly than fitting does. For example, after fitting the model on 1M rows, running the predict method on a dataset with 5M new rows and 10 features completes in a matter of minutes. Rapid prediction times have been consistent across various datasets.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Predict with the linear regression model
    • Detailed Description:

      • Uses the fitted model to make predictions on new input data. The input data must contain the same columns as the data used to fit. The method predicts the value from the dimension columns provided. The method returns a dataframe containing the dates and predicted values wrapped in a LinearRegressionArtifactDefinition.
    • Inputs:

      • Required Input
        • Source Connection: The source data definition for the predict portion of the linear regression routine.
          • Name: source_data_definition
          • Long Description: The source data definition for the predict portion of the linear regression routine. This should be the same data source used to fit the model. This can be provided using any data source that contains the same columns as the data used to fit the model.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Model to use for Predict: The model to use for the regression predict routine.
          • Name: model
          • Long Description: The model to use for the regression predict routine.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
        • Value Column: The target value column name for the linear regression routine.
          • Name: value_column
          • Long Description: The target value column name for the linear regression routine. This column should be the same as the target value column used to fit the model.
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: str
    • Artifacts:

      • Prediction Result Table: The data to be used for the linear regression. The data will be in a tabular format with the same columns as the input data. An additional column will be added to the end of the table with the predicted values. This column will be titled 'Prediction_Value'.
        • Qualified Key Annotation: predict_out
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@predict_out/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
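
To make the partitioning rule concrete, the sketch below plans how a prediction output of a given size could be spread across files of at most 1,000,000 rows each. The `plan_partitions` helper is hypothetical. Numbering files from 0 and filling each file before starting the next are assumptions; the source states only the per-file row limit and the `data_<int>.parquet` naming pattern.

```python
import math

MAX_ROWS_PER_FILE = 1_000_000  # limit stated in the file annotation above


def plan_partitions(total_rows: int, max_rows: int = MAX_ROWS_PER_FILE) -> list[tuple[str, int]]:
    """Return (relative file path, row count) pairs for a partitioned output."""
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    file_count = math.ceil(total_rows / max_rows)
    plan = []
    for i in range(file_count):
        rows = min(max_rows, total_rows - i * max_rows)
        # Assumption: <int> starts at 0; the source does not say.
        plan.append((f"artifacts_/@predict_out/data_/data_{i}.parquet", rows))
    return plan


plan = plan_partitions(2_500_000)
for path, rows in plan:
    print(path, rows)
```

Under these assumptions, the 5M-row prediction run mentioned in the method limits would produce five files.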

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: LinearRegressionRoutine

Method Name | Artifact Keys
__init__ | N/A
fit | report
predict | predict_out
