MLClassifier

Versions

4.0.0

v4.0.0

Basic Information

Class Name: MLClassifier

Title: ML Classification

Version: 4.0.0

Author: Jeff Robinson

Organization: OneStream

Creation Date: 2024-09-23

Default Routine Memory Capacity: 2.0 GB

Description

Short Description

Classification is a fundamental machine learning task used to predict categorical outcomes based on input features.

Long Description

This routine uses a supervised ML classification technique to categorize data into predefined classes. It is designed to accommodate a variety of data types, handling both numeric and categorical features by automatically applying necessary transformations such as encoding and imputation. The ML Classification routine can train, compare, and select the optimal model capable of learning from labeled training data and identifying patterns that enable it to assign new data points to the correct class. Binary (categorizing into one of two classes), multi-class (categorizing into one of more than two classes), and multi-output/multi-label (categorizing into multiple classes) ML classification are all supported, making the routine suitable for a wide range of ML classification tasks such as customer churn prediction, product purchase likelihood, and credit risk assessment.

Use Cases

1. Customer Churn (Binary Classification)

Customer churn prediction is a binary ML classification problem where the goal is to identify whether a customer is likely to leave (churn) or stay with a business. Using a historical dataset containing customer financial data (e.g., monthly charges, total spending), behavioral data (e.g., usage patterns), and transaction history (e.g., overdue payments, service downgrades), an ML classification model can be trained to predict churn. The target outcome is binary: churn (1) or not churn (0). The trained model can then be used to identify high-risk customers, allowing for targeted retention strategies to reduce customer loss and improve financial outcomes for the business.

2. Product Purchase Prediction (Binary Classification)

For retail and e-commerce businesses, predicting whether a customer will purchase a product (or not) can significantly improve marketing strategies and sales conversion rates. In this binary ML classification use case, the target is to predict purchase (1) or no purchase (0) based on features such as product page views, time spent on the website, customer demographics, and past purchase history. Companies can build an ML classification model that predicts which customers are most likely to buy a product. This enables more targeted marketing and personalized offers, leading to improved conversion rates and customer satisfaction.

3. Credit Risk Assessment (Multi-class Classification)

For financial institutions, accurately assessing the credit risk of loan applicants is crucial for managing risk and making informed lending decisions. In this multi-class ML classification use case, the target is to predict the credit risk level of applicants (as high, medium, or low) based on various features (such as credit history, income, employment status, debt-to-income ratio, and other relevant financial indicators). Companies can build an ML classification model that categorizes applicants into one of the three risk levels. This enables institutions to tailor their lending strategies accordingly. High-risk applicants may be denied loans or offered loans with higher interest rates to mitigate losses, medium-risk applicants may be offered standard loans with standard rates, and low-risk applicants would be offered more favorable terms. By implementing a multi-class model like this, companies can improve their decision-making process, reduce the possibility of defaults, and optimize potential profitability.

4. Customer Feedback Analysis (Multi-output/Multi-label Classification)

For businesses, understanding customer feedback is crucial for improving products and services. In this multi-label ML classification use case, the target is to analyze customer reviews and categorize them into multiple relevant labels such as product quality, customer service, delivery experience, and pricing. Each review can be assigned multiple labels, reflecting the various aspects mentioned by the customer. Different teams can then use these labels to form actionable plans, such as making quality changes or affirming quality control efforts, adjusting prices, and addressing customer or reputation opportunities.

Routine Methods

1. Init (Constructor)

Method: __init__
- Type: Constructor
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: There are no limits for the constructor method.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Constructor for the ML Classification routine.
- Detailed Description:
  - This constructor sets up the routine with the necessary API instance and parameters for training and evaluation of ML classification models. It prepares the environment for model training, including setting up the target variable, session ID for reproducibility, and other hyperparameters needed for ML classification.
- Inputs:
  - Required Input
    - Model Training Configuration Parameters: A mix of required and optional constructor parameters for ML Classification.
      - Name: training_config_parameters
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of PyCaret Classifier
      - Nested Model: PyCaret Classifier
        
        Required Input
        
        Optimize Model: Metric to optimize during tuning (e.g. Accuracy, Precision, Recall).
        
        Name: optimize_model
        
        Tooltip:
        
        Detail:
        
        Accuracy: Ratio of correctly predicted observations to total observations. Good for balanced datasets where false positives and false negatives are equally important.
        
        Precision: Ratio of correctly predicted positives to total predicted positives. Important when false positives are costly (e.g., selecting viable drilling sites).
        
        Recall: Ratio of correctly predicted positives to all actual positives. Prioritizes minimizing false negatives (e.g., medical diagnosis, where a missed diagnosis is costly).
        
        F1: Harmonic mean of Precision & Recall. Less interpretable but useful for imbalanced data (e.g., rare event detection).
        
        AUC: Area under ROC curve. Useful for ranking models, especially on imbalanced data.
        
        These are among the more common metrics used for optimizing classification models. See the article 'Battling Imbalanced classes: Strategies for More Reliable Machine Learning Models' for more information.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Training Size: Proportion of the data used for training. Default is 0.8 (80%). Value must be greater than 0 and less than 1.
        
        Name: train_size
        
        Tooltip:
        
        Validation Constraints:
        
        The input must be greater than 0.
        
        The input must be less than 1.
        
        This input may be subject to other validation constraints at runtime.
        
        Type: float
        
        Advanced Cross-Validation: Optional cross-validation settings. None accepts default settings.
        
        Name: fold_splitting_options
        
        Tooltip:
        
        Detail:
        
        K-Fold: Provides a general approach for cross-validation but may lead to class imbalances in each fold for classification tasks.
        
        Stratified K-Fold: Maintains class balance in each fold, making it ideal for imbalanced classification tasks, but may not be suitable for regression.
        
        Time Series Split: Ensures temporal order is maintained for forecasting but does not shuffle data, which can limit variance reduction.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be an instance of Advanced Fold Options
        
        Nested Model: Advanced Fold Options
        
        Required Input
        
        Folds: Number of subsets or splits used in cross-validation. Default is 10. Per the validation constraints, the value must be greater than 0 and less than 20.
        
        Name: fold_num
        
        Tooltip:
        
        Validation Constraints:
        
        The input must be greater than 0.
        
        The input must be less than 20.
        
        This input may be subject to other validation constraints at runtime.
        
        Type: int
        
        Fold Shuffle: Controls the shuffle parameter of cross-validation. Only applicable when fold_strategy is K-Fold or Stratified K-Fold.
        
        Name: fold_shuffle
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: bool
        
        Fold Strategy: Choice of cross validation strategy. Default is Stratified K-Fold.
        
        Name: fold_strategy
        
        Tooltip:
        
        Detail:
        
        K-Fold: Provides a general approach for cross-validation but may lead to class imbalances in each fold for classification tasks.
        
        Stratified K-Fold: Maintains class balance in each fold, making it ideal for imbalanced classification tasks, but may not be suitable for regression.
        
        Time Series Split: Ensures temporal order is maintained for forecasting but does not shuffle data, which can limit variance reduction.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Optional Input
        
        Session ID: Optional model session ID used to ensure reproducibility of experiments. Defaults to 42.
        
        Name: session_id
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Optional[int]
- Artifacts: No artifacts are returned by this method
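
The constructor constraints documented above (train_size strictly between 0 and 1, fold_num greater than 0 and less than 20, session_id defaulting to 42) can be illustrated with a minimal Python sketch. The class names TrainingConfigSketch and FoldOptionsSketch are hypothetical and are not part of the routine. Only the field names, defaults, and limits come from this page. The fold_shuffle default and the exact strings accepted for fold_strategy are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Display names taken from the tooltips above; the exact strings the
# routine accepts are an assumption.
FOLD_STRATEGIES = ("K-Fold", "Stratified K-Fold", "Time Series Split")


@dataclass
class FoldOptionsSketch:
    """Hypothetical stand-in for the 'Advanced Fold Options' nested model."""

    fold_num: int = 10                        # documented default
    fold_shuffle: bool = False                # default not documented; assumption
    fold_strategy: str = "Stratified K-Fold"  # documented default

    def __post_init__(self) -> None:
        # Documented constraints: greater than 0 and less than 20.
        if not 0 < self.fold_num < 20:
            raise ValueError("fold_num must be greater than 0 and less than 20")
        if self.fold_strategy not in FOLD_STRATEGIES:
            raise ValueError(f"unknown fold_strategy: {self.fold_strategy!r}")


@dataclass
class TrainingConfigSketch:
    """Hypothetical stand-in for the training_config_parameters input."""

    optimize_model: str                       # e.g. "Accuracy", "Precision", "Recall", "F1", "AUC"
    train_size: float = 0.8                   # documented default (80%)
    fold_splitting_options: Optional[FoldOptionsSketch] = None  # None accepts defaults
    session_id: Optional[int] = 42            # documented default

    def __post_init__(self) -> None:
        # Documented constraints: greater than 0 and less than 1.
        if not 0 < self.train_size < 1:
            raise ValueError("train_size must be greater than 0 and less than 1")


config = TrainingConfigSketch(
    optimize_model="F1",
    fold_splitting_options=FoldOptionsSketch(fold_num=5, fold_shuffle=True),
)
```

The routine itself may apply further checks at runtime, as each input's validation note states, so this sketch covers only the limits listed on this page.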

2. Create Web App (Method)

Method: create_web_app
- Type: Method
- Allow In-Memory Execution: Yes
- Read Only: No
- Method Limits: There are no limits for this method. It is expected to complete within a minute with 2GB of memory allocated.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Creates the classification web application.
- Detailed Description:
  - This method creates a web dashboard responsible for showing users method outputs and allowing for a deeper understanding of train and predict results.
- Inputs:
  - No input parameters
- Artifacts:
  - Classification Web App: Dashboard to analyze results from the classification predict and train routine runs.
    - Qualified Key Annotation: web_app
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@web_app/data_/data.appref
        
        JSON file containing data for the web app.

3. Observation Shap View (Method)

Method: observation_shap_view
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method generally completes faster than the train method. With a small dataset containing 300 rows of data with 1 target column and 4 feature columns, this method is expected to complete in about 3 minutes with 2GB of memory allocated.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Generate waterfall plots for the SHAP values of the given observation(s).
- Detailed Description:
  - This method generates data visualizations for shap values if the user is running binary or multi-class classification.
- Inputs:
  - Required Input
    - Source Connection: The connection information for the source data.
      - Name: data_connection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data into a database that is available in this list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
    - Index Selection: Identifier column(s) that identify the specific observation (row).
      - Name: index_selection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Probability threshold: The probability threshold for the probability distribution.
      - Name: probability_threshold
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: float
- Artifacts:
  - Observation Shap Plots: Waterfall plots for the shap values of the given observation(s).
    - Qualified Key Annotation: observation_shap_plots
    - Aggregate Artifact: True
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@observation_shap_plots/data_
        
        Folder containing inner artifacts
    - Nested Artifacts: This collection includes Key-based collection of Artifacts
  - Observation Shap Plots: Waterfall plots for the shap values of the given observation(s).
    - Qualified Key Annotation: N/A
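
A waterfall plot relies on the additive property of SHAP values: a base value plus the per-feature contributions equals the model output for that observation. The following minimal Python sketch computes the bar positions such a plot would draw. The function name, the feature names, the numbers, and the ordering by absolute contribution are illustrative assumptions, not the routine's implementation, and the output scale (probability or log-odds) depends on the model and explainer used.

```python
def waterfall_steps(base_value, contributions):
    """Return ([(feature, start, end), ...], final_value) for one observation.

    Bars are ordered by absolute contribution, largest first (an
    illustrative choice); each bar starts where the previous one ended.
    """
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    steps = []
    running = base_value
    for feature, value in ordered:
        steps.append((feature, running, running + value))
        running += value
    return steps, running


# Hypothetical SHAP values for a single observation.
steps, final_value = waterfall_steps(
    base_value=0.25,
    contributions={
        "monthly_charges": 0.25,
        "overdue_payments": 0.125,
        "tenure": -0.0625,
    },
)
for feature, start, end in steps:
    print(f"{feature}: {start} -> {end}")
print("model output:", final_value)
```

Whatever order the bars are drawn in, the final value is the same, because it is simply the base value plus the sum of all contributions.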

4. Predict (Method)

Method: predict
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method generally completes faster than the train method. With a small dataset containing 300 rows of data with 1 target column and 4 feature columns, this method is expected to complete in about 2 minutes with 2GB of memory allocated.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Predict ML classification with trained model.
- Detailed Description:
  - This method uses trained, tuned models to make binary, multi-class, multi-label, or multi-output classifications.
- Inputs:
  - Required Input
    - Source Connection: The connection information for the source data.
      - Name: data_connection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data into a database that is available in this list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
    - Index Selection: Index field to be included in prediction output.
      - Name: index_selection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
    - Trained Model: Trained model for making predictions.
      - Name: model_name
      - Tooltip:
        
        Detail:
        
        Trained models available for this routine instance.
        
        Validation Constraints:
        
        The input must have a minimum length of 1.
        
        This input may be subject to other validation constraints at runtime.
      - Type: str
    - Probability Threshold: Defaults to the threshold set in training, which is 0.5 unless otherwise specified. Must be between 0 and 1.
      - Name: probability_threshold
      - Tooltip:
        
        Detail:
        
        The probability threshold here will only be used for binary targets.
        
        Validation Constraints:
        
        The input must be greater than 0.
        
        The input must be less than 1.
        
        This input may be subject to other validation constraints at runtime.
      - Type: float
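
For a binary target, the probability threshold decides which scores become the positive class, and moving it trades precision against recall (the metrics described for the optimize_model input). A minimal Python sketch of that effect follows. The function names and sample data are hypothetical, and treating a score equal to the threshold as positive is an assumption about the comparison used.

```python
def apply_threshold(scores, threshold=0.5):
    """Convert positive-class probabilities to 0/1 labels."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be greater than 0 and less than 1")
    # Assumption: scores at or above the threshold are labeled positive.
    return [1 if score >= threshold else 0 for score in scores]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) computed from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and positive-class probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

default_result = precision_recall(y_true, apply_threshold(scores, 0.5))
strict_result = precision_recall(y_true, apply_threshold(scores, 0.8))
print("threshold 0.5:", default_result)
print("threshold 0.8:", strict_result)
```

In this sample, raising the threshold from 0.5 to 0.8 removes the one false positive, so precision rises, but two of the three actual positives are now missed, so recall falls.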
- Artifacts:
  - Classification Prediction Report: A classification prediction report containing the top 10 predictions (by index) and, for binary classification, a Score vs Prediction plot. Multiclass models do not produce the prediction score that this plot requires.
    - Qualified Key Annotation: prediction_report
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@prediction_report/data_/html_content.html
        
        The html content.
  - Prediction Output: The full prediction dataframe from the predict routine.
    - Qualified Key Annotation: prediction_output
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@prediction_output/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - Shap Values: A dataframe containing the shap values for each class in every observation in the prediction dataset.
    - Qualified Key Annotation: shap_dataframe
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@shap_dataframe/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
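
The partitioned parquet outputs above cap each file at 1000000 rows. A minimal Python sketch of how a row count maps onto data_<int>.parquet files is shown below. The function name is hypothetical, and zero-based file numbering is an assumption, since the page only gives the data_<int>.parquet pattern.

```python
def partition_plan(n_rows, max_rows=1_000_000):
    """Return (file_name, start_row, stop_row) tuples covering n_rows.

    stop_row is exclusive. File numbering starts at 0, which is an
    assumption; the documented pattern is only data_<int>.parquet.
    """
    if n_rows < 0 or max_rows <= 0:
        raise ValueError("n_rows must be >= 0 and max_rows must be > 0")
    return [
        (f"data_{index}.parquet", start, min(start + max_rows, n_rows))
        for index, start in enumerate(range(0, n_rows, max_rows))
    ]


# A hypothetical prediction output of 2.5 million rows.
plan = partition_plan(2_500_000)
for file_name, start, stop in plan:
    print(file_name, stop - start)
```

A consumer reading these artifacts would need to read every file in the data_ folder and concatenate them to recover the full dataframe.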

5. Train (Method)

Method: train
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: No
- Method Limits: This method is subject to timeouts depending on which model(s) are selected. Timeouts may occur even with a single model selected when training on 100K rows of data with 1 target column and 4 feature columns. With a smaller dataset of 5K rows and 4 feature columns, training all models is expected to complete in around 10 minutes with 2GB of memory allocated. With a 100K-row dataset with 1 target and 5 feature columns, training a single model can be expected to complete in around 3 hours. To optimize runtime, it is advised to train fewer models on smaller datasets.
- Outputs Dynamic Artifacts: Yes
- Short Description:
  - Train binary, multi-class, or multi-label/multi-output classification models and save the best-performing model.
- Detailed Description:
  - This method determines which models to train based on the provided parameters, trains up to 5 models, selects the best performing one, tunes it, and then saves the final model.
- Inputs:
  - Required Input
    - Source Connection: The connection information for the source data.
      - Name: data_connection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data into a database that is available in this list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
    - Data Exploration View: Whether to include a data exploration view as part of the routine artifacts.
      - Name: include_data_exploration
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: bool
    - Create Feature Win Percentage: Whether to include feature win percentage dataframe as part of the routine artifacts.
      - Name: include_feature_win_percentage
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: bool
    - Target: Select Add to configure additional targets (5 max).
      - Name: initial_target_model_feature_selection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Configure 1 target model option
      - Nested Model: Configure 1 target model option
        
        Required Input
        
        Target: Select the target to train the ML classification model(s).
        
        Name: target_selection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Features: Select features to train the ml classification model(s).
        
        Name: feature_selection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: list[str]
        
        Include Raw Scores: Include raw scores in the prediction output. Please note: this option is not available for SVM Linear or Ridge Classifiers.
        
        Name: include_raw_scores
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: bool
        
        Only Train ML Models: Select this option to only train Machine Learning Models. These models are guaranteed to produce feature importance values.
        
        Name: ml_only
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: bool
        
        Model(s): Select model(s) to train. If more than one is selected, the top model is finalized.
        
        Name: model_selection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: list[str]
        
        Optional Input
        
        Data Imbalance Technique: Select a re-balancing technique to address class imbalance in the target dataset.
        
        Name: fix_data_imbalance_method
        
        Tooltip:
        
        Validation Constraints:
        
        The input must have a minimum length of 1.
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Optional[str]
    - Optional: Configure Additional Targets: Select Add to configure additional targets (5 max).
      - Name: additional_target_model_feature_selection
      - Tooltip:
        
        Validation Constraints:
        
        The input must have a maximum length of 5.
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[ClassificationModelFeatureSelectionV3]
    - Index Selection: Identifier column(s) to carry through training/holdout artifacts.
      - Name: index_selection
      - Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: list[str]
  - Optional Input
    - Optional: Transformation Method: The transformation method.
      - Name: transformation_method
      - Tooltip:
        
        Detail:
        
        Yeo-Johnson: Often the default choice; it handles both positive and negative values.
        
        Box-Cox: Requires strictly positive data.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Optional[str]
    - Optional: Normalization Method: The normalization method.
      - Name: normalize_method
      - Tooltip:
        
        Detail:
        
        Z-Score: Standard scaling. Rescales data to a mean of 0 and a standard deviation of 1; works with both positive and negative values.
        
        Min-Max: MinMax scaling. Scales data to a given range (default [0, 1]); works with all real numbers.
        
        Max-Abs: MaxAbs scaling. Scales data by its maximum absolute value; preserves sparsity and works with both positive and negative values.
        
        Robust: Robust scaling. Centers data on the median and scales by the interquartile range; less sensitive to outliers.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Optional[str]
- Artifacts:
  - Classification Report: A comprehensive classification training report for the dataset, including relevant training data, metrics, and charts.
    - Qualified Key Annotation: classification_report
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@classification_report/data_/html_content.html
        
        The html content.
  - Holdout Predictions: Holdout set prediction dataframe.
    - Qualified Key Annotation: holdout_dataset
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@holdout_dataset/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - Confusion Matrices: Confusion matrices for all targets.
    - Qualified Key Annotation: confusion_matrices
    - Aggregate Artifact: True
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@confusion_matrices/data_
        
        Folder containing inner artifacts
    - Nested Artifacts: This is a key-based collection of artifacts.
  - Confusion Matrices: Confusion matrices for all targets.
    - Qualified Key Annotation: N/A
  - Feature Importances: Feature importances for all targets. Targets trained on non-ML models will not have feature importances.
    - Qualified Key Annotation: feature_importances
    - Aggregate Artifact: True
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@feature_importances/data_
        
        Folder containing inner artifacts
    - Nested Artifacts: This is a key-based collection of artifacts.
  - Feature Importances: Feature importances for all targets. Targets trained on non-ML models will not have feature importances.
    - Qualified Key Annotation: N/A
  - Model Reports: A dataframe containing the model reports for all models trained in the routine.
    - Qualified Key Annotation: model_reports
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@model_reports/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - Shap Train Values: A dataframe containing the SHAP values for each class for every observation in the training dataset.
    - Qualified Key Annotation: shap_dataframe
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@shap_dataframe/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
  - Roc Curves: The ROC curves for each target, stored as a dictionary keyed by target name.
    - Qualified Key Annotation: roc_curve
    - Aggregate Artifact: True
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@roc_curve/data_
        
        Folder containing inner artifacts
    - Nested Artifacts: This is a key-based collection of artifacts.
  - Roc Curves: The ROC curves for each target, stored as a dictionary keyed by target name.
    - Qualified Key Annotation: N/A
  - Dynamic Artifacts Metadata: Contains metadata for the dynamic artifacts that are generated at runtime for this method.
    - Qualified Key Annotation: dynamic_artifacts_metadata
    - Aggregate Artifact: False
    - In-Memory Json Accessible: True
    - File Annotations:
      - artifacts_/@dynamic_artifacts_metadata/data_/data.json
        
        Stored json data.
      - artifacts_/@dynamic_artifacts_metadata/data_/schema.json
        
        The json schema of the json object stored in the 'data.json' file
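
The four options listed for the `normalize_method` input can be illustrated with a minimal, standard-library-only sketch. It is not the routine's implementation: the function names are hypothetical, the Z-Score version assumes the population standard deviation, the Robust version assumes linearly interpolated quartiles, and constant columns (which would divide by zero) are not handled.

```python
import statistics


def z_score(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values, low=0.0, high=1.0):
    """Rescale linearly so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def max_abs(values):
    """Divide by the largest absolute value; zeros stay zero, preserving sparsity."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]


def robust(values):
    """Center on the median and scale by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# A small sample with one extreme outlier.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(min_max(sample))  # the outlier squeezes the other values toward 0
print(robust(sample))   # the non-outlier values keep a usable spread
```

Comparing the two printed lists shows why Robust is described as less sensitive to outliers: the median and interquartile range are unaffected by the single extreme value, whereas Min-Max compresses every other value into a narrow band near 0.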

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: MLClassifier

Method Name	Artifact Keys
`__init__`	N/A
`create_web_app`	web_app
`observation_shap_view`	observation_shap_plots
`predict`	prediction_report, prediction_output, shap_dataframe
`train`	classification_report, holdout_dataset, confusion_matrices, feature_importances, model_reports, shap_dataframe, roc_curve, dynamic_artifacts_metadata
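
The table above, combined with the file annotations listed for each artifact, suggests a simple way to locate artifact files on disk. The following is a hedged sketch only: `run_root` is a hypothetical folder that contains the `artifacts_` directory (this document does not state where artifacts are stored), and the helper names are illustrative rather than part of any OneStream API.

```python
from pathlib import Path

# Artifact keys per method, copied from the table above.
ARTIFACT_KEYS = {
    "__init__": [],
    "create_web_app": ["web_app"],
    "observation_shap_view": ["observation_shap_plots"],
    "predict": ["prediction_report", "prediction_output", "shap_dataframe"],
    "train": [
        "classification_report", "holdout_dataset", "confusion_matrices",
        "feature_importances", "model_reports", "shap_dataframe",
        "roc_curve", "dynamic_artifacts_metadata",
    ],
}


def artifact_data_dir(run_root, key):
    """Build <run_root>/artifacts_/@<key>/data_ for one artifact key.

    run_root is an assumed location; adjust it to wherever artifacts are stored.
    """
    return Path(run_root) / "artifacts_" / f"@{key}" / "data_"


def parquet_partitions(run_root, key):
    """List data_<int>.parquet files sorted by their integer partition index."""
    files = artifact_data_dir(run_root, key).glob("data_*.parquet")
    # Sort numerically so that data_10 comes after data_2.
    return sorted(files, key=lambda p: int(p.stem.split("_")[1]))
```

Sorting by the integer suffix rather than by name keeps partitions in order once there are ten or more files. The resulting paths could then be passed to any parquet reader, for example `pandas.read_parquet`.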
