Understanding SensibleAI Studio Routines and When to Use Them
This article takes a closer look at 11 routines in SensibleAI Studio, examining the advantages of each and where it provides the most value to users.
Aggregate Data
Overview
The Aggregate Data routine allows users to aggregate data based on specified columns and aggregation methods. It supports a variety of aggregation types, including:
- Sum: Total values for specified groups
- Mean: Average values within groups
- Min: The smallest value in the group
- Max: The largest value in the group
This routine is particularly useful for time series data and can handle multiple columns, making it a versatile option for data analysts.
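To make the idea concrete, the snippet below is a minimal pandas sketch of the kind of grouping and aggregation the routine performs; the column names and figures are hypothetical and not tied to the routine's own inputs.

```python
import pandas as pd

# Hypothetical store-level transaction data
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Store": ["E01", "E02", "W01", "W02"],
    "Revenue": [1200.0, 950.0, 1430.0, 880.0],
    "Transactions": [110, 95, 130, 84],
})

# Group by region and apply different aggregation types per column,
# mirroring the Sum / Mean / Min / Max options listed above
summary = sales.groupby("Region").agg(
    TotalRevenue=("Revenue", "sum"),
    AvgRevenue=("Revenue", "mean"),
    MinTransactions=("Transactions", "min"),
    MaxTransactions=("Transactions", "max"),
)
print(summary)
```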
Key Features
Feature | Description |
---|---|
Flexible Grouping | Group by user-specified fields |
Multiple Aggregations | Supports various aggregation types for different columns |
Easy Input Management | Options to continue, add, or modify aggregations |
Use Cases of Aggregate Data
Aggregated Dimension Insights
A national retail chain with hundreds of stores generates extensive transaction data. To enhance performance insights, the chain employs the Aggregate Data routine to:
- Group Data by Regions: Aggregate sales and transaction data to understand performance across various locations.
- Generate Summary Statistics: Calculate totals, averages, and counts to identify high- and low-performing stores.
By analyzing this aggregated data, the retail chain can develop targeted strategies for improvement and leverage advanced analytics, including machine learning, to predict trends and optimize inventory. This approach not only streamlines data management but also fosters a culture of continuous improvement.
Time Series Trend Analysis
To analyze consumer behavior over time, the retail chain uses the Aggregate Data routine to:
- Analyze Sales Trends: Aggregate sales data across specific time intervals – daily, monthly, quarterly, or annually – to observe trends.
- Adjust Marketing Strategies: Quickly adapt to market demands based on real-time insights from aggregated data.
Menu Item Sales Analysis for Restaurants
As a consultant for a restaurant chain, you can help clients understand their sales data better by aggregating menu item sales. By utilizing the Aggregate Data routine, you can:
- Calculate Average Sales: Determine the average sales for each category of menu item at individual locations.
- Inform Future Forecasting: Use aggregated data to identify trends and inform future inventory and marketing strategies.
Routine Method Overview
Description
The Aggregate Data routine lets users group data by specified fields and apply selected aggregation types. Here’s how it works:
Input Requirement | Description |
---|---|
Source Connection | Connection information for the data source (must be a TabularConnection) |
Columns to Group | Specify which columns to group by |
Aggregation Step Input | Options to continue, add another column, or modify previous inputs |
Output
The output of the Aggregate Data routine provides aggregated data based on user specifications, facilitating insightful analysis.
Summary of Benefits
Benefit | Description |
---|---|
Streamlined Data Management | Reduces effort in data aggregation tasks |
Actionable Insights | Provides clarity for strategic decision-making |
Enhanced Predictive Analytics | Supports advanced forecasting and trend analysis |
Customer-Centric Approach | Improves understanding of consumer behavior |
Forecast Allocation
Overview
The Forecast Allocation routine expands on forecast outputs by allowing users to approximate sales at a granular level. By using historical datasets alongside forecasts, businesses can allocate predicted sales to individual products or stores.
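As a rough illustration of the allocation idea (not the routine's implementation), the pandas sketch below distributes a store-level forecast to individual products in proportion to their historical sales shares; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical historical sales and a store-level forecast
history = pd.DataFrame({
    "Store": ["S1", "S1", "S1"],
    "Product": ["A", "B", "C"],
    "Sales": [600.0, 300.0, 100.0],
})
store_forecast = {"S1": 1200.0}  # forecast produced at the store level

# Each product's share of historical store sales drives its allocation
shares = history.assign(
    Share=history["Sales"] / history.groupby("Store")["Sales"].transform("sum")
)
shares["AllocatedForecast"] = shares["Store"].map(store_forecast) * shares["Share"]
print(shares[["Store", "Product", "AllocatedForecast"]])
```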
Use Cases
- Products Within Stores: When forecasting overall sales across multiple stores, this routine helps estimate sales for individual products, even those not included in the original forecast. It requires setting dimension columns to match historical data and forecasting targets.
- Stores Within Regions by Month: For forecasts predicting sales across regions, the routine can provide detailed forecasts for individual stores within those regions, accounting for monthly sales variations.
- Large Scale Forecasting: When clients need forecasts for a large number of targets, the Forecast Allocation routine lets them forecast a smaller, higher-level set of targets and then allocate values to the full set based on historical averages.
Routine Method Overview
Input Requirements
Users must provide historical data, define allocation and dimension columns, and specify date and value columns.
Output
The routine generates an allocation dataset reflecting the applied forecast.
Frequency Resampler
Overview
The Frequency Resampler is designed for time series data, allowing users to change the periodicity of their datasets. This can involve both upward aggregations (e.g., daily to weekly) and downward allocations (e.g., monthly to daily). The routine supports various summarization methods, such as sum and average, enabling efficient exploration of data trends.
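The pandas sketch below illustrates both directions with synthetic data: an upward aggregation from daily to weekly, and a downward allocation that spreads a monthly total evenly across days. The routine's own options and column names may differ.

```python
import pandas as pd

daily = pd.DataFrame(
    {"Sales": list(range(1, 15))},
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Upward aggregation: daily -> weekly totals
weekly = daily.resample("W").sum()

# Downward allocation: spread a monthly total of 310 evenly across January
jan_days = pd.date_range("2024-01-01", "2024-01-31", freq="D")
jan_daily = pd.DataFrame({"Sales": 310.0 / len(jan_days)}, index=jan_days)
```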
Key Features
- Flexibility in Periodicity: Users can quickly resample data to different frequencies, facilitating various modeling scenarios.
- Aggregation Methods: Users can choose from multiple aggregation techniques to best fit their data analysis needs.
Use Cases
- Data Exploration: For businesses like Customer A, the Resampler allows exploration of historical sales data at different granularities (daily, weekly, monthly) to optimize forecasting accuracy.
- Anomaly Detection: Companies, such as Company B, can aggregate high-frequency IoT data into hourly or daily summaries to enhance anomaly detection capabilities.
- Pre-processing for SensibleAI Forecast (FOR): Consultants can resample data before loading it into SensibleAI Forecast to ensure the accuracy of predictions at the desired granularity.
Routine Method Overview
The Resample routine requires various inputs:
- Connection Type: TabularConnection, SQLTabularConnection, etc.
- Frequency Specifications: Source and destination frequencies (e.g., daily to monthly).
- Key Columns: Columns used as keys for the resampling process.
Output
The routine generates a resampled dataset that can be used for further analysis.
Kalman Filter V2
Overview
The Kalman Filter V2 excels at cleansing time series data by predicting and correcting estimates based on noisy measurements. It updates predictions iteratively, filtering out noise and revealing underlying trends.
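For intuition, here is a minimal one-dimensional predict/correct loop of the kind a Kalman filter runs; it is not the routine's implementation, and the noise parameters are arbitrary assumptions.

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-3, measurement_var=0.5):
    estimate, error = measurements[0], 1.0
    smoothed = []
    for z in measurements:
        # Predict: carry the estimate forward and let its uncertainty grow
        error += process_var
        # Correct: blend the prediction with the new noisy measurement
        gain = error / (error + measurement_var)
        estimate += gain * (z - estimate)
        error *= (1 - gain)
        smoothed.append(estimate)
    return np.array(smoothed)

noisy = 10 + np.random.randn(50)      # noisy readings around a level of 10
print(kalman_1d(noisy)[-5:])          # last few filtered values
```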
Key Features
- Noise Reduction: The filter balances predicted states against new measurements, enhancing the accuracy of time series data.
- Dynamic Updates: It adapts continuously as new data arrives, making it ideal for dynamic environments like finance.
Use Cases
- Handling Missing Data: The Kalman Filter is instrumental for businesses experiencing data gaps due to system outages or maintenance, ensuring continuity in data analysis.
- Dealing with Anomalies: During events like the COVID-19 pandemic, the filter can identify and remove outliers from datasets, improving forecasting models.
- Cleansing Time Series Data: It effectively corrects point-based anomalies, ensuring data integrity and reliability for predictive modeling.
Routine Methodology
The Kalman Filter V2 requires:
- Configuration Method: Automatic or manual optimization of hyperparameters.
- Connection Type: Similar to the Resample routine, it uses various connection types.
- Dimension Columns: Specifies the columns used for filtering and cleansing.
Output
The routine provides cleansed data, including original and filtered values.
Model Forecast Stage
Overview
The Model Forecast Stage is designed to transform traditional forecasting tables from SensibleAI Forecast into a format suitable for ingestion into Forecast Value Add (FVA) dashboards. This routine simplifies the selection of top-performing models for each business target prediction, allowing users to focus on the most reliable forecasts.
Key Use Cases
1. Cascading Stage Best ML Models
- Scenario: A user updates their predictions and wants to filter for the best-ranked model per target.
- Process: The user specifies a hierarchy for model selection: Best ML, Best Intelligent, Best, and Best Baseline (see the sketch after this list). The routine trims forecast ranges to match actuals and avoids overlapping forecasts.
- Outcome: A refined table comparing SensibleAI Forecast predictions against customer benchmarks.
2. Backtest Model Forecast
- Scenario: A consultant experiments with various project configurations and needs to evaluate their performance.
- Process: The routine filters Backtest Model Forecast (BMF) tables, selecting top models based on specified criteria.
- Outcome: Multiple FVA tables that feed into a Forecast Snapshot dashboard for direct comparison.
3. Implementation Comparisons
- Scenario: Consultants must provide clear comparisons between forecasts generated by SensibleAI Forecast and customer forecasts.
- Outcome: A streamlined process for selecting models that enhances clarity and insight during client engagements.
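The sketch below illustrates the cascading selection idea in pandas: for each target, keep the highest-priority model class that is actually present. The simplified table and hierarchy labels are taken from the use case above; the real staging logic (trimming and overlap handling) is more involved.

```python
import pandas as pd

# Simplified, illustrative model forecast table
forecasts = pd.DataFrame({
    "TargetName": ["T1", "T1", "T2", "T2"],
    "Model": ["Best ML", "Best Baseline", "Best Intelligent", "Best Baseline"],
    "Value": [105.0, 98.0, 44.0, 41.0],
})

# Hierarchy from the Cascading Stage use case, highest priority first
hierarchy = ["Best ML", "Best Intelligent", "Best", "Best Baseline"]
forecasts["Priority"] = forecasts["Model"].map({m: i for i, m in enumerate(hierarchy)})

# Keep the best-ranked available model per target
staged = (
    forecasts.sort_values(["TargetName", "Priority"])
             .drop_duplicates(subset="TargetName", keep="first")
)
print(staged[["TargetName", "Model", "Value"]])
```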
Input and Output Specifications
Input Component | Description |
---|---|
Source Connection | Connection details for accessing the source data (Tabular Connection) |
Configure Convert Types | Configuration options for converting data types as needed |
Hierarchical Transformations | Select hierarchical transformations for the model forecast table |
Overlapped Forecasts Handling | Options for managing overlapping forecasts: Use Latest, Use Oldest, No Merge |
Forecast Bounds Handling | Options for trimming forecast values relative to actual values |
Actuals Handling | Options for managing actuals from the DMF table: Remove, Copy per Version |
Output
A staged data table with hierarchical selections of top-ranking models, ready for FVA analysis.
Example Output Schema
Column Name | Data Type | Is Nullable |
---|---|---|
Model | String | False |
TargetName | String | False |
Value | Float64 | False |
Date | DateTime | False |
ModelRank | Int64 | False |
PredictionCallID | Object | False |
... | ... | ... |
Numeric Data Fill
Overview
The Numeric Data Fill routine addresses the challenge of null values in datasets, ensuring that analysis and machine learning models are based on complete data. This routine offers various strategies to fill missing values, thus enhancing data integrity.
Key Features
Filling Strategies:
- Options include filling with zero, mean, median, mode, min, max, custom values, forward fill, and backward fill.
- Forward and backward fills leverage the last known values for matching dimensions, adding contextual relevance (see the sketch after this list).
Use Cases:
- Scenario: A dataset has records but contains null values that could skew analysis.
- Implementation: Users can choose an appropriate fill strategy based on the nature of the data.
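The pandas sketch below shows two of the listed strategies on hypothetical data: a simple mean fill, and a per-dimension forward fill with a backward fill for leading gaps. It is illustrative only and not tied to the routine's schema.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": ["S1", "S1", "S1", "S2", "S2"],
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-01", "2024-01-02"]),
    "Sales": [100.0, np.nan, 120.0, np.nan, 80.0],
})

# Strategy 1: fill nulls with the overall mean
mean_filled = df.assign(Sales=df["Sales"].fillna(df["Sales"].mean()))

# Strategy 2: forward fill the last known value per store, then backward
# fill any leading gaps (e.g., S2's first day)
ffill_bfill = df.assign(
    Sales=df.groupby("Store")["Sales"].transform(lambda s: s.ffill().bfill())
)
```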
Input and Output Specifications
Input Component | Description |
---|---|
Source Data Definition | Connection details and specifications for the source data (TimeSeriesTable) |
Dimension Columns | Columns used as dimensions for filling |
Date Column | Column representing the date |
Value Column | Column containing the values to fill |
Data Fill Definition | Specifies fill strategies for columns |
Output
A data table where missing values have been filled according to the specified strategies.
Prediction Simulator
Overview
The Prediction Simulator allows users to manage and execute multiple data jobs on any project that has passed the data load step in FOR. It replaces the traditional SIM solution, providing a streamlined process for running jobs such as pipeline, deploy, prediction, model rebuild, and project copy. Importantly, users can upload all necessary source data, and the simulator handles updates based on user-defined dates.
Key Features
- Automated Job Management: Users can schedule jobs to run in a specific order, reducing the risk of projects sitting idle.
- User-Friendly Scheduling: Allows running multiple jobs overnight without needing to monitor them actively.
Use Cases
- Busy Consultants: Ideal for consultants juggling multiple projects across various environments, enabling efficient job management.
- Overnight Processing: Users can execute a series of jobs overnight, ensuring all tasks complete by morning.
Routine Methods
Method | Description | Memory Capacity |
---|---|---|
Simulate Predictions | Simulates and runs a specified list of tasks in order | 2.0 GB |
Constructor | Initializes the prediction simulator routine | 0 GB |
Principal Component Analysis
Overview
PCA is a statistical technique used for dimensionality reduction, data compression, and feature extraction. It identifies the principal components that capture the most variance in the data, simplifying complex datasets while retaining essential information.
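A short scikit-learn sketch of the underlying technique follows; the data is synthetic, and standardizing features before PCA is an assumption (a common practice), not a statement about the routine's internals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)                  # 200 rows, 10 synthetic features
X_scaled = StandardScaler().fit_transform(X) # standardize before PCA

pca = PCA(n_components=3)                    # keep the top 3 components
components = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)         # variance captured per component
```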
Key Features
- Dimensionality Reduction: Reduces complexity by transforming datasets into principal components.
- Enhanced Visualization: Makes it easier to analyze and visualize high-dimensional data.
Use Cases
- Anomaly Detection: Identifies unusual patterns in transaction data, aiding in fraud detection.
- Forecasting: Simplifies forecasting models by identifying significant components from various features.
Routine Methods
Method | Description | Memory Capacity |
---|---|---|
Run PCA | Preprocesses and runs PCA on the input dataset. | 2.0 GB |
Replace Special Characters
Overview
This routine focuses on cleansing datasets by identifying and replacing special characters based on a defined schema. It allows users to target specific columns, ensuring data consistency and validity.
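The idea can be sketched in a few lines of pandas with regular expressions; the characters targeted, the replacement, and the column name below are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"EntityID": ["ACME#01", "ACME@02", "beta/03"]})

# Hypothetical find -> replace pairs applied to one column
replacements = {"#": "-", "@": "-", "/": "-"}
df["EntityID"] = df["EntityID"].replace(replacements, regex=True)
print(df["EntityID"].tolist())   # ['ACME-01', 'ACME-02', 'beta-03']
```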
Key Features
- Customizable Cleansing: Users can define multiple find-and-replace operations for various columns.
- Improved Data Quality: Ensures data is clean and ready for analysis, reducing errors in subsequent processing.
Use Cases
- Data Standardization: Helps standardize entity identifiers in datasets for accurate forecasting and analysis.
- Error Prevention: Cleanses unrecognized characters that could cause errors during data ingestion.
Routine Methods
Method | Description | Memory Capacity |
---|---|---|
Cleanse Data | Finds and replaces special characters based on user input. | 2.0 GB |
Target Flagging Analysis
Overview
In data analytics, understanding metrics is crucial for making informed decisions. The Target Flagging Analysis (Stateless) routine calculates various metrics from both source data and model forecast data, allowing users to identify key performance indicators, assess forecast accuracy, and flag potential issues within target dimensions. This section explores the functionality, use cases, and routine methods associated with the analysis.
The Target Flagging Analysis routine can generate a variety of metrics, enabling users to evaluate data quality and forecast accuracy. Here’s a breakdown of its key components:
Key Metrics Generated
Metric | Source Data | Model Forecast Data |
---|---|---|
Actuals Summation | ✔️ | |
Target Start/End Date | ✔️ | ✔️ |
Collection Lag Days/Periods | ✔️ | |
Start Up Lag Days/Periods | ✔️ | |
IsForecastable | ✔️ | |
Local Density | ✔️ | |
Global Density | ✔️ | |
Mean Absolute Error (MAE) | ✔️ | |
Root Mean Squared Error | ✔️ | |
Bias Error | ✔️ | |
Growth Rate | ✔️ |
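For reference, the three error metrics in the table reduce to simple formulas. The numpy sketch below uses illustrative actual and forecast values, not routine output.

```python
import numpy as np

actuals = np.array([100.0, 110.0, 95.0, 120.0])
forecast = np.array([98.0, 115.0, 90.0, 118.0])

errors = forecast - actuals
mae = np.mean(np.abs(errors))          # Mean Absolute Error
rmse = np.sqrt(np.mean(errors ** 2))   # Root Mean Squared Error
bias = np.mean(errors)                 # Bias Error (average over/under-forecast)
print(mae, rmse, bias)
```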
Visual Outputs
The routine also generates plots comparing actual values against MAE% and Score%. These visualizations aid in identifying areas needing attention.
Use Cases
The routine supports multiple use cases, each tailored to specific user needs. Here’s a summary of each:
1. Generate All Metrics Without Flags
This routine creates metrics tables for source and model forecast data, helping users understand the overall data landscape without flagging any issues.
Output:
- Metrics tables for both source data and forecast data
- MAE vs Actuals and Score vs Actuals plots
2. Generate Source Metrics Without Flags
This is ideal for data analysts interested in insights from source time series data alone.
Output:
- Source data metrics table
- Excludes plots and flagging
3. Generate Forecast Metrics Without Flags
This routine focuses on the forecast data, providing insights into its accuracy and quality.
Output:
- Forecast data metrics table
- MAE vs Actuals and Score vs Actuals plots
4. All Metrics Analysis
For implementation consultants, this routine assists in evaluating forecasts across all targets, helping to pinpoint inaccuracies caused by limited data points.
Output:
- Combined metrics table for source and forecast data
Detailed Routine Methods
The routine consists of three primary methods, each serving a unique purpose. Below is a table summarizing their functionalities:
Routine Method | Description | Target Flagging | HTML Output |
---|---|---|---|
Source Metrics Analysis | Generates source metrics without flagging | No | Interactive report |
Forecast Metrics Analysis | Generates forecast metrics without flagging | No | Interactive report |
All Metrics Analysis | Combines source and forecast metrics without flagging | No | Comprehensive report |
Input Requirements
Each method has specific input requirements, such as data connection types and dimension specifications. Common inputs include:
- Source Data Definition: Must be a TimeSeriesTableDefinition
- Connection: Can be SQLTabularConnection, FileTabularConnection, etc.
- Dimension Columns, Date Column, Value Column: Specify the relevant columns
Output Formats
The outputs can be generated in various formats, including HTML reports and Parquet files, ensuring flexibility for users in analyzing their data.
Time Series Data Analysis
Time series analysis plays a crucial role in data analytics, particularly for businesses that track data across various targets. This section explores the Time Series Data Analysis routine, designed to enhance understanding and streamline insights from time series datasets.
Use Cases
1. Implementation Insights for Retail
As an implementation consultant, you may find yourself working with a retail customer who tracks daily data across multiple targets. Understanding the underlying trends in their data can be challenging, especially if it is new territory for both you and the client. The Time Series Data Analysis routine allows you to quickly generate insights and share findings, enhancing both your understanding and that of the customer before moving to the next phases of implementation.
2. Quality Assurance in FOR Projects
Quality checks are critical in validating dataset integrity. A comprehensive time series analysis report can significantly reduce implementation timelines. By generating target-level statistics and visuals, the routine not only expedites the process of deriving actionable insights but also promotes transparent communication with the client regarding data quality.
3. Exploratory Data Analysis
This routine serves a diverse audience—data scientists, analysts, and business professionals alike—helping them extract meaningful insights from time series data. The comprehensive report it generates includes visualizations and interpretations, facilitating a deeper understanding of the dataset.
Routine Methods
Overview of Routine Methods
The Time Series Data Analysis routine offers two primary methods: Generic Analysis and Advanced Analysis. Each generates an HTML report rich in statistics and visualizations but caters to different needs.
Routine Method | Description | Key Features |
---|---|---|
Generic Analysis | Basic interactive report using YData Profiling | High-level dataset summaries, alerts on stationarity, seasonality, distributions |
Advanced Analysis | Comprehensive custom report | Filterable metrics summary, detailed visualizations, target-level plots |
Generic Analysis
The Generic Analysis method employs the open-source YData Profiling library to generate a report that includes:
- Alerts about data characteristics like stationarity and seasonality.
- A correlation matrix (optional) to help visualize relationships between dimensions.
Required Inputs:
- Source Data Definition
- Connection to source data
- Dimension, date, and value columns
- Title for the report
Output:
A Generic Time Series Report in HTML format, providing an overview of the dataset.
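Because the method wraps the open-source YData Profiling library, a comparable report can be produced directly with that library for experimentation. The file and column names below are hypothetical, and this is not the routine itself.

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

df = pd.read_csv("sales_history.csv", parse_dates=["Date"])  # hypothetical file

report = ProfileReport(
    df,
    tsmode=True,     # time series mode: adds stationarity/seasonality alerts
    sortby="Date",   # order observations chronologically
    title="Generic Time Series Report",
)
report.to_file("generic_time_series_report.html")
```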
Advanced Analysis
The Advanced Analysis method goes further by creating a more tailored report that includes:
- A filterable summary table with key metrics
- Time series decomposition plots
- Auto-correlation and partial auto-correlation plots
Required Inputs:
Similar to the Generic Analysis, but with additional options for target-level plots.
Output:
An Advanced Time Series Report in HTML format, rich with detailed analysis.
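The Advanced Analysis report is custom, but similar decomposition and (partial) auto-correlation plots can be reproduced with statsmodels for a single series; the synthetic data and the weekly period of 7 below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Synthetic daily series with a trend and weekly seasonality
idx = pd.date_range("2024-01-01", periods=120, freq="D")
series = pd.Series(
    10 + 0.05 * np.arange(120) + np.sin(np.arange(120) * 2 * np.pi / 7),
    index=idx,
)

seasonal_decompose(series, period=7).plot()  # trend / seasonal / residual panels
plot_acf(series, lags=30)                    # auto-correlation plot
plot_pacf(series, lags=30)                   # partial auto-correlation plot
plt.show()
```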
Detailed Comparison of Routine Methods
To better illustrate the differences between the two routine methods, here’s a summary table highlighting their features:
Feature | Generic Analysis | Advanced Analysis |
---|---|---|
Utilizes YData Profiling | Yes | No |
Customization | Limited | Extensive |
Summary Statistics | Yes | Yes |
Time Series Decomposition Plots | No | Yes |
Auto-Correlation Plots | Yes | Yes |
Warning Flags for Metrics | Yes | Yes |
Filterable Summary Table | No | Yes |
Correlation Matrix | Optional | Not available |
Comparative Analysis of SensibleAI Studio Routines
The following table summarizes key features, input types, and memory capacities for each routine, providing a quick reference for users:
Routine | Purpose | Input Types | Key Outputs | Memory Capacity |
---|---|---|---|---|
Aggregate Data | Data consolidation and summarization | Various data sources | Unified data views | 2.0 GB |
Forecast Allocation | Distributes forecasted values | Tabular data | Allocated forecast data | 2.0 GB |
Frequency Resampler | Resamples time series data | TimeSeriesTable | Resampled data | 2.0 GB |
Kalman Filter V2 | Refines forecasts with filtering | Time series data | Smoothed forecasts | 2.0 GB |
Model Forecast Stage | Prepares data for FVA analysis | Tabular connection | Staged forecast data | 3.0 GB |
Numeric Data Fill | Fills missing values | TimeSeriesTableDefinition | Complete datasets | 2.0 GB |
Prediction Simulator | Automates job execution | Project data | Scheduled job reports | 2.0 GB |
Principal Component Analysis | Reduces dimensionality | Data matrices | Principal components and visualizations | 2.0 GB |
Replace Special Characters | Cleanses data of special characters | Tabular data | Cleansed datasets | 2.0 GB |
Target Flagging Analysis | Evaluates performance metrics | TimeSeriesTable | Metrics reports with visual outputs | 2.0 GB |
Time Series Data Analysis | Analyzes time series data | TimeSeriesTable | Detailed analysis reports | 2.0 GB |