
Understanding SensibleAI Studio Routines and When to Use Them

Author: Ansley Wunderlich, Created: 2024-10-25

This article takes a closer look at 11 routines in SensibleAI Studio, examining the advantages of each routine and where it provides the most value to users.

Aggregate Data

Overview

The Aggregate Data routine allows users to aggregate data based on specified columns and aggregation methods. It supports a variety of aggregation types, including:

  • Sum: Total values for specified groups
  • Mean: Average values within groups
  • Min: The smallest value in the group
  • Max: The largest value in the group
Note: This routine is particularly useful for time series data and can handle multiple columns, making it a versatile option for data analysts.
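
To make the idea concrete, here is a minimal pandas sketch of the kind of grouped aggregation this routine performs. The column names (region, sales, transactions) are purely illustrative; the routine itself is configured through its inputs, not code.

```python
import pandas as pd

# Hypothetical transaction data; column names are illustrative only.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "store_id": [101, 102, 201, 202],
    "sales": [1200.0, 950.0, 1430.0, 880.0],
    "transactions": [40, 31, 52, 29],
})

# Group by region and apply different aggregation types per column,
# mirroring the Sum / Mean / Min / Max options described above.
summary = sales.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    min_transactions=("transactions", "min"),
    max_transactions=("transactions", "max"),
)
print(summary)
```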

Key Features

Feature | Description
Flexible Grouping | Group by user-specified fields
Multiple Aggregations | Supports various aggregation types for different columns
Easy Input Management | Options to continue, add, or modify aggregations

Use Cases of Aggregate Data

Aggregated Dimension Insights

A national retail chain with hundreds of stores generates extensive transaction data. To enhance performance insights, the chain employs the Aggregate Data routine to:

  1. Group Data by Regions: Aggregate sales and transaction data to understand performance across various locations.
  2. Generate Summary Statistics: Calculate totals, averages, and counts to identify high- and low-performing stores.

By analyzing this aggregated data, the retail chain can develop targeted strategies for improvement and leverage advanced analytics, including machine learning, to predict trends and optimize inventory. This approach not only streamlines data management but also fosters a culture of continuous improvement.

Time Series Trend Analysis

To analyze consumer behavior over time, the retail chain uses the Aggregate Data routine to:

  • Analyze Sales Trends: Aggregate sales data across specific time intervals – daily, monthly, quarterly, or annually – to observe trends.
  • Adjust Marketing Strategies: Quickly adapt to market demands based on real-time insights from aggregated data.

Menu Item Sales Aggregation

As a consultant for a restaurant chain, you can help clients better understand their sales data by aggregating menu item sales. With the Aggregate Data routine, you can:

  • Calculate Average Sales: Determine the average sales for each category of menu item at individual locations.
  • Inform Future Forecasting: Use aggregated data to identify trends and inform future inventory and marketing strategies.

Routine Method Overview

Description

The Aggregate Data routine lets users group by specified fields and apply selected aggregation types. Here’s how it works:

Input Requirement | Description
Source Connection | Connection information for the data source (must be a TabularConnection)
Columns to Group | Specify which columns to group by
Aggregation Step Input | Options to continue, add another column, or modify previous inputs

Output

The output of the Aggregate Data routine provides aggregated data based on user specifications, facilitating insightful analysis.

Summary of Benefits

Benefit | Description
Streamlined Data Management | Reduces effort in data aggregation tasks
Actionable Insights | Provides clarity for strategic decision-making
Enhanced Predictive Analytics | Supports advanced forecasting and trend analysis
Customer-Centric Approach | Improves understanding of consumer behavior

Forecast Allocation

Overview

The Forecast Allocation routine expands on forecast outputs by allowing users to approximate sales at a granular level. By using historical datasets alongside forecasts, businesses can allocate predicted sales to individual products or stores.

Use Cases

  1. Products Within Stores: When forecasting overall sales across multiple stores, this routine helps estimate sales for individual products, even those not included in the original forecast. It requires setting dimension columns to match historical data and forecasting targets.
  2. Stores Within Regions by Month: For forecasts predicting sales across regions, the routine can provide detailed forecasts for individual stores within those regions, accounting for monthly sales variations.
  3. Large Scale Forecasting: When clients need forecasts for a large number of targets, the Forecast Allocation routine can help scale down the forecast to manageable levels, allocating values based on historical averages (see the sketch after this list).
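
As a rough illustration of the allocation idea, the pandas sketch below derives each product's historical share within its store and applies it to a store-level forecast. The tables and column names (store, product, sales, forecast) are hypothetical and do not reflect the routine's actual input format.

```python
import pandas as pd

# Hypothetical historical sales used to derive allocation weights.
history = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "product": ["A", "B", "A", "B"],
    "sales": [300.0, 100.0, 150.0, 350.0],
})

# Hypothetical store-level forecast to be split across products.
forecast = pd.DataFrame({"store": ["S1", "S2"], "forecast": [500.0, 600.0]})

# Historical share of each product within its store.
history["share"] = history["sales"] / history.groupby("store")["sales"].transform("sum")

# Allocate the store forecast to products in proportion to historical share.
allocated = history.merge(forecast, on="store")
allocated["allocated_forecast"] = allocated["share"] * allocated["forecast"]
print(allocated[["store", "product", "allocated_forecast"]])
```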

Routine Method Overview

Input Requirements

Users must provide historical data, define allocation and dimension columns, and specify date and value columns.

Output

The routine generates an allocation dataset reflecting the applied forecast.

Frequency Resampler

Overview

The Frequency Resampler is designed for time series data, allowing users to change the periodicity of their datasets. This can involve both upward aggregations (e.g., daily to weekly) and downward allocations (e.g., monthly to daily). The routine supports various summarization methods, such as sum and average, enabling efficient exploration of data trends.
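
For intuition, the pandas sketch below shows the kind of upward aggregation the routine performs, resampling a hypothetical daily series to monthly totals and averages. The series and frequencies are illustrative assumptions; the routine itself handles both aggregation and downward allocation through its configured inputs.

```python
import pandas as pd

# Hypothetical daily series; in the routine this comes from the source connection.
idx = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.Series(range(60), index=idx, name="units")

# Upward aggregation: daily -> monthly, summarized by sum or mean.
monthly_sum = daily.resample("MS").sum()
monthly_avg = daily.resample("MS").mean()

print(monthly_sum)
print(monthly_avg)
```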

Key Features

  • Flexibility in Periodicity: Users can quickly resample data to different frequencies, facilitating various modeling scenarios.
  • Aggregation Methods: Users can choose from multiple aggregation techniques to best fit their data analysis needs.

Use Cases

  • Data Exploration: For businesses like Customer A, the Resampler allows exploration of historical sales data at different granularities (daily, weekly, monthly) to optimize forecasting accuracy.
  • Anomaly Detection: Companies, such as Company B, can aggregate high-frequency IoT data into hourly or daily summaries to enhance anomaly detection capabilities.
  • Pre-processing for SensibleAI Forecast (FOR): Consultants can resample data before loading it into SensibleAI Forecast to ensure the accuracy of predictions at the desired granularity.

Routine Method Overview

The Resample routine requires various inputs:

  • Connection Type: TabularConnection, SQLTabularConnection, etc.
  • Frequency Specifications: Source and destination frequencies (e.g., daily to monthly).
  • Key Columns: Columns used as keys for the resampling process.

Output

The routine generates a resampled dataset that can be used for further analysis.

Kalman Filter V2

Overview

The Kalman Filter V2 excels at cleansing time series data by predicting and correcting estimates based on noisy measurements. It updates predictions iteratively, filtering out noise and revealing underlying trends.
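
The routine configures and tunes the filter for you, but a minimal one-dimensional predict/correct loop illustrates the underlying mechanics. The function and parameter names below are illustrative only, not the routine's implementation.

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-3, meas_var=0.5):
    """Minimal 1-D predict/correct loop showing how a Kalman filter
    balances its prior estimate against each noisy measurement."""
    estimate, error = measurements[0], 1.0
    filtered = []
    for z in measurements:
        # Predict: carry the estimate forward and grow its uncertainty.
        error += process_var
        # Correct: weight the new measurement by the Kalman gain.
        gain = error / (error + meas_var)
        estimate += gain * (z - estimate)
        error *= (1 - gain)
        filtered.append(estimate)
    return np.array(filtered)

# Hypothetical noisy series containing a point anomaly.
noisy = np.array([10.0, 10.4, 9.8, 25.0, 10.2, 10.1, 9.9])
print(kalman_1d(noisy))
```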

Key Features

  • Noise Reduction: The filter balances predicted states against new measurements, enhancing the accuracy of time series data.
  • Dynamic Updates: It adapts continuously as new data arrives, making it ideal for dynamic environments like finance.

Use Cases

  • Handling Missing Data: The Kalman Filter is instrumental for businesses experiencing data gaps due to system outages or maintenance, ensuring continuity in data analysis.
  • Dealing with Anomalies: During events like the COVID-19 pandemic, the filter can identify and remove outliers from datasets, improving forecasting models.
  • Cleansing Time Series Data: It effectively corrects point-based anomalies, ensuring data integrity and reliability for predictive modeling.

Routine Methodology

The Kalman Filter V2 requires:

  • Configuration Method: Automatic or manual optimization of hyperparameters.
  • Connection Type: Similar to the Resample routine, it uses various connection types.
  • Dimension Columns: Specifies the columns used for filtering and cleansing.

Output

The routine provides cleansed data, including original and filtered values.

Model Forecast Stage

Overview

The Model Forecast Stage is designed to transform traditional forecasting tables from SensibleAI Forecast into a format suitable for ingestion into Forecast Value Add (FVA) dashboards. This routine simplifies the selection of top-performing models for each business target prediction, allowing users to focus on the most reliable forecasts.

Key Use Cases

  • Cascading Stage Best ML Models:

    • Scenario: A user updates their predictions and wants to filter for the best-ranked model per target.
    • Process: The user specifies a hierarchy for model selection: Best ML, Best Intelligent, Best, and Best Baseline. The routine trims forecast ranges to match actuals and avoids overlapping forecasts (a conceptual sketch of this cascading selection follows this list).
    • Outcome: A refined table comparing SensibleAI Forecast predictions against customer benchmarks.
  • Backtest Model Forecast:

    • Scenario: A consultant experiments with various project configurations and needs to evaluate their performance.
    • Process: The routine filters Backtest Model Forecast (BMF) tables, selecting top models based on specified criteria.
    • Outcome: Multiple FVA tables that feed into a Forecast Snapshot dashboard for direct comparison.
  • Implementation Comparisons:

    • Scenario: Consultants must provide clear comparisons between forecasts generated by SensibleAI Forecast and customer forecasts.
    • Outcome: A streamlined process for selecting models that enhances clarity and insight during client engagements.
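
For intuition, here is a minimal pandas sketch of cascading best-model selection per target. The DataFrame, hierarchy labels, and helper column are illustrative assumptions, not the routine's actual implementation.

```python
import pandas as pd

# Hypothetical model-forecast rows with a rank per target.
mf = pd.DataFrame({
    "TargetName": ["T1", "T1", "T2", "T2"],
    "Model": ["Best ML", "Best Baseline", "Best Intelligent", "Best"],
    "ModelRank": [1, 2, 1, 2],
    "Value": [105.0, 98.0, 43.0, 41.0],
})

# Cascading hierarchy: prefer Best ML, then Best Intelligent, then Best, then Best Baseline.
hierarchy = ["Best ML", "Best Intelligent", "Best", "Best Baseline"]
mf["priority"] = mf["Model"].map({m: i for i, m in enumerate(hierarchy)})

# Keep the highest-priority model available for each target.
staged = mf.sort_values("priority").groupby("TargetName", as_index=False).first()
print(staged[["TargetName", "Model", "Value"]])
```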

Input and Output Specifications

Input Component | Description
Source Connection | Connection details for accessing the source data (Tabular Connection)
Configure Convert Types | Configuration options for converting data types as needed
Hierarchical Transformations | Select hierarchical transformations for the model forecast table
Overlapped Forecasts Handling | Options for managing overlapping forecasts: Use Latest, Use Oldest, No Merge
Forecast Bounds Handling | Options for trimming forecast values relative to actual values
Actuals Handling | Options for managing actuals from the DMF table: Remove, Copy per Version

Output

A staged data table with hierarchical selections of top-ranking models, ready for FVA analysis.

Example Output Schema

Column Name | Data Type | Is Nullable
Model | String | False
TargetName | String | False
Value | Float64 | False
Date | DateTime | False
ModelRank | Int64 | False
PredictionCallID | Object | False
… | … | …

Numeric Data Fill

Overview

The Numeric Data Fill routine addresses the challenge of null values in datasets, ensuring that analysis and machine learning models are based on complete data. This routine offers various strategies to fill missing values, thus enhancing data integrity.

Key Features

  • Filling Strategies:

    • Options include filling with zero, mean, median, mode, min, max, custom values, forward fill, and backward fill.
    • Forward and backward fills leverage the last known values for matching dimensions, adding contextual relevance.
  • Use Cases:

    • Scenario: A dataset has records but contains null values that could skew analysis.
    • Implementation: Users can choose an appropriate fill strategy based on the nature of the data (see the sketch below).
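
As a conceptual illustration of these fill strategies, the pandas sketch below applies forward fill, zero fill, and a per-dimension mean fill to a hypothetical table. The column names are assumptions; the routine applies the chosen strategy through its configuration rather than code.

```python
import pandas as pd

# Hypothetical time series with gaps in the value column.
df = pd.DataFrame({
    "store": ["S1"] * 4 + ["S2"] * 4,
    "date": pd.date_range("2024-01-01", periods=4).tolist() * 2,
    "units": [10.0, None, 12.0, None, 5.0, 6.0, None, 8.0],
})

# Forward fill within each dimension (store) so gaps inherit the last known value.
df["units_ffill"] = df.groupby("store")["units"].ffill()

# Alternative strategies: fill with zero, or with each store's mean.
df["units_zero"] = df["units"].fillna(0)
df["units_mean"] = df.groupby("store")["units"].transform(lambda s: s.fillna(s.mean()))
print(df)
```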

Input and Output Specifications

Input Component | Description
Source Data Definition | Connection details and specifications for the source data (TimeSeriesTable)
Dimension Columns | Columns used as dimensions for filling
Date Column | Column representing the date
Value Column | Column containing the values to fill
Data Fill Definition | Specifies fill strategies for columns

Output

A data table where missing values have been filled according to the specified strategies.

Prediction Simulator

Overview

The Prediction Simulator allows users to manage and execute multiple data jobs on any project that has passed the data load step in FOR. It replaces the traditional SIM solution, providing a streamlined process for running jobs such as pipeline, deploy, prediction, model rebuild, and project copy. Importantly, users can upload all necessary source data, and the simulator handles updates based on user-defined dates.

Key Features

  • Automated Job Management: Users can schedule jobs to run in a specific order, reducing the risk of projects sitting idle.
  • User-Friendly Scheduling: Allows running multiple jobs overnight without needing to monitor them actively.

Use Cases

  • Busy Consultants: Ideal for consultants juggling multiple projects across various environments, enabling efficient job management.
  • Overnight Processing: Users can execute a series of jobs overnight, ensuring all tasks complete by morning.

Routine Methods

Method | Description | Memory Capacity
Simulate Predictions | Simulates and runs a specified list of tasks in order | 2.0 GB
Constructor | Initializes the prediction simulator routine | 0 GB

Principal Component Analysis

Overview

PCA is a statistical technique used for dimensionality reduction, data compression, and feature extraction. It identifies the principal components that capture the most variance in the data, simplifying complex datasets while retaining essential information.

Key Features

  • Dimensionality Reduction: Reduces complexity by transforming datasets into principal components (see the sketch after this list).
  • Enhanced Visualization: Makes it easier to analyze and visualize high-dimensional data.
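
For a sense of what PCA does, the scikit-learn sketch below standardizes a small synthetic feature matrix and keeps the components that explain roughly 95% of its variance. The data and threshold are illustrative only and are not tied to the routine's inputs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 100 observations, 6 correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(100, 4))])

# Standardize, then keep the components explaining ~95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # far fewer columns than the original 6
print(pca.explained_variance_ratio_)   # variance captured by each component
```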

Use Cases

  • Anomaly Detection: Identifies unusual patterns in transaction data, aiding in fraud detection.
  • Forecasting: Simplifies forecasting models by identifying significant components from various features.

Routine Methods

Method | Description | Memory Capacity
Run PCA | Preprocesses and runs PCA on the input dataset. | 2.0 GB

Replace Special Characters

Overview

This routine focuses on cleansing datasets by identifying and replacing special characters based on a defined schema. It allows users to target specific columns, ensuring data consistency and validity.

Key Features

  • Customizable Cleansing: Users can define multiple find-and-replace operations for various columns (illustrated in the sketch below).
  • Improved Data Quality: Ensures data is clean and ready for analysis, reducing errors in subsequent processing.
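
Conceptually, the routine applies a per-column find-and-replace schema. The pandas sketch below mimics that behavior on a hypothetical identifier column; the replacement mapping is an assumption, not the routine's actual configuration format.

```python
import pandas as pd

# Hypothetical entity identifiers containing characters that break ingestion.
df = pd.DataFrame({"entity_id": ["Store#01", "Store@02", "Store 03!"]})

# Per-column find-and-replace schema, analogous to the routine's configuration.
replacements = {"entity_id": {"#": "-", "@": "-", "!": "", " ": "_"}}

for column, mapping in replacements.items():
    for find, replace in mapping.items():
        df[column] = df[column].str.replace(find, replace, regex=False)

print(df)
```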

Use Cases

  • Data Standardization: Helps standardize entity identifiers in datasets for accurate forecasting and analysis.
  • Error Prevention: Cleanses unrecognized characters that could cause errors during data ingestion.

Routine Methods

Method | Description | Memory Capacity
Cleanse Data | Finds and replaces special characters based on user input. | 2.0 GB

Target Flagging Analysis

Overview

In data analytics, understanding metrics is crucial for making informed decisions. The Target Flagging Analysis (Stateless) routine calculates various metrics based on both source data and model forecast data, allowing users to identify key performance indicators, assess forecast accuracy, and flag potential issues within target dimensions. This section covers the routine's functionalities, use cases, and methods.

The Target Flagging Analysis routine can generate a variety of metrics, enabling users to evaluate data quality and forecast accuracy. Here’s a breakdown of its key components:

Key Metrics Generated

Metric | Source Data | Model Forecast Data
Actuals Summation | ✔️ |
Target Start/End Date | ✔️ | ✔️
Collection Lag Days/Periods | ✔️ |
Start Up Lag Days/Periods | ✔️ |
IsForecastable | ✔️ |
Local Density | ✔️ |
Global Density | ✔️ |
Mean Absolute Error (MAE) | | ✔️
Root Mean Squared Error | | ✔️
Bias Error | | ✔️
Growth Rate | ✔️ |
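
For reference, the forecast-accuracy metrics in this table reduce to simple error calculations. The NumPy sketch below shows how MAE, RMSE, and bias could be computed for a single target; the values are made up for illustration and the routine's exact formulas may differ.

```python
import numpy as np

# Hypothetical actuals and forecasts for a single target.
actuals = np.array([100.0, 120.0, 90.0, 110.0])
forecast = np.array([105.0, 115.0, 95.0, 100.0])

errors = forecast - actuals
mae = np.mean(np.abs(errors))            # Mean Absolute Error
rmse = np.sqrt(np.mean(errors ** 2))     # Root Mean Squared Error
bias = np.mean(errors)                   # Bias Error: positive means over-forecasting

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, Bias={bias:.2f}")
```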

Visual Outputs

The routine also generates plots comparing actual values against MAE% and Score%. These visualizations aid in identifying areas needing attention.

Use Cases

The routine supports multiple use cases, each tailored to specific user needs. Here’s a summary of each:

1. Generate All Metrics Without Flags

This routine creates metrics tables for source and model forecast data, helping users understand the overall data landscape without flagging any issues.

Output:

  • Metrics tables for both source data and forecast data
  • MAE vs Actuals and Score vs Actuals plots

2. Generate Source Metrics Without Flags

This is ideal for data analysts interested in insights from source time series data alone.

Output:

  • Source data metrics table
  • Excludes plots and flagging

3. Generate Forecast Metrics Without Flags

This routine focuses on the forecast data, providing insights into its accuracy and quality.

Output:

  • Forecast data metrics table
  • MAE vs Actuals and Score vs Actuals plots

4. All Metrics Analysis

For an implementation consultant, this routine helps evaluate forecasts across all targets and pinpoint inaccuracies caused by limited data points.

Output:

Combined metrics table for source and forecast data

Detailed Routine Methods

The routine consists of three primary methods, each serving a unique purpose. Below is a table summarizing their functionalities:

Routine Method | Description | Target Flagging | HTML Output
Source Metrics Analysis | Generates source metrics without flagging | No | Interactive report
Forecast Metrics Analysis | Generates forecast metrics without flagging | No | Interactive report
All Metrics Analysis | Combines source and forecast metrics without flagging | No | Comprehensive report

Input Requirements

Each method has specific input requirements, such as data connection types and dimension specifications. Common inputs include:

  • Source Data Definition: Must be a TimeSeriesTableDefinition
  • Connection: Can be SQLTabularConnection, FileTabularConnection, etc.
  • Dimension Columns, Date Column, Value Column: Specify the relevant columns

Output Formats

The outputs can be generated in various formats, including HTML reports and Parquet files, ensuring flexibility for users in analyzing their data.

Time Series Data Analysis

In the realm of data analytics, time series analysis plays a crucial role, particularly for businesses that track data across various targets. This section explores the Time Series Data Analysis routine, designed to enhance understanding and streamline insights from time series datasets.

Use Cases

1. Implementation Insights for Retail

As an implementation consultant, you may find yourself working with a retail customer who tracks daily data across multiple targets. Understanding the underlying trends in their data can be challenging, especially if it is new territory for both you and the client. The Time Series Data Analysis routine allows you to quickly generate insights and share findings, enhancing both your understanding and that of the customer before moving to the next phases of implementation.

2. Quality Assurance in FOR Projects

Quality checks are critical in validating dataset integrity. A comprehensive time series analysis report can significantly reduce implementation timelines. By generating target-level statistics and visuals, the routine not only expedites the process of deriving actionable insights but also promotes transparent communication with the client regarding data quality.

3. Exploratory Data Analysis

This routine serves a diverse audience—data scientists, analysts, and business professionals alike—helping them extract meaningful insights from time series data. The comprehensive report it generates includes visualizations and interpretations, facilitating a deeper understanding of the dataset.

Routine Methods

Overview of Routine Methods

The Time Series Data Analysis routine offers two primary methods: Generic Analysis and Advanced Analysis. Each generates an HTML report rich in statistics and visualizations but caters to different needs.

Routine Method | Description | Key Features
Generic Analysis | Basic interactive report using YData Profiling | High-level dataset summaries, alerts on stationarity, seasonality, distributions
Advanced Analysis | Comprehensive custom report | Filterable metrics summary, detailed visualizations, target-level plots

Generic Analysis

The Generic Analysis method employs the open-source YData Profiling library to generate a report that includes:

  • Alerts about data characteristics like stationarity and seasonality.
  • A correlation matrix (optional) to help visualize relationships between dimensions.

Required Inputs:
  • Source Data Definition
  • Connection to source data
  • Dimension, date, and value columns
  • Title for the report

Output:

Generic Time Series Report in HTML format, providing an overview of the dataset.

Advanced Analysis

The Advanced Analysis method goes further by creating a more tailored report that includes:

  • A filterable summary table with key metrics
  • Time series decomposition plots
  • Auto-correlation and partial auto-correlation plots

Required Inputs:

Similar to the Generic Analysis, but also includes options for target-level plots

Output:

An Advanced Time Series Report in HTML format, rich with detailed analysis.
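
To give a feel for the visuals this report contains, the sketch below uses statsmodels to decompose a synthetic daily series and draw its auto-correlation and partial auto-correlation plots. The series, seasonal period, and lag counts are illustrative assumptions, not the routine's actual configuration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical daily series with weekly seasonality and a mild trend.
idx = pd.date_range("2024-01-01", periods=120, freq="D")
rng = np.random.default_rng(1)
series = pd.Series(
    50 + 0.2 * np.arange(120) + 10 * np.sin(2 * np.pi * np.arange(120) / 7)
    + rng.normal(scale=2, size=120),
    index=idx,
)

# Decomposition and (partial) auto-correlation plots, the building blocks of
# the target-level visuals described above.
decomposition = seasonal_decompose(series, period=7)
decomposition.plot()
plot_acf(series, lags=30)
plot_pacf(series, lags=30)
plt.show()
```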

Detailed Comparison of Routine Methods

To better illustrate the differences between the two routine methods, here’s a summary table highlighting their features:

Feature | Generic Analysis | Advanced Analysis
Utilizes YData Profiling | Yes | No
Customization | Limited | Extensive
Summary Statistics | Yes | Yes
Time Series Decomposition Plots | No | Yes
Auto-Correlation Plots | Yes | Yes
Warning Flags for Metrics | Yes | Yes
Filterable Summary Table | No | Yes
Correlation Matrix | Optional | Not available

Comparative Analysis of SensibleAI Studio Routines

The following table summarizes key features, input types, and memory capacities for each routine, providing a quick reference for users:

Routine | Purpose | Input Types | Key Outputs | Memory Capacity
Aggregate Data | Data consolidation and summarization | Various data sources | Unified data views | 2.0 GB
Forecast Allocation | Distributes forecasted values | Tabular data | Allocated forecast data | 2.0 GB
Frequency Resampler | Resamples time series data | TimeSeriesTable | Resampled data | 2.0 GB
Kalman Filter V2 | Refines forecasts with filtering | Time series data | Smoothed forecasts | 2.0 GB
Model Forecast Stage | Prepares data for FVA analysis | Tabular connection | Staged forecast data | 3.0 GB
Numeric Data Fill | Fills missing values | TimeSeriesTableDefinition | Complete datasets | 2.0 GB
Prediction Simulator | Automates job execution | Project data | Scheduled job reports | 2.0 GB
Principal Component Analysis | Reduces dimensionality | Data matrices | Principal components and visualizations | 2.0 GB
Replace Special Characters | Cleanses data of special characters | Tabular data | Cleansed datasets | 2.0 GB
Target Flagging Analysis | Evaluates performance metrics | TimeSeriesTable | Metrics reports with visual outputs | 2.0 GB
Time Series Data Analysis | Analyzes time series data | TimeSeriesTable | Detailed analysis reports | 2.0 GB
