Understanding SensibleAI Studio Routines and When to Use Them
This article takes a closer look at 11 routines in SensibleAI Studio, examining the advantages of each and where it provides the most value to users.
Aggregate Data
Overview
The Aggregate Data routine allows users to aggregate data based on specified columns and aggregation methods. It supports a variety of aggregation types, including:
- Sum: Total values for specified groups
- Mean: Average values within groups
- Min: The smallest value in the group
- Max: The largest value in the group
This routine is particularly useful for time series data and can handle multiple columns, making it a versatile option for data analysts.
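To make the idea concrete, the snippet below is a minimal pandas sketch of the kind of grouping and aggregation the routine performs; the column names and figures are hypothetical and not tied to the routine's own inputs.

```python
import pandas as pd

# Hypothetical store-level transaction data
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Store": ["E01", "E02", "W01", "W02"],
    "Revenue": [1200.0, 950.0, 1430.0, 880.0],
    "Transactions": [110, 95, 130, 84],
})

# Group by region and apply different aggregation types per column,
# mirroring the Sum / Mean / Min / Max options listed above
summary = sales.groupby("Region").agg(
    TotalRevenue=("Revenue", "sum"),
    AvgRevenue=("Revenue", "mean"),
    MinTransactions=("Transactions", "min"),
    MaxTransactions=("Transactions", "max"),
)
print(summary)
```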
Key Features
Feature | Description |
---|---|
Flexible Grouping | Group by user-specified fields |
Multiple Aggregations | Supports various aggregation types for different columns |
Easy Input Management | Options to continue, add, or modify aggregations |
Use Cases of Aggregate Data
Aggregated Dimension Insights
A national retail chain with hundreds of stores generates extensive transaction data. To enhance performance insights, the chain employs the Aggregate Data routine to:
- Group Data by Regions: Aggregate sales and transaction data to understand performance across various locations.
- Generate Summary Statistics: Calculate totals, averages, and counts to identify high- and low-performing stores.
By analyzing this aggregated data, the retail chain can develop targeted strategies for improvement and leverage advanced analytics, including machine learning, to predict trends and optimize inventory. This approach not only streamlines data management but also fosters a culture of continuous improvement.
Time Series Trend Analysis
To analyze consumer behavior over time, the retail chain uses the Aggregate Data routine to:
- Analyze Sales Trends: Aggregate sales data across specific time intervals – daily, monthly, quarterly, or annually – to observe trends.
- Adjust Marketing Strategies: Quickly adapt to market demands based on real-time insights from aggregated data.
Menu Item Sales Analysis for Restaurants
As a consultant for a restaurant chain, you can help clients understand their sales data better by aggregating menu item sales. By utilizing the Aggregate Data routine, you can:
- Calculate Average Sales: Determine the average sales for each category of menu item at individual locations.
- Inform Future Forecasting: Use aggregated data to identify trends and inform future inventory and marketing strategies.
Routine Method Overview
Description
The Aggregate Data routine lets users group data by specified fields and apply selected aggregation types. Here’s how it works:
Input Requirement | Description |
---|---|
Source Connection | Connection information for the data source (must be a TabularConnection) |
Columns to Group | Specify which columns to group by |
Aggregation Step Input | Options to continue, add another column, or modify previous inputs |
Output
The output of the Aggregate Data routine provides aggregated data based on user specifications, facilitating insightful analysis.
Summary of Benefits
Benefit | Description |
---|---|
Streamlined Data Management | Reduces effort in data aggregation tasks |
Actionable Insights | Provides clarity for strategic decision-making |
Enhanced Predictive Analytics | Supports advanced forecasting and trend analysis |
Customer-Centric Approach | Improves understanding of consumer behavior |
Forecast Allocation
Overview
The Forecast Allocation routine expands on forecast outputs by allowing users to approximate sales at a granular level. By using historical datasets alongside forecasts, businesses can allocate predicted sales to individual products or stores.
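As a rough illustration of the allocation idea (not the routine's implementation), the pandas sketch below distributes a store-level forecast to individual products in proportion to their historical sales shares; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical historical sales and a store-level forecast
history = pd.DataFrame({
    "Store": ["S1", "S1", "S1"],
    "Product": ["A", "B", "C"],
    "Sales": [600.0, 300.0, 100.0],
})
store_forecast = {"S1": 1200.0}  # forecast produced at the store level

# Each product's share of historical store sales drives its allocation
shares = history.assign(
    Share=history["Sales"] / history.groupby("Store")["Sales"].transform("sum")
)
shares["AllocatedForecast"] = shares["Store"].map(store_forecast) * shares["Share"]
print(shares[["Store", "Product", "AllocatedForecast"]])
```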
Use Cases
- Products Within Stores: When forecasting overall sales across multiple stores, this routine helps estimate sales for individual products, even those not included in the original forecast. It requires setting dimension columns to match historical data and forecasting targets.
- Stores Within Regions by Month: For forecasts predicting sales across regions, the routine can provide detailed forecasts for individual stores within those regions, accounting for monthly sales variations.
- Large Scale Forecasting: When clients need forecasts for a large number of targets, the Forecast Allocation routine lets them forecast a smaller, higher-level set of targets and then allocate values to the full set based on historical averages.
Routine Method Overview
Input Requirements
Users must provide historical data, define allocation and dimension columns, and specify date and value columns.
Output
The routine generates an allocation dataset reflecting the applied forecast.
Frequency Resampler
Overview
The Frequency Resampler is designed for time series data, allowing users to change the periodicity of their datasets. This can involve both upward aggregations (e.g., daily to weekly) and downward allocations (e.g., monthly to daily). The routine supports various summarization methods, such as sum and average, enabling efficient exploration of data trends.
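The pandas sketch below illustrates both directions with synthetic data: an upward aggregation from daily to weekly, and a downward allocation that spreads a monthly total evenly across days. The routine's own options and column names may differ.

```python
import pandas as pd

daily = pd.DataFrame(
    {"Sales": list(range(1, 15))},
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Upward aggregation: daily -> weekly totals
weekly = daily.resample("W").sum()

# Downward allocation: spread a monthly total of 310 evenly across January
jan_days = pd.date_range("2024-01-01", "2024-01-31", freq="D")
jan_daily = pd.DataFrame({"Sales": 310.0 / len(jan_days)}, index=jan_days)
```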
Key Features
- Flexibility in Periodicity: Users can quickly resample data to different frequencies, facilitating various modeling scenarios.
- Aggregation Methods: Users can choose from multiple aggregation techniques to best fit their data analysis needs.
Use Cases
- Data Exploration: For businesses like Customer A, the Resampler allows exploration of historical sales data at different granularities (daily, weekly, monthly) to optimize forecasting accuracy.
- Anomaly Detection: Companies, such as Company B, can aggregate high-frequency IoT data into hourly or daily summaries to enhance anomaly detection capabilities.
- Pre-processing for SensibleAI Forecast (FOR): Consultants can resample data before loading it into SensibleAI Forecast to ensure the accuracy of predictions at the desired granularity.
Routine Method Overview
The Resample routine requires various inputs:
- Connection Type: TabularConnection, SQLTabularConnection, etc.
- Frequency Specifications: Source and destination frequencies (e.g., daily to monthly).
- Key Columns: Columns used as keys for the resampling process.
Output
The routine generates a resampled dataset that can be used for further analysis.
Kalman Filter V2
Overview
The Kalman Filter V2 excels at cleansing time series data by predicting and correcting estimates based on noisy measurements. It updates predictions iteratively, filtering out noise and revealing underlying trends.
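For intuition, here is a minimal one-dimensional predict/correct loop of the kind a Kalman filter runs; it is not the routine's implementation, and the noise parameters are arbitrary assumptions.

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-3, measurement_var=0.5):
    estimate, error = measurements[0], 1.0
    smoothed = []
    for z in measurements:
        # Predict: carry the estimate forward and let its uncertainty grow
        error += process_var
        # Correct: blend the prediction with the new noisy measurement
        gain = error / (error + measurement_var)
        estimate += gain * (z - estimate)
        error *= (1 - gain)
        smoothed.append(estimate)
    return np.array(smoothed)

noisy = 10 + np.random.randn(50)      # noisy readings around a level of 10
print(kalman_1d(noisy)[-5:])          # last few filtered values
```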
Key Features
- Noise Reduction: The filter balances predicted states against new measurements, enhancing the accuracy of time series data.
- Dynamic Updates: It adapts continuously as new data arrives, making it ideal for dynamic environments like finance.
Use Cases
- Handling Missing Data: The Kalman Filter is instrumental for businesses experiencing data gaps due to system outages or maintenance, ensuring continuity in data analysis.
- Dealing with Anomalies: During events like the COVID-19 pandemic, the filter can identify and remove outliers from datasets, improving forecasting models.
- Cleansing Time Series Data: It effectively corrects point-based anomalies, ensuring data integrity and reliability for predictive modeling.
Routine Methodology
The Kalman Filter V2 requires:
- Configuration Method: Automatic or manual optimization of hyperparameters.
- Connection Type: Similar to the Resample routine, it uses various connection types.
- Dimension Columns: Specifies the columns used for filtering and cleansing.
Output
The routine provides cleansed data, including original and filtered values.
Model Forecast Stage
Overview
The Model Forecast Stage is designed to transform traditional forecasting tables from SensibleAI Forecast into a format suitable for ingestion into Forecast Value Add (FVA) dashboards. This routine simplifies the selection of top-performing models for each business target prediction, allowing users to focus on the most reliable forecasts.
Key Use Cases
1. Cascading Stage Best ML Models
- Scenario: A user updates their predictions and wants to filter for the best-ranked model per target.
- Process: The user specifies a hierarchy for model selection: Best ML, Best Intelligent, Best, and Best Baseline (see the sketch after this list). The routine trims forecast ranges to match actuals and avoids overlapping forecasts.
- Outcome: A refined table comparing SensibleAI Forecast predictions against customer benchmarks.
2. Backtest Model Forecast
- Scenario: A consultant experiments with various project configurations and needs to evaluate their performance.
- Process: The routine filters Backtest Model Forecast (BMF) tables, selecting top models based on specified criteria.
- Outcome: Multiple FVA tables that feed into a Forecast Snapshot dashboard for direct comparison.
3. Implementation Comparisons
- Scenario: Consultants must provide clear comparisons between forecasts generated by SensibleAI Forecast and customer forecasts.
- Outcome: A streamlined process for selecting models that enhances clarity and insight during client engagements.
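The sketch below illustrates the cascading selection idea in pandas: for each target, keep the highest-priority model class that is actually present. The simplified table and hierarchy labels are taken from the use case above; the real staging logic (trimming and overlap handling) is more involved.

```python
import pandas as pd

# Simplified, illustrative model forecast table
forecasts = pd.DataFrame({
    "TargetName": ["T1", "T1", "T2", "T2"],
    "Model": ["Best ML", "Best Baseline", "Best Intelligent", "Best Baseline"],
    "Value": [105.0, 98.0, 44.0, 41.0],
})

# Hierarchy from the Cascading Stage use case, highest priority first
hierarchy = ["Best ML", "Best Intelligent", "Best", "Best Baseline"]
forecasts["Priority"] = forecasts["Model"].map({m: i for i, m in enumerate(hierarchy)})

# Keep the best-ranked available model per target
staged = (
    forecasts.sort_values(["TargetName", "Priority"])
             .drop_duplicates(subset="TargetName", keep="first")
)
print(staged[["TargetName", "Model", "Value"]])
```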
Input and Output Specifications
Input Component | Description |
---|---|
Source Connection | Connection details for accessing the source data (Tabular Connection) |
Configure Convert Types | Configuration options for converting data types as needed |
Hierarchical Transformations | Select hierarchical transformations for the model forecast table |
Overlapped Forecasts Handling | Options for managing overlapping forecasts: Use Latest, Use Oldest, No Merge |
Forecast Bounds Handling | Options for trimming forecast values relative to actual values |
Actuals Handling | Options for managing actuals from the DMF table: Remove, Copy per Version |
Output
A staged data table with hierarchical selections of top-ranking models, ready for FVA analysis.
Example Output Schema
Column Name | Data Type | Is Nullable |
---|---|---|
Model | String | False |
TargetName | String | False |
Value | Float64 | False |
Date | DateTime | False |
ModelRank | Int64 | False |
PredictionCallID | Object | False |
... | ... | ... |
Numeric Data Fill
Overview
The Numeric Data Fill routine addresses the challenge of null values in datasets, ensuring that analysis and machine learning models are based on complete data. This routine offers various strategies to fill missing values, thus enhancing data integrity.
Key Features
Filling Strategies:
- Options include filling with zero, mean, median, mode, min, max, custom values, forward fill, and backward fill.
- Forward and backward fills leverage the last known values for matching dimensions, adding contextual relevance (see the sketch after this list).
Use Cases:
- Scenario: A dataset has records but contains null values that could skew analysis.
- Implementation: Users can choose an appropriate fill strategy based on the nature of the data.
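The pandas sketch below shows two of the listed strategies on hypothetical data: a simple mean fill, and a per-dimension forward fill with a backward fill for leading gaps. It is illustrative only and not tied to the routine's schema.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": ["S1", "S1", "S1", "S2", "S2"],
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-01", "2024-01-02"]),
    "Sales": [100.0, np.nan, 120.0, np.nan, 80.0],
})

# Strategy 1: fill nulls with the overall mean
mean_filled = df.assign(Sales=df["Sales"].fillna(df["Sales"].mean()))

# Strategy 2: forward fill the last known value per store, then backward
# fill any leading gaps (e.g., S2's first day)
ffill_bfill = df.assign(
    Sales=df.groupby("Store")["Sales"].transform(lambda s: s.ffill().bfill())
)
```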
Input and Output Specifications
Input Component | Description |
---|---|
Source Data Definition | Connection details and specifications for the source data (TimeSeriesTable) |
Dimension Columns | Columns used as dimensions for filling |
Date Column | Column representing the date |
Value Column | Column containing the values to fill |
Data Fill Definition | Specifies fill strategies for columns |
Output
A data table where missing values have been filled according to the specified strategies.
Prediction Simulator
Overview
The Prediction Simulator allows users to manage and execute multiple data jobs on any project that has passed the data load step in FOR. It replaces the traditional SIM solution, providing a streamlined process for running jobs such as pipeline, deploy, prediction, model rebuild, and project copy. Importantly, users can upload all necessary source data, and the simulator handles updates based on user-defined dates.
Key Features
- Automated Job Management: Users can schedule jobs to run in a specific order, reducing the risk of projects sitting idle.
- User-Friendly Scheduling: Allows running multiple jobs overnight without needing to monitor them actively.
Use Cases
- Busy Consultants: Ideal for consultants juggling multiple projects across various environments, enabling efficient job management.
- Overnight Processing: Users can execute a series of jobs overnight, ensuring all tasks complete by morning.
Routine Methods
Method | Description | Memory Capacity |
---|---|---|
Simulate Predictions | Simulates and runs a specified list of tasks in order | 2.0 GB |
Constructor | Initializes the prediction simulator routine | 0 GB |
Principal Component Analysis
Overview
PCA is a statistical technique used for dimensionality reduction, data compression, and feature extraction. It identifies the principal components that capture the most variance in the data, simplifying complex datasets while retaining essential information.
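A short scikit-learn sketch of the underlying technique follows; the data is synthetic, and standardizing features before PCA is an assumption (a common practice), not a statement about the routine's internals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)                  # 200 rows, 10 synthetic features
X_scaled = StandardScaler().fit_transform(X) # standardize before PCA

pca = PCA(n_components=3)                    # keep the top 3 components
components = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)         # variance captured per component
```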
Key Features
- Dimensionality Reduction: Reduces complexity by transforming datasets into principal components.
- Enhanced Visualization: Makes it easier to analyze and visualize high-dimensional data.
Use Cases
- Anomaly Detection: Identifies unusual patterns in transaction data, aiding in fraud detection.
- Forecasting: Simplifies forecasting models by identifying significant components from various features.
Routine Methods
Method | Description | Memory Capacity |
---|---|---|
Run PCA | Preprocesses and runs PCA on the input dataset. | 2.0 GB |
Replace Special Characters
Overview
This routine focuses on cleansing datasets by identifying and replacing special characters based on a defined schema. It allows users to target specific columns, ensuring data consistency and validity.
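The idea can be sketched in a few lines of pandas with regular expressions; the characters targeted, the replacement, and the column name below are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"EntityID": ["ACME#01", "ACME@02", "beta/03"]})

# Hypothetical find -> replace pairs applied to one column
replacements = {"#": "-", "@": "-", "/": "-"}
df["EntityID"] = df["EntityID"].replace(replacements, regex=True)
print(df["EntityID"].tolist())   # ['ACME-01', 'ACME-02', 'beta-03']
```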
Key Features
- Customizable Cleansing: Users can define multiple find-and-replace operations for various columns.
- Improved Data Quality: Ensures data is clean and ready for analysis, reducing errors in subsequent processing.
Use Cases
- Data Standardization: Helps standardize entity identifiers in datasets for accurate forecasting and analysis.
- Error Prevention: Cleanses unrecognized characters that could cause errors during data ingestion.
Routine Methods
Method | Description | Memory Capacity |
---|---|---|
Cleanse Data | Finds and replaces special characters based on user input. | 2.0 GB |
Target Flagging Analysis
Overview
In data analytics, understanding metrics is crucial for making informed decisions. The Target Flagging Analysis (Stateless) routine calculates various metrics from both source data and model forecast data, allowing users to identify key performance indicators, assess forecast accuracy, and flag potential issues within target dimensions. This section explores the functionality, use cases, and routine methods associated with the analysis.
The Target Flagging Analysis routine can generate a variety of metrics, enabling users to evaluate data quality and forecast accuracy. Here’s a breakdown of its key components:
Key Metrics Generated
Metric | Source Data | Model Forecast Data |
---|---|---|
Actuals Summation | ✔️ | |
Target Start/End Date | ✔️ | ✔️ |
Collection Lag Days/Periods | ✔️ | |
Start Up Lag Days/Periods | ✔️ | |
IsForecastable | ✔️ | |
Local Density | ✔️ | |
Global Density | ✔️ | |
Mean Absolute Error (MAE) | ✔️ | |
Root Mean Squared Error | ✔️ | |
Bias Error | ✔️ | |
Growth Rate | ✔️ |
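For reference, the three error metrics in the table reduce to simple formulas. The numpy sketch below uses illustrative actual and forecast values, not routine output.

```python
import numpy as np

actuals = np.array([100.0, 110.0, 95.0, 120.0])
forecast = np.array([98.0, 115.0, 90.0, 118.0])

errors = forecast - actuals
mae = np.mean(np.abs(errors))          # Mean Absolute Error
rmse = np.sqrt(np.mean(errors ** 2))   # Root Mean Squared Error
bias = np.mean(errors)                 # Bias Error (average over/under-forecast)
print(mae, rmse, bias)
```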
Visual Outputs
The routine also generates plots comparing actual values against MAE% and Score%. These visualizations aid in identifying areas needing attention.
Use Cases
The routine supports multiple use cases, each tailored to specific user needs. Here’s a summary of each:
1. Generate All Metrics Without Flags
This routine creates metrics tables for source and model forecast data, helping users understand the overall data landscape without flagging any issues.
Output:
- Metrics tables for both source data and forecast data
- MAE vs Actuals and Score vs Actuals plots
2. Generate Source Metrics Without Flags
This is ideal for data analysts interested in insights from source time series data alone.
Output:
- Source data metrics table
- Excludes plots and flagging
3. Generate Forecast Metrics Without Flags
This routine focuses on the forecast data, providing insights into its accuracy and quality.
Output:
- Forecast data metrics table
- MAE vs Actuals and Score vs Actuals plots
4. All Metrics Analysis
For implementation consultants, this routine assists in evaluating forecasts across all targets, helping to pinpoint inaccuracies caused by limited data points.
Output:
- Combined metrics table for source and forecast data
Detailed Routine Methods
The routine consists of three primary methods, each serving a unique purpose. Below is a table summarizing their functionalities:
Routine Method | Description | Target Flagging | HTML Output |
---|---|---|---|
Source Metrics Analysis | Generates source metrics without flagging | No | Interactive report |
Forecast Metrics Analysis | Generates forecast metrics without flagging | No | Interactive report |
All Metrics Analysis | Combines source and forecast metrics without flagging | No | Comprehensive report |
Input Requirements
Each method has specific input requirements, such as data connection types and dimension specifications. Common inputs include:
- Source Data Definition: Must be a TimeSeriesTableDefinition
- Connection: Can be SQLTabularConnection, FileTabularConnection, etc.
- Dimension Columns, Date Column, Value Column: Specify the relevant columns
Output Formats
The outputs can be generated in various formats, including HTML reports and Parquet files, ensuring flexibility for users in analyzing their data.
Time Series Data Analysis
Time series analysis plays a crucial role in data analytics, particularly for businesses that track data across various targets. This section explores the Time Series Data Analysis routine, designed to enhance understanding and streamline insights from time series datasets.
Use Cases
1. Implementation Insights for Retail
As an implementation consultant, you may find yourself working with a retail customer who tracks daily data across multiple targets. Understanding the underlying trends in their data can be challenging, especially if it is new territory for both you and the client. The Time Series Data Analysis routine allows you to quickly generate insights and share findings, enhancing both your understanding and that of the customer before moving to the next phases of implementation.
2. Quality Assurance in FOR Projects
Quality checks are critical in validating dataset integrity. A comprehensive time series analysis report can significantly reduce implementation timelines. By generating target-level statistics and visuals, the routine not only expedites the process of deriving actionable insights but also promotes transparent communication with the client regarding data quality.
3. Exploratory Data Analysis
This routine serves a diverse audience—data scientists, analysts, and business professionals alike—helping them extract meaningful insights from time series data. The comprehensive report it generates includes visualizations and interpretations, facilitating a deeper understanding of the dataset.
Routine Methods
Overview of Routine Methods
The Time Series Data Analysis routine offers two primary methods: Generic Analysis and Advanced Analysis. Each generates an HTML report rich in statistics and visualizations but caters to different needs.
Routine Method | Description | Key Features |
---|---|---|
Generic Analysis | Basic interactive report using YData Profiling | High-level dataset summaries, alerts on stationarity, seasonality, distributions |
Advanced Analysis | Comprehensive custom report | Filterable metrics summary, detailed visualizations, target-level plots |
Generic Analysis
The Generic Analysis method employs the open-source YData Profiling library to generate a report that includes:
- Alerts about data characteristics like stationarity and seasonality.
- A correlation matrix (optional) to help visualize relationships between dimensions.
Required Inputs:
- Source Data Definition
- Connection to source data
- Dimension, date, and value columns
- Title for the report
Output:
A Generic Time Series Report in HTML format, providing an overview of the dataset.
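Because the method wraps the open-source YData Profiling library, a comparable report can be produced directly with that library for experimentation. The file and column names below are hypothetical, and this is not the routine itself.

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

df = pd.read_csv("sales_history.csv", parse_dates=["Date"])  # hypothetical file

report = ProfileReport(
    df,
    tsmode=True,     # time series mode: adds stationarity/seasonality alerts
    sortby="Date",   # order observations chronologically
    title="Generic Time Series Report",
)
report.to_file("generic_time_series_report.html")
```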
Advanced Analysis
The Advanced Analysis method goes further by creating a more tailored report that includes:
- A filterable summary table with key metrics
- Time series decomposition plots
- Auto-correlation and partial auto-correlation plots
Required Inputs:
Similar to the Generic Analysis, but with additional options for target-level plots.
Output:
An Advanced Time Series Report in HTML format, rich with detailed analysis.
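The Advanced Analysis report is custom, but similar decomposition and (partial) auto-correlation plots can be reproduced with statsmodels for a single series; the synthetic data and the weekly period of 7 below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Synthetic daily series with a trend and weekly seasonality
idx = pd.date_range("2024-01-01", periods=120, freq="D")
series = pd.Series(
    10 + 0.05 * np.arange(120) + np.sin(np.arange(120) * 2 * np.pi / 7),
    index=idx,
)

seasonal_decompose(series, period=7).plot()  # trend / seasonal / residual panels
plot_acf(series, lags=30)                    # auto-correlation plot
plot_pacf(series, lags=30)                   # partial auto-correlation plot
plt.show()
```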
Detailed Comparison of Routine Methods
To better illustrate the differences between the two routine methods, here’s a summary table highlighting their features:
Feature | Generic Analysis | Advanced Analysis |
---|---|---|
Utilizes YData Profiling | Yes | No |
Customization | Limited | Extensive |
Summary Statistics | Yes | Yes |
Time Series Decomposition Plots | No | Yes |
Auto-Correlation Plots | Yes | Yes |
Warning Flags for Metrics | Yes | Yes |
Filterable Summary Table | No | Yes |
Correlation Matrix | Optional | Not available |
Comparative Analysis of SensibleAI Studio Routines
The following table summarizes key features, input types, and memory capacities for each routine, providing a quick reference for users:
Routine | Purpose | Input Types | Key Outputs | Memory Capacity |
---|---|---|---|---|
Aggregate Data | Data consolidation and summarization | Various data sources | Unified data views | 2.0 GB |
Forecast Allocation | Distributes forecasted values | Tabular data | Allocated forecast data | 2.0 GB |
Frequency Resampler | Resamples time series data | TimeSeriesTable | Resampled data | 2.0 GB |
Kalman Filter V2 | Refines forecasts with filtering | Time series data | Smoothed forecasts | 2.0 GB |
Model Forecast Stage | Prepares data for FVA analysis | Tabular connection | Staged forecast data | 3.0 GB |
Numeric Data Fill | Fills missing values | TimeSeriesTableDefinition | Complete datasets | 2.0 GB |
Prediction Simulator | Automates job execution | Project data | Scheduled job reports | 2.0 GB |
Principal Component Analysis | Reduces dimensionality | Data matrices | Principal components and visualizations | 2.0 GB |
Replace Special Characters | Cleanses data of special characters | Tabular data | Cleansed datasets | 2.0 GB |
Target Flagging Analysis | Evaluates performance metrics | TimeSeriesTable | Metrics reports with visual outputs | 2.0 GB |
Time Series Data Analysis | Analyzes time series data | TimeSeriesTable | Detailed analysis reports | 2.0 GB |