Data Quality Enhancement Best Practices

Author: Matthew Rutty, Created: 2024-05-01

Suppose you have progressed through the initial validation of your data: you understand where it comes from, what the different dimensions mean, and any irregularities it contains. Now you want to determine the best approach for processing your data before loading it into SensibleAI Forecast. This article walks through the key principles involved in taking a dataset and making it ready for SensibleAI Forecast.

Data Quality Resolutions

Data Quality Issue 1: Sparse Data

Sparse data is a common issue, especially at coarser granularities (such as monthly data), where each data point is extremely critical. A sparse dataset can reduce accuracy because there is insufficient history from which to learn trends and causality.

One method for resolving sparse data is grouping. Grouping allows the models to learn from other targets, which increases the pool of trends and analyses that can be performed. One major tradeoff of grouping, however, is a reduction in the granular insights available through the Explanations tab in your model. Grouping methodology should be a key first area of study at the beginning of RPE (Rapid Project Experimentation).
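As a rough illustration, grouping can be performed before loading the data by rolling sparse, low-level targets up to a shared attribute. The sketch below uses pandas with hypothetical column names (target_id, product_family, date, value); it is not a SensibleAI Forecast API, just one way to prepare a grouped input table.

```python
import pandas as pd

# Hypothetical long-format history; column names are illustrative assumptions.
history = pd.DataFrame({
    "target_id": ["sku_a", "sku_a", "sku_b", "sku_b"],
    "product_family": ["widgets", "widgets", "widgets", "widgets"],
    "date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01", "2024-02-01"]),
    "value": [12.0, 0.0, 3.0, 7.0],
})

# Roll sparse SKU-level targets up to a shared group so models can learn
# trends from the pooled history; the tradeoff is losing SKU-level detail
# in downstream explanations.
grouped = (
    history
    .groupby(["product_family", "date"], as_index=False)["value"]
    .sum()
)
print(grouped)
```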

Data Quality Issue 2: Fundamental Changes in Business Over Time

Situations where a business fundamentally shifts (brick-and-mortar to online, product generations, technology changes) are frequent and important to note. A prominent example is the COVID-19 pandemic and its aftermath.

Almost every business experienced shifts during the pandemic. In this situation, we can add a Known-in-Advance feature or event set that indicates the start and end dates of the period, making clear that the current state is not under those conditions.
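A minimal sketch of building such an indicator with pandas is shown below; the column names and the start and end dates are illustrative assumptions, not values prescribed by SensibleAI Forecast.

```python
import pandas as pd

# Hypothetical monthly forecast calendar; dates and column names are illustrative.
dates = pd.date_range("2019-01-01", "2022-12-01", freq="MS")
features = pd.DataFrame({"date": dates})

# Known-in-Advance style indicator: 1 inside the disrupted window, 0 outside,
# so the model can distinguish that period from the current state.
pandemic_start = pd.Timestamp("2020-03-01")
pandemic_end = pd.Timestamp("2021-06-01")
features["pandemic_period"] = (
    features["date"].between(pandemic_start, pandemic_end).astype(int)
)
print(features.head())
```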

However, many businesses saw effects that permeated into their ‘steady state’ after the pandemic ended. In such cases, the best course of action is often to remove those periods from the data altogether. The more data that can be provided for machine learning, the better, as long as patterns remain relatively consistent from start to end.
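If removal is the chosen route, the disrupted window can simply be filtered out of the training history before loading. The sketch below assumes the same illustrative date bounds and column names as above.

```python
import pandas as pd

# Hypothetical monthly history; date bounds and column names are illustrative.
history = pd.DataFrame({
    "date": pd.date_range("2019-01-01", "2022-12-01", freq="MS"),
    "value": range(48),
})

# Drop the disrupted window entirely so only consistent periods remain.
pandemic_start = pd.Timestamp("2020-03-01")
pandemic_end = pd.Timestamp("2021-06-01")
trimmed = history[~history["date"].between(pandemic_start, pandemic_end)]
print(len(history), "->", len(trimmed))
```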

Data Quality Issue 3: Outliers

Anomalous behavior within data is completely normal but must be handled properly to ensure that the models respond correctly to both the presence and absence of such behavior in the future.

There are three major approaches to handling these:

  1. First, if the anomaly can be explained, we can provide feature and event sets that explain these occurrences to machine learning models and allow them to make accurate predictions in the future.
  2. If the anomaly cannot be explained, but data must still exist, you can use mathematical methods to cleanse it. Some examples are mean imputation, weight trimming, and robust estimation techniques (like M-estimation); see the sketch after this list.
  3. Finally, if the outlier has no relevance to the dataset, the value could just be dropped (by imputing to zero in SensibleAI Forecast).
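As one illustration of the second approach, the sketch below flags points outside the interquartile-range fences and replaces them with the mean of the remaining points. The 1.5x multiplier is a common convention, not a SensibleAI Forecast setting, and other methods from the list (weight trimming, M-estimation) would follow the same detect-then-replace pattern.

```python
import pandas as pd

# Hypothetical target series with one extreme value.
values = pd.Series([10.0, 12.0, 11.0, 9.0, 250.0, 13.0, 10.0])

# Flag points outside the interquartile-range fences (1.5x IQR is a
# common convention, used here as an illustrative assumption).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Replace flagged points with the mean of the remaining points.
cleansed = values.mask(is_outlier, values[~is_outlier].mean())
print(cleansed)
```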

It is normal to have irregularities in source data, arising from a mixture of inconsistent collection practices, a lack of data documentation, and general business trends and variance. To maximize results in SensibleAI Forecast and any predictive analytics, the irregularities within the data must be documented and presented in a way that makes them predictable.