SensibleAI Forecast: Data Pipeline Guidelines
The purpose of this article is to provide rules, recommendations, and guidelines for building SensibleAI Forecast ingress and egress data pipelines.
As companies look to bring larger target and data volumes into OneStream to train SensibleAI Forecast models, there must be an effective and safe way to process those large data volumes. As a data pipeline developer, being cautious and following these best practices is incredibly important, given that it is possible to tip over OneStream environments without proper care.
Rule #1: Avoid SQL and LINQ Cross Joins
Understanding Cross Joins
A Cross Join in SQL or C# (via LINQ) is a type of join that returns the Cartesian product of two tables. This means that every row from the first table is combined with every row from the second table. If the first table has m rows and the second table has n rows, the resulting dataset will have m * n rows.
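To make that row-count math concrete, here is a minimal, hypothetical LINQ sketch (plain integer sequences stand in for real tables) showing how quickly the Cartesian product grows:
using System;
using System.Linq;

// Hypothetical stand-ins for two real tables (1,000 rows each)
var tableA = Enumerable.Range(1, 1000);
var tableB = Enumerable.Range(1, 1000);

// Two "from" clauses with no join condition produce the Cartesian product
var cartesian = from a in tableA
                from b in tableB
                select (a, b);

Console.WriteLine(cartesian.Count());   // 1,000,000 rows (m * n)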
Why Cross Joins Should Be Avoided
- Performance Issues:
- Multiplicative Data Growth: Since Cross Joins combine every row from the first table with every row from the second table, the result can quickly become unmanageably large. For instance, joining two tables with 1,000 rows each results in a dataset of 1,000,000 rows. This can cause significant performance degradation, leading to excessive memory consumption, long processing times, and even system crashes.
- Impact on OneStream: In the context of SensibleAI Forecast data pipelines, such large datasets can overwhelm the OneStream environment, leading to potential system instability or failure. The cascading effect of a poorly performing query can tip over the entire environment, impacting not just the data pipeline but other processes relying on the system.
- Unintentional Data Explosion:
- Complexity and Risk: Cross Joins are often created unintentionally, especially when a developer forgets to include a proper join condition between tables. This can lead to an unexpected data explosion, where the resulting dataset is orders of magnitude larger than intended. Such mistakes can be difficult to trace and rectify, making it a risky operation.
- Troubleshooting Difficulty: Diagnosing the root cause of performance issues related to Cross Joins can be challenging. The sheer volume of data produced by a Cross Join can obscure other issues in the data pipeline, leading to long and costly debugging sessions.
Best Practices to Avoid Cross Joins
- Explicit Join Conditions: Always ensure that join conditions are explicitly stated to prevent accidental Cross Joins. This includes checking for any missing or incorrect conditions in your SQL queries or LINQ expressions (see the sketch after this list).
- Use SensibleAI Studio: Very often, a Cross Join is only wanted as a precursor to another data transformation (like a numeric data fill). Check whether the AI Studio Data Transformation routines can solve your use case directly. For example, the Numeric Data Fill routine in AI Studio removes the need for a Cross Join when filling missing records in a time series dataset. Additionally, AI Studio comes with sizeable memory capacity, allowing data pipeline developers to process more data than they could purely on OneStream C# data management servers.
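As a sketch of the first point (the actuals/drivers data below is hypothetical, not a real SensibleAI Forecast table), compare a LINQ query that accidentally omits the join condition with one that states it explicitly:
using System;
using System.Linq;

// Hypothetical in-memory rows standing in for an actuals table and a driver table
var actuals = new[]
{
    new { TargetName = "T1", Period = "2024-01", Amount = 100m },
    new { TargetName = "T2", Period = "2024-01", Amount = 200m },
};
var drivers = new[]
{
    new { TargetName = "T1", Period = "2024-01", DriverValue = 1.5m },
    new { TargetName = "T2", Period = "2024-01", DriverValue = 2.5m },
};

// Accidental cross join: no join condition, so every actual pairs with every driver (4 rows)
var accidental = (from a in actuals
                  from d in drivers
                  select new { a.TargetName, d.DriverValue }).ToList();

// Explicit join condition: only matching TargetName/Period rows are combined (2 rows)
var joined = (from a in actuals
              join d in drivers
                  on new { a.TargetName, a.Period } equals new { d.TargetName, d.Period }
              select new { a.TargetName, a.Period, a.Amount, d.DriverValue }).ToList();

Console.WriteLine($"Accidental cross join: {accidental.Count} rows, explicit join: {joined.Count} rows");
The omitted join condition is easy to miss in code review, which is why naming the join keys explicitly is worth the extra lines.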
Rule #2: Partition Data Pipeline Inputs Over a Certain Size
Understanding the Need for Partitioning
When dealing with large datasets in SensibleAI Forecast data pipelines, it is crucial to manage the size of the data being processed to avoid performance bottlenecks and system limitations. One of the key challenges arises when working with large tables, such as a model forecast table containing 30 columns and 20 million records. Such a dataset would take up approximately 10 GB of memory when loaded into a C# DataTable.
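That 10 GB figure can be sanity-checked with a rough back-of-the-envelope calculation; the per-cell cost used below is an assumption for illustration, not a measured value:
using System;

// Rough estimate only; actual DataTable memory usage depends on column data types,
// string lengths, and internal storage overhead.
long rows = 20_000_000;
long columns = 30;
long assumedBytesPerCell = 17;   // assumption: boxed value plus DataTable bookkeeping

double estimatedGb = rows * columns * assumedBytesPerCell / 1_000_000_000d;
Console.WriteLine($"Estimated size: ~{estimatedGb:F1} GB");   // ~10.2 GB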
Limitations and Challenges
- Memory Constraints in OneStream Environments:
- Data Management Servers, where these data pipelines typically run, often have memory constraints. In a standard OneStream environment, these servers generally have 32 GB of memory. Loading large datasets into memory, especially those close to or exceeding 10 GB, can exhaust available resources, leading to slowdowns or failures.
Best Practices for Partitioning
- Maintain Manageable Data Volumes:
- As a best practice, keep data volumes in a C# DataTable under 10 million rows at any given time within your data pipeline. This ensures that you remain well within both the C# DataTable limits and the available memory on the server, preventing potential system issues.
- Only One 10 Million Row DataTable at a Time:
- To further safeguard against memory constraints and performance bottlenecks, limit the number of large DataTable objects in memory at any given time. Specifically, ensure that your pipeline never processes more than one DataTable containing up to 10 million rows simultaneously. Attempting to handle multiple large DataTable objects concurrently can lead to excessive memory consumption and severe performance issues, potentially causing the entire pipeline to fail. (See Rule 3 for more detail.)
- Implementation Strategy: When working with multiple large datasets, process each one sequentially. Load one dataset into a DataTable, perform the necessary operations, and then clear or dispose of the DataTable before loading the next dataset (use the using statement declaration). This strategy helps maintain a stable memory footprint and ensures that your data pipeline remains performant and reliable. (See Rule 3 for more detail.)
- Batch Processing for Large Datasets:
- For datasets that exceed 10 million rows, partition the data into smaller, more manageable batches. This can be done by working with subsets of targets or records at a time, and then using a loop to process each batch sequentially. This approach minimizes memory usage and reduces the risk of hitting row count limitations. See the appendix below for a data pipeline example that is batched.
- Optimize Pipeline Performance:
- Implementing a batching strategy not only helps in managing memory but also optimizes overall pipeline performance. By processing smaller chunks of data, you reduce the likelihood of performance bottlenecks, allowing your data pipeline to run more efficiently.
- Test and Monitor:
- Regularly test your data pipelines with varying data sizes to ensure they perform well under different conditions. Monitoring memory usage and performance metrics during these tests can provide valuable insights, helping you fine-tune your partitioning strategy (a minimal monitoring sketch follows this list).
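For the monitoring point above, here is a minimal sketch (the checkpoint labels are hypothetical and this is not part of any OneStream API) that can be dropped into a test run to watch memory at batch boundaries:
using System;
using System.Diagnostics;

// Logs the managed heap size and the process working set at a named checkpoint,
// e.g., before and after each batch while testing a pipeline.
static void LogMemory(string checkpoint)
{
    long managedBytes = GC.GetTotalMemory(forceFullCollection: false);
    long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;
    Console.WriteLine($"{checkpoint}: managed ~{managedBytes / 1_000_000} MB, working set ~{workingSetBytes / 1_000_000} MB");
}

LogMemory("Before batch");
// ... process a batch here ...
LogMemory("After batch");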
By following these partitioning guidelines, you can effectively manage large data volumes within your SensibleAI Forecast data pipelines, ensuring smooth operation even with substantial datasets.
Rule of Thumb: Use partitioning in data pipelines if working with target intersection volumes greater than 5,000 targets. Doing so will ensure your data pipeline scales to future target volumes without concern of memory issues.
Elite Practice: Build your data pipelines with partitioning from the start. You can then control scalability by dialing the partitioning batch size up or down as needed.
Rule #3: Clean up your DataTables When Done Using Them
The Importance of Cleaning Up DataTable Objects
In any data pipeline, particularly those involving large datasets as in SensibleAI Forecast, effective memory management is crucial. One key aspect of this management is ensuring that DataTable objects are properly cleaned up after their use. Failing to do so can lead to memory leaks, which in turn can cause performance degradation, crashes, or even bring down an entire system.
Best Practice: Use using Statements for Automatic Cleanup
The recommended best practice for cleaning up DataTable objects is to encapsulate their usage within a using statement. The using statement automatically handles the disposal of the DataTable when it goes out of scope, ensuring that memory is freed immediately, even if an exception occurs.
Example:
using (DataTable dataTable = new DataTable())
{
    // Perform operations on the dataTable
} // The dataTable is automatically disposed of here
- Why Use using? The using statement simplifies memory management by ensuring that the DataTable is disposed of as soon as it is no longer needed. This prevents memory leaks and keeps the data pipeline running efficiently. Additionally, it reduces the risk of human error, as you don’t have to remember to manually dispose of the object.
Additional Tips for Effective Cleanup
- Avoid Unnecessary Retention:
- Don’t keep DataTable objects in memory longer than necessary. Ensure they are disposed of or go out of scope as soon as their purpose is fulfilled.
- Combine with Other Cleanup Strategies:
- If your pipeline involves complex objects or multiple DataTable instances, consider combining using statements with other cleanup strategies, such as clearing the DataTable before disposal to further reduce memory usage.
In summary, the most effective way to clean up DataTable objects is by using using statements. This method ensures automatic disposal and reduces the risk of memory leaks, leading to more stable and efficient data pipelines. For existing pipelines that may not have been built with using statements in mind, the manual disposal method offers a straightforward way to retrofit cleanup mechanisms. Regardless of the approach, consistently cleaning up your DataTable objects is essential to maintaining the performance and reliability of your SensibleAI Forecast data pipelines.
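For the retrofit case mentioned above, here is a minimal sketch of the manual disposal approach (it reuses the hypothetical GetModelForecastTable helper that appears in the examples later in this article), combined with clearing the table before disposal:
using System.Data;

DataTable dtResults = null;
try
{
    dtResults = GetModelForecastTable();   // hypothetical load step
    // ... perform operations on dtResults ...
}
finally
{
    // Cleanup runs even if an exception is thrown mid-pipeline
    if (dtResults != null)
    {
        dtResults.Clear();     // release row data before disposal to reduce memory pressure
        dtResults.Dispose();   // free the DataTable itself
    }
}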
Putting it Together
In this section, we will break down the provided C# code and discuss how memory management works, why using statements help manage memory, and how we can properly partition a data pipeline.
Bad: The Unpartitioned Data Pipeline
Imagine a scenario where we attempt to process the entire dataset at once, without partitioning it into smaller batches. The code might look something like this:
// Load the entire dataset into a single DataTable
DataTable dtFullData = GetModelForecastTable();
// Step 1: Data Cleaning
DataTable dtClean = CleanData(dtFullData);
// Step 2: Data Transformation
DataTable transformedData = TransformData(dtClean);
// Step 3: Data Aggregation
DataTable dtAggregated = AggregateData(transformedData);
// Save or process the aggregated data
SaveAggregatedData(dtAggregated);
At first glance, this code seems straightforward and efficient. However, when you take a closer look, this approach has significant memory management issues, particularly when dealing with large datasets.
Memory Management Risks
- High Memory Consumption:
- In this unpartitioned version, the entire dataset is loaded into memory at once using a single DataTable (dtFullData). If the dataset is large, this can consume a substantial amount of memory. Each subsequent step (dtClean, transformedData, dtAggregated) adds additional memory overhead, potentially doubling or tripling the memory usage as data is processed in various stages.
- Potential for Out-of-Memory Exceptions:
- If the dataset is too large, the application might run out of memory, leading to an OutOfMemoryException. This not only crashes the application but also risks losing any data processed up to that point.
- Garbage Collection Delays:
- Even if memory is sufficient, the .NET garbage collector may not immediately reclaim memory used by DataTable objects after they go out of scope. Without explicit disposal, these large objects might linger in memory longer than necessary, increasing the chances of hitting memory limits.
- Performance Degradation:
- Large datasets held in memory can also degrade performance due to increased memory pressure, leading to more frequent garbage collection cycles, higher CPU usage, and slower application response times.
Why Partitioning is Essential
Partitioning the data into smaller, more manageable batches is essential to mitigate these risks. By processing the dataset in chunks, we limit the amount of data held in memory at any given time. This not only conserves memory but also reduces the likelihood of running into memory-related issues. Additionally, smaller data batches are easier to manage, process, and dispose of, leading to more efficient and stable applications.
Solution: A Partitioned and Garbage-Collected Data Pipeline
Main Data Pipeline Code
using Workspace.XBR.Xperiflow.Utilities.Database;
...
// Each loop iteration will return a batch of 1000 targets from the model forecast table
// Note: the usage of "using" directly inside the foreach loop
foreach (var dtBatch in DatabaseUtil.GetTableBatchByDistinctGroups("BiBlendWarehouse", "MyModelForecastTable", new List<string> { "TargetCol1", "TargetCol2" }, 1000))
{
    using (dtBatch)
    {
        // Step 1: Data Cleaning
        using DataTable dtClean = CleanData(dtBatch);

        // Step 2: Data Transformation
        using DataTable transformedData = TransformData(dtClean);

        // Step 3: Data Aggregation
        using DataTable dtAggregated = AggregateData(transformedData);

        // Save or process the aggregated data
        SaveAggregatedData(dtAggregated);
    }
}
Helper Code
// IEnumerable allows this method to act as a "generator" - meaning it can be consumed in a foreach loop,
// producing one batch at a time instead of building the full result up front.
public IEnumerable<DataTable> GetTableBatchByDistinctGroups(string connectionKey, string tableName, List<string> partitionColumns, int batchSize)
{
    // Assume this method returns the unique intersections of the partition columns
    List<string> uniqueIntersections = GetUniqueIntersections(connectionKey, tableName, partitionColumns);

    // Pull back the data for each batch of intersections and yield it so that it can be consumed in a foreach loop externally
    for (int i = 0; i < uniqueIntersections.Count; i += batchSize)
    {
        // Take the next batchSize unique intersections
        List<string> uniqueIntersectionBatch = uniqueIntersections.GetRange(i, Math.Min(batchSize, uniqueIntersections.Count - i));

        // Load a batch of data into a DataTable based on the current intersection batch
        using DataTable dtBatch = GetTableByUniqueIntersectionList(connectionKey, tableName, uniqueIntersectionBatch);
        yield return dtBatch;
    }
}
The above helper code uses a type of method called a “generator”. Please see the appendix for more detail.
There is a generator function in the Xperiflow Business Rules called DatabaseUtil.GetTableBatchByDistinctGroups, found in the namespace Workspace.XBR.Xperiflow.Utilities.Database, that is an actual working version of the helper code!
Memory Management with using Statements
The using statement in C# is used to ensure that objects that implement the IDisposable interface are disposed of as soon as they are no longer needed. This is crucial when working with resources like file handles, database connections, or, as in this case, large DataTable objects. Disposing of these objects promptly helps free up memory and other resources, preventing memory leaks and ensuring efficient resource management.
New using Statement Syntax
In the provided code, the using statements utilize the new syntax available in C# 8.0 and later versions. This syntax is more concise and does not require a closing block:
using DataTable dtClean = CleanData(dtBatch);
This line creates a DataTable object named dtClean that is automatically disposed of when it goes out of scope, which in this case is at the end of the foreach loop iteration. The same pattern is followed for transformedData and dtAggregated.
How Memory is Managed
- dtBatch: Created at the beginning of the loop iteration. It's a potentially large object holding a batch of data. Once this object is no longer needed (after cleaning, transforming, and aggregating the data), it is automatically disposed of at the end of the loop iteration.
- dtClean: Created immediately after dtBatch and used to hold the cleaned data.
- transformedData: Holds the transformed data produced from dtClean.
- dtAggregated: Contains the final aggregated data, which is saved or processed before disposal.
Because dtClean, transformedData, and dtAggregated are declared with using declarations, they are disposed of automatically, in reverse order of declaration, when the enclosing using (dtBatch) block ends at the close of each loop iteration.
By using the using statement for each DataTable, you ensure that each object is disposed of as soon as it's no longer needed, reducing memory usage and preventing potential memory leaks.
Check out the appendix for an explanation of old-school using statements.
Conclusion
In summary, building SensibleAI Forecast data pipelines in OneStream requires careful attention to both performance and memory management. By adhering to the rules and best practices outlined in this article, you can avoid common pitfalls such as Cross Joins, excessive memory consumption, and inefficient processing of large datasets.
- Avoid Cross Joins: Understanding and avoiding Cross Joins is crucial to preventing multiplicative data growth that can degrade performance and destabilize your environment. Always ensure your join conditions are explicitly defined, and leverage available tools like SensibleAI Studio to handle scenarios that might otherwise require a Cross Join.
- Partition Large Data Sets: Managing large datasets by partitioning them into smaller, more manageable batches is essential to maintaining a stable memory footprint and ensuring your pipelines run efficiently. This practice helps you stay within the limitations of your environment while optimizing performance.
- Effective DataTable Cleanup: Proper memory management through the use of using statements or other cleanup strategies is critical to preventing memory leaks and maintaining the reliability of your pipelines. By automatically disposing of DataTable objects when they are no longer needed, you reduce memory usage and minimize the risk of system crashes.
By implementing these guidelines, you can develop robust and efficient SensibleAI Forecast data pipelines that are capable of handling large data volumes while maintaining optimal performance. As your datasets grow and your processing needs evolve, regularly revisiting and refining these practices will help ensure that your pipelines remain scalable, stable, and efficient.
Appendix
Equivalent Code with Old-School using Statements
For those more familiar with the traditional way of writing using statements (prior to C# 8.0), here’s how the code would look:
using Workspace.XBR.Xperiflow.Utilities.Database;
...
foreach (var dtBatch in DatabaseUtil.GetTableBatchByDistinctGroups("BiBlendWarehouse", "MyModelForecastTable", new List<string> { "TargetCol1", "TargetCol2" }, 1000))
{
    using (dtBatch)
    {
        // Step 1: Data Cleaning
        using (DataTable dtClean = CleanData(dtBatch))
        {
            // Step 2: Data Transformation
            using (DataTable transformedData = TransformData(dtClean))
            {
                // Step 3: Data Aggregation
                using (DataTable dtAggregated = AggregateData(transformedData))
                {
                    // Save or process the aggregated data
                    SaveAggregatedData(dtAggregated);
                }
            }
        }
    }
}
Explanation of the Old Syntax
In this version, each using statement has a block scope { }, ensuring that the objects are disposed of at the end of their respective blocks. The nested structure of using statements makes it clear which resources are being used and when they are released.
- Advantages of the Old Syntax: It is explicit and makes it easy to see the scope of each resource.
- Disadvantages: It can be more verbose and harder to read, especially when dealing with multiple resources.
I recommend using the new-school using statements for DataTable objects.
What is a Generator Function?
- A generator function in C# is a special type of method that allows you to yield a sequence of values over time, rather than returning them all at once. This is achieved using the yield return statement within the function. When a generator function is called, it does not execute its code all at once. Instead, it returns an IEnumerable&lt;T&gt; object, which can be iterated over. Each iteration triggers the generator function to execute up to the next yield return statement, pausing until the next value is requested.
How Does a Generator Function Work?
When you call a generator function, it doesn’t run the code within it immediately. Instead, it returns an iterator that can be used in a foreach loop or any other context where an IEnumerable is expected. As you iterate, the function runs until it hits the yield return statement, which sends a value back to the caller and pauses execution. When the next value is requested, the function resumes execution from where it left off, continuing this process until the function completes or another yield return is encountered.
Here’s a simple example to illustrate:
public IEnumerable<int> GetNumbers()
{
    for (int i = 1; i <= 5; i++)
    {
        yield return i;
    }
}
In this example, calling GetNumbers() doesn’t immediately return a list of numbers. Instead, it returns an IEnumerable<int>. Each time you iterate over this IEnumerable, the generator produces the next number in sequence.
Why Are Generator Functions Useful?
- Memory Efficiency: One of the primary advantages of using a generator function is memory efficiency. Instead of creating a large collection of items in memory all at once, a generator yields one item at a time. This is particularly useful when dealing with large datasets or when processing data in batches, as it helps to keep the memory footprint low.
- Lazy Evaluation: Generator functions perform "lazy evaluation," meaning they only compute the next value when it is needed. This is beneficial when working with potentially large sequences where not all items are required at once. For example, if you only need the first 10 items from a sequence of a million, the generator will only compute those 10 items, saving time and resources.
- Batch Processing: In the context of data pipelines, generator functions are incredibly useful for batch processing large datasets. Instead of loading an entire dataset into memory (which can lead to high memory consumption and potential crashes), you can use a generator function to load and process data in smaller, more manageable chunks. This approach aligns well with the best practices for partitioning data pipelines, ensuring that your pipeline remains efficient and scalable.
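As a small sketch of the lazy-evaluation point (this reuses the GetNumbers idea from above with a hypothetical count of one million):
using System;
using System.Collections.Generic;
using System.Linq;

// A generator over a (hypothetically) very large range
static IEnumerable<int> GetNumbers(int count)
{
    for (int i = 1; i <= count; i++)
    {
        yield return i;   // each value is produced only when the caller asks for it
    }
}

// Lazy evaluation: although the sequence could yield 1,000,000 values,
// only the first 10 are ever computed here.
foreach (int n in GetNumbers(1_000_000).Take(10))
{
    Console.WriteLine(n);
}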
Example: Using a Generator Function for Batch Processing
Consider the earlier example in the article where we are working with large datasets in a SensibleAI Forecast data pipeline. The helper code provided uses a generator function to partition the dataset and yield batches of data:
public IEnumerable<DataTable> GetPartitionedTable(string connectionKey, string tableName, List<string> partitionColumns, int batchSize)
{
    // Assume this method returns the unique intersections of the partition columns
    List<string> uniqueIntersections = GetUniqueIntersections(connectionKey, tableName, partitionColumns);

    for (int i = 0; i < uniqueIntersections.Count; i += batchSize)
    {
        // Take the next batchSize unique intersections
        List<string> uniqueIntersectionBatch = uniqueIntersections.GetRange(i, Math.Min(batchSize, uniqueIntersections.Count - i));

        // Load a batch of data into a DataTable based on the current intersection batch
        using DataTable dtBatch = GetTableByUniqueIntersectionList(connectionKey, tableName, uniqueIntersectionBatch);
        yield return dtBatch;
    }
}
In this example, GetPartitionedTable is a generator function that yields DataTable objects containing batches of data. By using a generator, the data pipeline can process one batch at a time, reducing memory usage and allowing for more efficient handling of large datasets.
As you iterate over the generator in a foreach loop, each DataTable is loaded, processed, and then disposed of before moving on to the next batch. This ensures that memory is used effectively, and the application can handle large volumes of data without overwhelming the system.
Conclusion
Generator functions are a powerful tool in C# for handling sequences of data efficiently, especially in scenarios involving large datasets or batch processing. By yielding one item at a time, they help conserve memory and allow for lazy evaluation, making your data pipelines more scalable and performant. In the context of SensibleAI Forecast data pipelines, generator functions are particularly valuable for partitioning and processing data in a way that optimizes both performance and resource usage.