SensibleAI Forecast: Data Pipeline Guidelines
The purpose of this article is to provide rules, recommendations, and guidelines for building SensibleAI Forecast ingress and egress data pipelines.
As companies look to bring larger target and data volumes into OneStream to train SensibleAI Forecast models, there must be an effective and safe way to process those large data volumes. As a data pipeline developer, being cautious and following these best practices is incredibly important, given that it is possible to tip over OneStream environments without proper care.
Rule #1: Avoid SQL and LINQ Cross Joins
Understanding Cross Joins
A Cross Join in SQL or C# (via LINQ) is a type of join that returns the Cartesian product of two tables. This means that every row from the first table is combined with every row from the second table. If the first table has m rows and the second table has n rows, the resulting dataset will have m * n rows.
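To make that row-count math concrete, here is a minimal, hypothetical LINQ sketch (plain integer sequences stand in for real tables) showing how quickly the Cartesian product grows:
using System;
using System.Linq;

// Hypothetical stand-ins for two real tables (1,000 rows each)
var tableA = Enumerable.Range(1, 1000);
var tableB = Enumerable.Range(1, 1000);

// Two "from" clauses with no join condition produce the Cartesian product
var cartesian = from a in tableA
                from b in tableB
                select (a, b);

Console.WriteLine(cartesian.Count());   // 1,000,000 rows (m * n)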
Why Cross Joins Should Be Avoided
- Performance Issues:
- Multiplicative Data Growth: Since Cross Joins combine every row from the first table with every row from the second table, the result can quickly become unmanageably large. For instance, joining two tables with 1,000 rows each results in a dataset of 1,000,000 rows. This can cause significant performance degradation, leading to excessive memory consumption, long processing times, and even system crashes.
- Impact on OneStream: In the context of SensibleAI Forecast data pipelines, such large datasets can overwhelm the OneStream environment, leading to potential system instability or failure. The cascading effect of a poorly performing query can tip over the entire environment, impacting not just the data pipeline but other processes relying on the system.
- Unintentional Data Explosion:
- Complexity and Risk: Cross Joins are often created unintentionally, especially when a developer forgets to include a proper join condition between tables. This can lead to an unexpected data explosion, where the resulting dataset is orders of magnitude larger than intended. Such mistakes can be difficult to trace and rectify, making it a risky operation.
- Troubleshooting Difficulty: Diagnosing the root cause of performance issues related to Cross Joins can be challenging. The sheer volume of data produced by a Cross Join can obscure other issues in the data pipeline, leading to long and costly debugging sessions.
Best Practices to Avoid Cross Joins
- Explicit Join Conditions: Always ensure that join conditions are explicitly stated to prevent accidental Cross Joins. This includes checking for any missing or incorrect conditions in your SQL queries or LINQ expressions (see the sketch after this list).
- Use SensibleAI Studio: Very often, a Cross Join is only wanted as a precursor to another data transformation (like a numeric data fill). Check whether the AI Studio Data Transformation routines can solve your use case directly. For example, the Numeric Data Fill routine in AI Studio removes the need for a Cross Join when filling missing records in a time series dataset. Additionally, AI Studio comes with sizeable memory capacity, allowing data pipeline developers to process more data than they could purely on OneStream C# data management servers.
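As a sketch of the first point (the actuals/drivers data below is hypothetical, not a real SensibleAI Forecast table), compare a LINQ query that accidentally omits the join condition with one that states it explicitly:
using System;
using System.Linq;

// Hypothetical in-memory rows standing in for an actuals table and a driver table
var actuals = new[]
{
    new { TargetName = "T1", Period = "2024-01", Amount = 100m },
    new { TargetName = "T2", Period = "2024-01", Amount = 200m },
};
var drivers = new[]
{
    new { TargetName = "T1", Period = "2024-01", DriverValue = 1.5m },
    new { TargetName = "T2", Period = "2024-01", DriverValue = 2.5m },
};

// Accidental cross join: no join condition, so every actual pairs with every driver (4 rows)
var accidental = (from a in actuals
                  from d in drivers
                  select new { a.TargetName, d.DriverValue }).ToList();

// Explicit join condition: only matching TargetName/Period rows are combined (2 rows)
var joined = (from a in actuals
              join d in drivers
                  on new { a.TargetName, a.Period } equals new { d.TargetName, d.Period }
              select new { a.TargetName, a.Period, a.Amount, d.DriverValue }).ToList();

Console.WriteLine($"Accidental cross join: {accidental.Count} rows, explicit join: {joined.Count} rows");
The omitted join condition is easy to miss in code review, which is why naming the join keys explicitly is worth the extra lines.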
Rule #2: Partition Data Pipeline Inputs Over a Certain Size
Understanding the Need for Partitioning
When dealing with large datasets in SensibleAI Forecast data pipelines, it is crucial to manage the size of the data being processed to avoid performance bottlenecks and system limitations. One of the key challenges arises when working with large tables, such as a model forecast table containing 30 columns and 20 million records. Such a dataset would take up approximately 10 GB of memory when loaded into a C# DataTable.
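That 10 GB figure can be sanity-checked with a rough back-of-the-envelope calculation; the per-cell cost used below is an assumption for illustration, not a measured value:
using System;

// Rough estimate only; actual DataTable memory usage depends on column data types,
// string lengths, and internal storage overhead.
long rows = 20_000_000;
long columns = 30;
long assumedBytesPerCell = 17;   // assumption: boxed value plus DataTable bookkeeping

double estimatedGb = rows * columns * assumedBytesPerCell / 1_000_000_000d;
Console.WriteLine($"Estimated size: ~{estimatedGb:F1} GB");   // ~10.2 GB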
Limitations and Challenges
- Memory Constraints in OneStream Environments:
- Data Management Servers, where these data pipelines typically run, often have memory constraints. In a standard OneStream environment, these servers generally have 32 GB of memory. Loading large datasets into memory, especially those close to or exceeding 10 GB, can exhaust available resources, leading to slowdowns or failures.
Best Practices for Partitioning
- Maintain Manageable Data Volumes:
- As a best practice, keep data volumes in a C# DataTable under 10 million rows at any given time within your data pipeline. This ensures that you remain well within both the C# DataTable limits and the available memory on the server, preventing potential system issues.
- Only One 10 Million Row DataTable at a Time:
- To further safeguard against memory constraints and performance bottlenecks, limit the number of large DataTable objects in memory at any given time. Specifically, ensure that your pipeline never processes more than one DataTable containing up to 10 million rows simultaneously. Attempting to handle multiple large DataTable objects concurrently can lead to excessive memory consumption and severe performance issues, potentially causing the entire pipeline to fail. (See Rule 3 for more detail.)
- Implementation Strategy: When working with multiple large datasets, process each one sequentially. Load one dataset into a DataTable, perform the necessary operations, and then clear or dispose of the DataTable before loading the next dataset (use the using statement declaration). This strategy helps maintain a stable memory footprint and ensures that your data pipeline remains performant and reliable. (See Rule 3 for more detail.)
- Batch Processing for Large Datasets:
- For datasets that exceed 10 million rows, partition the data into smaller, more manageable batches. This can be done by working with subsets of targets or records at a time, and then using a loop to process each batch sequentially. This approach minimizes memory usage and reduces the risk of hitting row count limitations. See the appendix below for a data pipeline example that is batched.
- Optimize Pipeline Performance:
- Implementing a batching strategy not only helps in managing memory but also optimizes overall pipeline performance. By processing smaller chunks of data, you reduce the likelihood of performance bottlenecks, allowing your data pipeline to run more efficiently.
- Test and Monitor:
- Regularly test your data pipelines with varying data sizes to ensure they perform well under different conditions. Monitoring memory usage and performance metrics during these tests can provide valuable insights, helping you fine-tune your partitioning strategy (a minimal monitoring sketch follows this list).
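For the monitoring point above, here is a minimal sketch (the checkpoint labels are hypothetical and this is not part of any OneStream API) that can be dropped into a test run to watch memory at batch boundaries:
using System;
using System.Diagnostics;

// Logs the managed heap size and the process working set at a named checkpoint,
// e.g., before and after each batch while testing a pipeline.
static void LogMemory(string checkpoint)
{
    long managedBytes = GC.GetTotalMemory(forceFullCollection: false);
    long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;
    Console.WriteLine($"{checkpoint}: managed ~{managedBytes / 1_000_000} MB, working set ~{workingSetBytes / 1_000_000} MB");
}

LogMemory("Before batch");
// ... process a batch here ...
LogMemory("After batch");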
By following these partitioning guidelines, you can effectively manage large data volumes within your SensibleAI Forecast data pipelines, ensuring smooth operation even with substantial datasets.
Rule of Thumb: Use partitioning in data pipelines if working with target intersection volumes greater than 5,000 targets. Doing so will ensure your data pipeline scales to future target volumes without concern of memory issues.
Elite Practice: Build your data pipelines with partitioning from the start. You can then control scalability by dialing the partitioning batch size up or down as needed.
Rule #3: Clean up your DataTables When Done Using Them
The Importance of Cleaning Up DataTable Objects
In any data pipeline, particularly those involving large datasets as in SensibleAI Forecast, effective memory management is crucial. One key aspect of this management is ensuring that DataTable objects are properly cleaned up after their use. Failing to do so can lead to memory leaks, which in turn can cause performance degradation, crashes, or even bring down an entire system.
Best Practice: Use using Statements for Automatic Cleanup
The recommended best practice for cleaning up DataTable objects is to encapsulate their usage within a using statement. The using statement automatically handles the disposal of the DataTable when it goes out of scope, ensuring that memory is freed immediately, even if an exception occurs.
Example:
using (DataTable dataTable = new DataTable())
{
    // Perform operations on the dataTable
} // The dataTable is automatically disposed of here
- Why Use using? The using statement simplifies memory management by ensuring that the DataTable is disposed of as soon as it is no longer needed. This prevents memory leaks and keeps the data pipeline running efficiently. Additionally, it reduces the risk of human error, as you don’t have to remember to manually dispose of the object.
Additional Tips for Effective Cleanup
- Avoid Unnecessary Retention:
- Don’t keep DataTable objects in memory longer than necessary. Ensure they are disposed of or go out of scope as soon as their purpose is fulfilled.
- Combine with Other Cleanup Strategies:
- If your pipeline involves complex objects or multiple DataTable instances, consider combining using statements with other cleanup strategies, such as clearing the DataTable before disposal to further reduce memory usage.
In summary, the most effective way to clean up DataTable objects is by using using statements. This method ensures automatic disposal and reduces the risk of memory leaks, leading to more stable and efficient data pipelines. For existing pipelines that may not have been built with using statements in mind, the manual disposal method offers a straightforward way to retrofit cleanup mechanisms. Regardless of the approach, consistently cleaning up your DataTable objects is essential to maintaining the performance and reliability of your SensibleAI Forecast data pipelines.
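For the retrofit case mentioned above, here is a minimal sketch of the manual disposal approach (it reuses the hypothetical GetModelForecastTable helper that appears in the examples later in this article), combined with clearing the table before disposal:
using System.Data;

DataTable dtResults = null;
try
{
    dtResults = GetModelForecastTable();   // hypothetical load step
    // ... perform operations on dtResults ...
}
finally
{
    // Cleanup runs even if an exception is thrown mid-pipeline
    if (dtResults != null)
    {
        dtResults.Clear();     // release row data before disposal to reduce memory pressure
        dtResults.Dispose();   // free the DataTable itself
    }
}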
Putting it Together
In this section, we will break down the provided C# code and discuss how memory management works, why using statements help manage memory, and how we can properly partition a data pipeline.
Bad: The Unpartitioned Data Pipeline
Imagine a scenario where we attempt to process the entire dataset at once, without partitioning it into smaller batches. The code might look something like this:
// Load the entire dataset into a single DataTable
DataTable dtFullData = GetModelForecastTable();
// Step 1: Data Cleaning
DataTable dtClean = CleanData(dtFullData);
// Step 2: Data Transformation
DataTable transformedData = TransformData(dtClean);
// Step 3: Data Aggregation
DataTable dtAggregated = AggregateData(transformedData);
// Save or process the aggregated data
SaveAggregatedData(dtAggregated);
At first glance, this code seems straightforward and efficient. However, when you take a closer look, this approach has significant memory management issues, particularly when dealing with large datasets.
Memory Management Risks
- High Memory Consumption:
- In this unpartitioned version, the entire dataset is loaded into memory at once using a single DataTable (dtFullData). If the dataset is large, this can consume a substantial amount of memory. Each subsequent step (dtClean, transformedData, dtAggregated) adds additional memory overhead, potentially doubling or tripling the memory usage as data is processed in various stages.
- Potential for Out-of-Memory Exceptions:
- If the dataset is too large, the application might run out of memory, leading to an OutOfMemoryException. This not only crashes the application but also risks losing any data processed up to that point.
- Garbage Collection Delays:
- Even if memory is sufficient, the .NET garbage collector may not immediately reclaim memory used by DataTable objects after they go out of scope. Without explicit disposal, these large objects might linger in memory longer than necessary, increasing the chances of hitting memory limits.
- Performance Degradation:
- Large datasets held in memory can also degrade performance due to increased memory pressure, leading to more frequent garbage collection cycles, higher CPU usage, and slower application response times.
Why Partitioning is Essential
Partitioning the data into smaller, more manageable batches is essential to mitigate these risks. By processing the dataset in chunks, we limit the amount of data held in memory at any given time. This not only conserves memory but also reduces the likelihood of running into memory-related issues. Additionally, smaller data batches are easier to manage, process, and dispose of, leading to more efficient and stable applications.
Solution: A Partitioned and Garbage-Collected Data Pipeline
Main Data Pipeline Code
using Workspace.XBR.Xperiflow.Utilities.Database;
...
// Each loop iteration will return a batch of 1000 targets from the model forecast table
// Note: the usage of "using" directly inside the foreach loop
foreach (var dtBatch in DatabaseUtil.GetTableBatchByDistinctGroups("BiBlendWarehouse", "MyModelForecastTable", new List<string> { "TargetCol1", "TargetCol2" }, 1000))
{
    using (dtBatch)
    {
        // Step 1: Data Cleaning
        using DataTable dtClean = CleanData(dtBatch);

        // Step 2: Data Transformation
        using DataTable transformedData = TransformData(dtClean);

        // Step 3: Data Aggregation
        using DataTable dtAggregated = AggregateData(transformedData);

        // Save or process the aggregated data
        SaveAggregatedData(dtAggregated);
    }
}
Helper Code
// IEnumerable allows this method to act as a "generator" - meaning it can be consumed in a foreach loop,
// producing one batch at a time instead of building the full result up front.
public IEnumerable<DataTable> GetTableBatchByDistinctGroups(string connectionKey, string tableName, List<string> partitionColumns, int batchSize)
{
    // Assume this method returns the unique intersections of the partition columns
    List<string> uniqueIntersections = GetUniqueIntersections(connectionKey, tableName, partitionColumns);

    // Pull back the data for each batch of intersections and yield it so that it can be consumed in a foreach loop externally
    for (int i = 0; i < uniqueIntersections.Count; i += batchSize)
    {
        // Take the next batchSize unique intersections
        List<string> uniqueIntersectionBatch = uniqueIntersections.GetRange(i, Math.Min(batchSize, uniqueIntersections.Count - i));

        // Load a batch of data into a DataTable based on the current intersection batch
        using DataTable dtBatch = GetTableByUniqueIntersectionList(connectionKey, tableName, uniqueIntersectionBatch);
        yield return dtBatch;
    }
}
The above helper code uses a type of method called a “generator”. Please see the appendix for more detail.
There is a generator function in the Xperiflow Business Rules called DatabaseUtil.GetTableBatchByDistinctGroups, found in the namespace Workspace.XBR.Xperiflow.Utilities.Database, that is an actual working version of the helper code!
Memory Management with using Statements
The using statement in C# is used to ensure that objects that implement the IDisposable interface are disposed of as soon as they are no longer needed. This is crucial when working with resources like file handles, database connections, or, as in this case, large DataTable objects. Disposing of these objects promptly helps free up memory and other resources, preventing memory leaks and ensuring efficient resource management.
New using Statement Syntax
In the provided code, the using statements utilize the new syntax available in C# 8.0 and later versions. This syntax is more concise and does not require a closing block:
using DataTable dtClean = CleanData(dtBatch);
This line creates a DataTable object named dtClean that is automatically disposed of when it goes out of scope, which in this case is at the end of the foreach loop iteration. The same pattern is followed for transformedData and dtAggregated.
How Memory is Managed
- dtBatch: Created at the beginning of the loop iteration. It's a potentially large object holding a batch of data. Once this object is no longer needed (after cleaning, transforming, and aggregating the data), it is automatically disposed of at the end of the loop iteration.
- dtClean: Created immediately after dtBatch and used to hold the cleaned data.
- transformedData: Holds the transformed data produced from dtClean.
- dtAggregated: Contains the final aggregated data, which is saved or processed before disposal.
Because dtClean, transformedData, and dtAggregated are declared with using declarations, they are disposed of automatically, in reverse order of declaration, when the enclosing using (dtBatch) block ends at the close of each loop iteration.
By using the using statement for each DataTable, you ensure that each object is disposed of as soon as it's no longer needed, reducing memory usage and preventing potential memory leaks.
Check out the appendix for an explanation of old-school using statements.
Conclusion
In summary, building SensibleAI Forecast data pipelines in OneStream requires careful attention to both performance and memory management. By adhering to the rules and best practices outlined in this article, you can avoid common pitfalls such as Cross Joins, excessive memory consumption, and inefficient processing of large datasets.
- Avoid Cross Joins: Understanding and avoiding Cross Joins is crucial to preventing multiplicative data growth that can degrade performance and destabilize your environment. Always ensure your join conditions are explicitly defined, and leverage available tools like SensibleAI Studio to handle scenarios that might otherwise require a Cross Join.
- Partition Large Data Sets: Managing large datasets by partitioning them into smaller, more manageable batches is essential to maintaining a stable memory footprint and ensuring your pipelines run efficiently. This practice helps you stay within the limitations of your environment while optimizing performance.
- Effective DataTable Cleanup: Proper memory management through the use of using statements or other cleanup strategies is critical to preventing memory leaks and maintaining the reliability of your pipelines. By automatically disposing of DataTable objects when they are no longer needed, you reduce memory usage and minimize the risk of system crashes.
By implementing these guidelines, you can develop robust and efficient SensibleAI Forecast data pipelines that are capable of handling large data volumes while maintaining optimal performance. As your datasets grow and your processing needs evolve, regularly revisiting and refining these practices will help ensure that your pipelines remain scalable, stable, and efficient.
Appendix
Equivalent Code with Old-School using Statements
For those more familiar with the traditional way of writing using statements (prior to C# 8.0), here’s how the code would look:
using Workspace.XBR.Xperiflow.Utilities.Database;
...
foreach (var dtBatch in DatabaseUtil.GetTableBatchByDistinctGroups("BiBlendWarehouse", "MyModelForecastTable", new List<string> { "TargetCol1", "TargetCol2" }, 1000))
{
    using (dtBatch)
    {
        // Step 1: Data Cleaning
        using (DataTable dtClean = CleanData(dtBatch))
        {
            // Step 2: Data Transformation
            using (DataTable transformedData = TransformData(dtClean))
            {
                // Step 3: Data Aggregation
                using (DataTable dtAggregated = AggregateData(transformedData))
                {
                    // Save or process the aggregated data
                    SaveAggregatedData(dtAggregated);
                }
            }
        }
    }
}
Explanation of the Old Syntax
In this version, each using statement has a block scope { }, ensuring that the objects are disposed of at the end of their respective blocks. The nested structure of using statements makes it clear which resources are being used and when they are released.
- Advantages of the Old Syntax: It is explicit and makes it easy to see the scope of each resource.
- Disadvantages: It can be more verbose and harder to read, especially when dealing with multiple resources.
I recommend using the new-school using statements for DataTable objects.
What is a Generator Function?
- A generator function in C# is a special type of method that allows you to yield a sequence of values over time, rather than returning them all at once. This is achieved using the yield return statement within the function. When a generator function is called, it does not execute its code all at once. Instead, it returns an IEnumerable&lt;T&gt; object, which can be iterated over. Each iteration triggers the generator function to execute up to the next yield return statement, pausing until the next value is requested.
How Does a Generator Function Work?
When you call a generator function, it doesn’t run the code within it immediately. Instead, it returns an iterator that can be used in a foreach loop or any other context where an IEnumerable is expected. As you iterate, the function runs until it hits the yield return statement, which sends a value back to the caller and pauses execution. When the next value is requested, the function resumes execution from where it left off, continuing this process until the function completes or another yield return is encountered.
Here’s a simple example to illustrate:
public IEnumerable<int> GetNumbers()
{
    for (int i = 1; i <= 5; i++)
    {
        yield return i;
    }
}
In this example, calling GetNumbers() doesn’t immediately return a list of numbers. Instead, it returns an IEnumerable<int>. Each time you iterate over this IEnumerable, the generator produces the next number in sequence.
Why Are Generator Functions Useful?
- Memory Efficiency: One of the primary advantages of using a generator function is memory efficiency. Instead of creating a large collection of items in memory all at once, a generator yields one item at a time. This is particularly useful when dealing with large datasets or when processing data in batches, as it helps to keep the memory footprint low.
- Lazy Evaluation: Generator functions perform "lazy evaluation," meaning they only compute the next value when it is needed. This is beneficial when working with potentially large sequences where not all items are required at once. For example, if you only need the first 10 items from a sequence of a million, the generator will only compute those 10 items, saving time and resources.
- Batch Processing: In the context of data pipelines, generator functions are incredibly useful for batch processing large datasets. Instead of loading an entire dataset into memory (which can lead to high memory consumption and potential crashes), you can use a generator function to load and process data in smaller, more manageable chunks. This approach aligns well with the best practices for partitioning data pipelines, ensuring that your pipeline remains efficient and scalable.
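As a small sketch of the lazy-evaluation point (this reuses the GetNumbers idea from above with a hypothetical count of one million):
using System;
using System.Collections.Generic;
using System.Linq;

// A generator over a (hypothetically) very large range
static IEnumerable<int> GetNumbers(int count)
{
    for (int i = 1; i <= count; i++)
    {
        yield return i;   // each value is produced only when the caller asks for it
    }
}

// Lazy evaluation: although the sequence could yield 1,000,000 values,
// only the first 10 are ever computed here.
foreach (int n in GetNumbers(1_000_000).Take(10))
{
    Console.WriteLine(n);
}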
Example: Using a Generator Function for Batch Processing
Consider the earlier example in the article where we are working with large datasets in a SensibleAI Forecast data pipeline. The helper code provided uses a generator function to partition the dataset and yield batches of data:
public IEnumerable<DataTable> GetPartitionedTable(string connectionKey, string tableName, List<string> partitionColumns, int batchSize)
{
    // Assume this method returns the unique intersections of the partition columns
    List<string> uniqueIntersections = GetUniqueIntersections(connectionKey, tableName, partitionColumns);

    for (int i = 0; i < uniqueIntersections.Count; i += batchSize)
    {
        // Take the next batchSize unique intersections
        List<string> uniqueIntersectionBatch = uniqueIntersections.GetRange(i, Math.Min(batchSize, uniqueIntersections.Count - i));

        // Load a batch of data into a DataTable based on the current intersection batch
        using DataTable dtBatch = GetTableByUniqueIntersectionList(connectionKey, tableName, uniqueIntersectionBatch);
        yield return dtBatch;
    }
}
In this example, GetPartitionedTable is a generator function that yields DataTable objects containing batches of data. By using a generator, the data pipeline can process one batch at a time, reducing memory usage and allowing for more efficient handling of large datasets.
As you iterate over the generator in a foreach loop, each DataTable is loaded, processed, and then disposed of before moving on to the next batch. This ensures that memory is used effectively, and the application can handle large volumes of data without overwhelming the system.
Conclusion
Generator functions are a powerful tool in C# for handling sequences of data efficiently, especially in scenarios involving large datasets or batch processing. By yielding one item at a time, they help conserve memory and allow for lazy evaluation, making your data pipelines more scalable and performant. In the context of SensibleAI Forecast data pipelines, generator functions are particularly valuable for partitioning and processing data in a way that optimizes both performance and resource usage.