Aggregator

Basic Information

Class Name: Aggregator

Title: Aggregate Data

Version: 1.0.0

Author: Jeff Robinson

Organization: OneStream

Creation Date: 2024-02-27

Default Routine Memory Capacity: 2 GB

Tags

Data Transformation, Data Resampling, Time Series, Interpretability, Data Preprocessing, Data Analysis, Dimensionality Reduction, Information Retrieval

Description

Short Description

Aggregate data by specified columns and aggregation method

Long Description

Aggregate selected columns by a specified aggregation type, for time series data as well as other datasets, using this powerful, streamlined routine. Supported aggregation methods include sum, mean, min, and max, and multiple columns may be specified for aggregation. See the polars documentation for additional supported aggregation methods.
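
The routine's aggregation semantics follow polars. As a minimal sketch (not the routine's internal implementation), the following assumes an illustrative polars DataFrame and shows the supported methods applied within groups:

    import polars as pl

    # Illustrative transaction data; column names are assumptions for this sketch.
    df = pl.DataFrame({
        "region": ["East", "East", "West", "West"],
        "sales": [100.0, 150.0, 200.0, 250.0],
    })

    # Group by one or more columns, then apply any of the supported
    # aggregation methods (sum, mean, min, max) to the selected column(s).
    result = df.group_by("region").agg(
        pl.col("sales").sum().alias("sales_sum"),
        pl.col("sales").mean().alias("sales_mean"),
        pl.col("sales").min().alias("sales_min"),
        pl.col("sales").max().alias("sales_max"),
    )

Multiple aggregations over multiple columns compose the same way: each selected column pairs with an aggregation method, mirroring the key-value pairs the routine accepts.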

Use Cases

1. Aggregated Dimension Insights

A national retail chain seeks to enhance its understanding of store performance across various regions. The chain operates hundreds of stores, each with thousands of transactions every day, generating vast amounts of data. Grouping data allows for the aggregation of information to produce summary statistics, such as totals, averages, and counts. Summing a column within these groups provides insight into the total values of specific categories, which is essential for understanding the scale or impact of different segments within the dataset. This requires periodic data aggregation to feed visuals and generate insights.

The AggregatorRoutine provides a powerful, flexible option for creating instances that can be rerun, recalled, and modified, significantly reducing the effort required for this task. This methodology supports the chain in pinpointing areas of high performance and identifying underperforming stores, enabling targeted strategies for improvement.

Advanced analytics tools, including machine learning models, can be applied to the aggregated data to forecast future trends and behaviors. These predictive insights allow the company to tailor its inventory and marketing efforts more effectively, enhancing customer satisfaction and driving sales growth. Additionally, integrating customer feedback and social media analytics into the aggregation process can offer a more nuanced understanding of consumer preferences and perceptions, further informing business strategy. By leveraging the AggregatorRoutine in conjunction with these analytical techniques, the retail chain streamlines its data management processes and fosters a culture of continuous improvement, supporting its long-term success in a competitive market.

2. Time Series Trend Analysis

A national retail chain aims to identify and analyze trends in consumer behavior and sales performance over time across its various regions. With operations spanning hundreds of stores and managing thousands of transactions daily, the company generates a considerable volume of data, ripe for detailed trend analysis. By employing data grouping techniques, the chain can analyze trends within specific intervals (daily, monthly, quarterly, or annually), providing a temporal dimension to the data that reveals how sales figures and consumer behavior evolve over time.

The AggregatorRoutine provides a powerful, flexible option for creating instances that can be rerun, recalled, and modified, significantly reducing the effort required for this task. This approach enables the company to adjust quickly to market demands, optimize inventory levels, and tailor marketing strategies to consumer preferences in real time.

Furthermore, by leveraging advanced analytics and machine learning algorithms, the chain can predict future trends, identifying potential growth opportunities or areas requiring intervention. This proactive stance ensures competitiveness in a rapidly changing retail environment, fostering innovation and customer satisfaction. The integration of geographical information systems (GIS) allows for the visualization of sales data across different regions, highlighting areas of high performance and those needing improvement. Through these sophisticated analytical methods, the company not only stays ahead in the industry but also contributes to a dynamic, customer-centric shopping experience.
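
The interval-based analysis described above amounts to time-window grouping. As a hedged sketch, polars' group_by_dynamic can resample a daily series to monthly totals; the data and column names here are illustrative, and the routine's actual interval handling may differ:

    import polars as pl
    from datetime import date

    # Hypothetical daily sales series for Q1 2024 (91 days).
    df = pl.DataFrame({
        "date": pl.date_range(
            date(2024, 1, 1), date(2024, 3, 31), interval="1d", eager=True
        ),
        "sales": [float(x) for x in range(91)],
    })

    # Windowed grouping: "1mo" could instead be "1d", "1q", or "1y".
    monthly = df.sort("date").group_by_dynamic("date", every="1mo").agg(
        pl.col("sales").sum().alias("monthly_sales")
    )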

Routine Methods

1. Aggregate (Method)
  • Method: aggregate
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: Yes

    • Method Limits: This method performs well across a range of dataset sizes, completing efficiently even on large datasets. Testing with 100 GB of memory shows completion times under 10 minutes for datasets ranging from 2,000 to 25,000 targets and up to 7.4 million rows.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Run an aggregation routine that groups by user-specified fields and applies the chosen aggregation type.
    • Detailed Description:

      • Multiple aggregation fields and types can be provided as a list of key-value pairs; a hypothetical input sketch appears after this method listing.
    • Inputs:

      • Required Input
        • Source Connection: The connection information for the source data.
          • Name: data_connection
          • Tooltip:
            • Detail:
              • Click the drop-down box to specify your dataset source.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name you are looking for in this list, it is recommended that you first move the data into a database that appears in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Columns to Group: Specify the column(s) that you want to be grouped.
          • Name: grouped_columns
          • Tooltip:
            • Detail:
              • Click the drop-down box or the search bar to specify your column(s).
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Aggregation Step Input: Continue: continue running the routine; Add: aggregate another column; Previous: modify your previous input.
          • Name: column_agg_params
          • Tooltip:
            • Detail:
              • Click on the drop down box to specify your next step.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[AggregationColumnDefinition]
    • Artifacts:

      • Aggregate Data: Aggregated data based on user input.
        • Qualified Key Annotation: aggregated_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@aggregated_data/data_/data_<int>.parquet
            • A partitioned set of parquet files, where each file will have no more than 1,000,000 rows.
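
Because the artifact is emitted as partitioned parquet files, downstream consumers can scan every partition as one logical table. A minimal polars sketch, assuming only the annotated path pattern above:

    import polars as pl

    # Glob over the partitioned artifact files annotated above.
    artifact_glob = "artifacts_/@aggregated_data/data_/data_*.parquet"

    # Lazily scan all partitions as a single logical table, then materialize.
    aggregated = pl.scan_parquet(artifact_glob).collect()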
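
The exact invocation is platform-specific, but a hypothetical set of inputs for the aggregate method might look like the following. The input names match those documented above; the surrounding structure and the AggregationColumnDefinition keys ("column" and "aggregation") are assumptions for illustration, not a confirmed API:

    # Hypothetical input payload; only the input names are taken from the docs above.
    inputs = {
        "data_connection": {
            "tabular_connection": {                        # SQL Server Connection variant
                "database_resource": "analytics_server",   # illustrative value
                "database_name": "retail_dw",              # illustrative value
                "table_name": "transactions",              # illustrative value
            }
        },
        "grouped_columns": ["region", "store_id"],
        # One key-value pair per aggregated column (assumed field names).
        "column_agg_params": [
            {"column": "sales", "aggregation": "sum"},
            {"column": "sales", "aggregation": "mean"},
        ],
    }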

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: Aggregator

Method Name      Artifact Keys
aggregate        aggregated_data
