Aggregator

Basic Information

Class Name: Aggregator

Title: Aggregate Data

Version: 1.0.0

Author: Jeff Robinson

Organization: OneStream

Creation Date: 2024-02-27

Default Routine Memory Capacity: 2 GB

Tags

Data Transformation, Data Resampling, Time Series, Interpretability, Data Preprocessing, Data Analysis, Dimensionality Reduction, Information Retrieval

Description

Short Description

Aggregate data by specified columns and aggregation method

Long Description

Aggregate selected columns by a specified aggregation type, for time series data as well as other datasets, using this powerful, streamlined routine. Supported aggregation methods include sum, mean, min, and max, and multiple columns may be specified for aggregation. See the polars documentation for additional supported aggregation methods.
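
The routine's aggregation semantics follow polars. As a minimal sketch (not the routine's internal implementation), the following assumes an illustrative polars DataFrame and shows the supported methods applied within groups:

    import polars as pl

    # Illustrative transaction data; column names are assumptions for this sketch.
    df = pl.DataFrame({
        "region": ["East", "East", "West", "West"],
        "sales": [100.0, 150.0, 200.0, 250.0],
    })

    # Group by one or more columns, then apply any of the supported
    # aggregation methods (sum, mean, min, max) to the selected column(s).
    result = df.group_by("region").agg(
        pl.col("sales").sum().alias("sales_sum"),
        pl.col("sales").mean().alias("sales_mean"),
        pl.col("sales").min().alias("sales_min"),
        pl.col("sales").max().alias("sales_max"),
    )

Multiple aggregations over multiple columns compose the same way: each selected column pairs with an aggregation method, mirroring the key-value pairs the routine accepts.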

Use Cases

1. Aggregated Dimension Insights

A national retail chain seeks to enhance its understanding of store performance across various regions. The chain operates hundreds of stores, each with thousands of transactions every day, generating vast amounts of data. Grouping data allows for the aggregation of information to produce summary statistics, such as totals, averages, and counts. Summing a column within these groups provides insight into the total values of specific categories, which is essential for understanding the scale or impact of different segments within the dataset. This requires periodic data aggregation to feed visuals and generate insights.

The AggregatorRoutine provides a powerful, flexible option for creating instances that can be rerun, recalled, and modified, significantly reducing the effort required for this task. This methodology supports the chain in pinpointing areas of high performance and identifying underperforming stores, enabling targeted strategies for improvement.

Advanced analytics tools, including machine learning models, can be applied to the aggregated data to forecast future trends and behaviors. These predictive insights allow the company to tailor its inventory and marketing efforts more effectively, enhancing customer satisfaction and driving sales growth. Additionally, integrating customer feedback and social media analytics into the aggregation process can offer a more nuanced understanding of consumer preferences and perceptions, further informing business strategy. By leveraging the AggregatorRoutine in conjunction with these analytical techniques, the retail chain streamlines its data management processes and fosters a culture of continuous improvement, supporting its long-term success in a competitive market.

2. Time Series Trend Analysis

A national retail chain aims to identify and analyze trends in consumer behavior and sales performance over time across its various regions. With operations spanning hundreds of stores and managing thousands of transactions daily, the company generates a considerable volume of data, ripe for detailed trend analysis. By employing data grouping techniques, the chain can analyze trends within specific intervals (daily, monthly, quarterly, or annually), providing a temporal dimension to the data that reveals how sales figures and consumer behavior evolve over time.

The AggregatorRoutine provides a powerful, flexible option for creating instances that can be rerun, recalled, and modified, significantly reducing the effort required for this task. This approach enables the company to adjust quickly to market demands, optimize inventory levels, and tailor marketing strategies to consumer preferences in real time.

Furthermore, by leveraging advanced analytics and machine learning algorithms, the chain can predict future trends, identifying potential growth opportunities or areas requiring intervention. This proactive stance ensures competitiveness in a rapidly changing retail environment, fostering innovation and customer satisfaction. The integration of geographical information systems (GIS) allows for the visualization of sales data across different regions, highlighting areas of high performance and those needing improvement. Through these sophisticated analytical methods, the company not only stays ahead in the industry but also contributes to a dynamic, customer-centric shopping experience.
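
The interval-based analysis described above amounts to time-window grouping. As a hedged sketch, polars' group_by_dynamic can resample a daily series to monthly totals; the data and column names here are illustrative, and the routine's actual interval handling may differ:

    import polars as pl
    from datetime import date

    # Hypothetical daily sales series for Q1 2024 (91 days).
    df = pl.DataFrame({
        "date": pl.date_range(
            date(2024, 1, 1), date(2024, 3, 31), interval="1d", eager=True
        ),
        "sales": [float(x) for x in range(91)],
    })

    # Windowed grouping: "1mo" could instead be "1d", "1q", or "1y".
    monthly = df.sort("date").group_by_dynamic("date", every="1mo").agg(
        pl.col("sales").sum().alias("monthly_sales")
    )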

Routine Methods

1. Aggregate (Method)
  • Method: aggregate
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: Yes

    • Method Limits: This method performs well across a range of dataset sizes, completing efficiently even on large datasets. Testing with 100 GB of memory shows completion times under 10 minutes for datasets ranging from 2,000 to 25,000 targets and up to 7.4 million rows.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Run an aggregation routine that groups by user-specified fields and applies the chosen aggregation type.
    • Detailed Description:

      • Multiple aggregation fields and types can be provided as a list of key-value pairs; a hypothetical input sketch appears after this method listing.
    • Inputs:

      • Required Input
        • Source Connection: The connection information for the source data.
          • Name: data_connection
          • Tooltip:
            • Detail:
              • Click the drop-down box to specify your dataset source.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name you are looking for in this list, it is recommended that you first move the data into a database that appears in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Columns to Group: Specify the column(s) that you want to be grouped.
          • Name: grouped_columns
          • Tooltip:
            • Detail:
              • Click the drop-down box or the search bar to specify your column(s).
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[str]
        • Aggregation Step Input: Continue: continue running the routine; Add: aggregate another column; Previous: modify your previous input.
          • Name: column_agg_params
          • Tooltip:
            • Detail:
              • Click on the drop down box to specify your next step.
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[AggregationColumnDefinition]
    • Artifacts:

      • Aggregate Data: Aggregated data based on user input.
        • Qualified Key Annotation: aggregated_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@aggregated_data/data_/data_<int>.parquet
            • A partitioned set of parquet files, where each file will have no more than 1,000,000 rows.
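
Because the artifact is emitted as partitioned parquet files, downstream consumers can scan every partition as one logical table. A minimal polars sketch, assuming only the annotated path pattern above:

    import polars as pl

    # Glob over the partitioned artifact files annotated above.
    artifact_glob = "artifacts_/@aggregated_data/data_/data_*.parquet"

    # Lazily scan all partitions as a single logical table, then materialize.
    aggregated = pl.scan_parquet(artifact_glob).collect()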
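
The exact invocation is platform-specific, but a hypothetical set of inputs for the aggregate method might look like the following. The input names match those documented above; the surrounding structure and the AggregationColumnDefinition keys ("column" and "aggregation") are assumptions for illustration, not a confirmed API:

    # Hypothetical input payload; only the input names are taken from the docs above.
    inputs = {
        "data_connection": {
            "tabular_connection": {                        # SQL Server Connection variant
                "database_resource": "analytics_server",   # illustrative value
                "database_name": "retail_dw",              # illustrative value
                "table_name": "transactions",              # illustrative value
            }
        },
        "grouped_columns": ["region", "store_id"],
        # One key-value pair per aggregated column (assumed field names).
        "column_agg_params": [
            {"column": "sales", "aggregation": "sum"},
            {"column": "sales", "aggregation": "mean"},
        ],
    }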

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: Aggregator

Method Name      Artifact Keys
aggregate        aggregated_data
