
DataCleanser


Basic Information

Class Name: DataCleanser

Title: Replace Special Characters

Version: 1.0.0

Author: Jeff Robinson

Organization: OneStream

Creation Date: 2024-02-27

Default Routine Memory Capacity: 2 GB

Tags

Data Transformation, Data Analysis, Time Series, Interpretability, Data Preprocessing, Data Cleansing, Anomaly

Description

Short Description

Cleanse special characters from a dataset based on a defined schema.

Long Description

Users can target specific columns within their datasets to identify and replace special characters, ensuring data consistency and validity. The routine supports focused cleansing of any dataset: simply provide the column name, the special character you wish to find, and the replacement character. A user can custom-tailor the cleanse by defining a find-and-replace schema for each column of a dataset, as in the sketch below.
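As a sketch of the schema shape, each definition pairs one column with one find-and-replace rule, and a run can carry as many definitions as needed. The key names below are illustrative, not the routine's actual DataCleanseDefinition field names:

```python
# Hypothetical find-and-replace schema; the key names ("column", "find",
# "replace") are illustrative, not the routine's DataCleanseDefinition fields.
cleanse_definitions = [
    {"column": "Store", "find": "&", "replace": "_"},
    {"column": "Product", "find": "#", "replace": ""},
]
```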

Use Cases

1. Replace Invalid Ampersand Sign

In the realm of time series forecasting, particularly for retail analytics and inventory management, the consistency and accuracy of entity identifiers such as store and department names are crucial. Discrepancies in these identifiers, often due to typographical errors or special characters not recognized by analytical engines, can lead to significant challenges in data aggregation and trend analysis, and ultimately undermine the accuracy of forecasts. Special characters in names (e.g., "Store&Co", "Dept&Electronics") may arise from manual data entry errors, inconsistencies in data collection methods, or the integration of disparate data sources.

The "DataCleanser" routine is a critical tool for users, data scientists, and consultants tasked with preparing datasets for time series forecasting models. By enabling users to identify and replace unwanted special characters in pivotal text fields, the routine helps standardize store and department names across the dataset. For instance, transforming "Store&Co" into "Store_Co" and "Dept&Electronics" into "Dept_Electronics" not only ensures uniformity but also prevents data processing errors that could emerge from unrecognized characters.

This preprocessing step is especially beneficial when consolidating data from multiple sources for comprehensive time series analyses. Clean, standardized names facilitate accurate matching and aggregation of data records, ensuring that all relevant transactions are correctly attributed to their respective entities. That uniformity is critical for generating reliable insights into sales trends, seasonal demand fluctuations, and inventory needs, directly impacting the effectiveness of predictive models used for planning and decision-making.
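In pandas terms, the standardization described above reduces to a literal, per-column string replacement. A minimal sketch with made-up sample data, not the routine's internal code:

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the use case above.
df = pd.DataFrame({
    "Store": ["Store&Co", "MainSt"],
    "Dept": ["Dept&Electronics", "Dept&Grocery"],
})

# Literal (non-regex) replacement of "&" with "_" in each identifier column.
for col in ("Store", "Dept"):
    df[col] = df[col].str.replace("&", "_", regex=False)

print(df["Store"].tolist())  # ['Store_Co', 'MainSt']
print(df["Dept"].tolist())   # ['Dept_Electronics', 'Dept_Grocery']
```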

2. Replace Brackets

In the landscape of data analysis, the accuracy and standardization of data are of utmost importance. Discrepancies, often arising from unrecognized special characters, can cause processing errors, hinder comparative analysis, and ultimately degrade the accuracy of financial insights. Special characters such as brackets ({}) may be introduced for formatting purposes (e.g., to flag dormant investments), through manual data entry errors, inconsistent data formatting practices, or the consolidation of data from diverse sources (e.g., mergers and legacy systems).

The "DataCleanser" routine becomes a critical tool for users, data scientists, and consultants organizing and analyzing datasets for analysis or further consumption. It enables users to locate and eliminate unwanted characters from key string fields, transforming, for instance, "{PennyBlack}" into "PennyBlack" and "{Appetizers" into "Appetizers". A user provides the column (field) name, the special character to find, and the replacement character, for as many instances as needed. In the "{PennyBlack}" example, the user would enter the column name "Store" twice: once to replace "{" with "" and once to replace "}" with "".

This preprocessing step is particularly important before ingesting data into SensibleAI Forecast, where unrecognized characters cause errors and prevent downstream steps from completing correctly. This routine pairs well with anomaly routines that have identified special characters or data discrepancies.
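Because each definition targets a single character, stripping paired brackets from a column takes two entries. A minimal pandas sketch of that two-step schema, again with illustrative key names:

```python
import pandas as pd

# Two definitions on the same column, one per bracket character. The key
# names ("column", "find", "replace") are illustrative, not the routine's
# actual DataCleanseDefinition fields.
definitions = [
    {"column": "Store", "find": "{", "replace": ""},
    {"column": "Store", "find": "}", "replace": ""},
]

df = pd.DataFrame({"Store": ["{PennyBlack}", "{Appetizers"]})
for d in definitions:
    df[d["column"]] = df[d["column"]].str.replace(
        d["find"], d["replace"], regex=False
    )

print(df["Store"].tolist())  # ['PennyBlack', 'Appetizers']
```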

Routine Methods

1. Cleanse Data (Method)
  • Method: cleanse_data
    • Type: Method

    • Memory Capacity: 2.0 GB

    • Allow In-Memory Execution: No

    • Read Only: Yes

    • Method Limits: Memory usage scales with dataset size. For a dataset with 2M rows, this method is expected to complete in around 24 seconds with 10GB of memory allocated. For 5M rows, this method is expected to complete in around 60 seconds with 15GB of memory allocated. For 10M rows, this method is expected to complete in around 100 seconds with 20GB of memory allocated.

    • Outputs Dynamic Artifacts: No

    • Short Description:

      • Run a routine to find and replace special characters on a per-column basis.
    • Detailed Description:

      • Multiple replacements can be input in key-value pair format.
    • Inputs:

      • Required Input
        • Source Connection: The connection information for the source data.
          • Name: data_connection
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: Must be an instance of Tabular Connection
          • Nested Model: Tabular Connection
            • Required Input
              • Connection: The connection type to use to access the source data.
                • Name: tabular_connection
                • Tooltip:
                  • Validation Constraints:
                    • This input may be subject to other validation constraints at runtime.
                • Type: Must be one of the following
                  • SQL Server Connection
                    • Required Input
                      • Database Resource: The name of the database resource to connect to.
                        • Name: database_resource
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Database Name: The name of the database to connect to.
                        • Name: database_name
                        • Tooltip:
                          • Detail:
                            • Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data into a database that is available in this list.
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                      • Table Name: The name of the table to use.
                        • Name: table_name
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Path: The full file path to the file to ingest.
                        • Name: file_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
                  • Partitioned MetaFileSystem Connection
                    • Required Input
                      • Connection Key: The MetaFileSystem connection key.
                        • Name: connection_key
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: MetaFileSystemConnectionKey
                      • File Type: The type of files to read from the directory.
                        • Name: file_type
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: FileExtensions_
                      • Directory Path: The full directory path containing partitioned tabular files.
                        • Name: directory_path
                        • Tooltip:
                          • Validation Constraints:
                            • This input may be subject to other validation constraints at runtime.
                        • Type: str
        • Find & Replace Special Character: The list of columns to operate on, the special characters to find, and their replacement characters.
          • Name: data_cleanse_definitions
          • Tooltip:
            • Validation Constraints:
              • This input may be subject to other validation constraints at runtime.
          • Type: list[DataCleanseDefinition]
    • Artifacts:

      • Cleansed Data: The dataset after the find-and-replace operations defined by user input
        • Qualified Key Annotation: cleanse_data
        • Aggregate Artifact: False
        • In-Memory Json Accessible: False
        • File Annotations:
          • artifacts_/@cleanse_data/data_/data_<int>.parquet
            • A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
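The cleansed output is written as partitioned parquet files. A minimal sketch of reading such a partition set back into a single pandas frame, assuming a local copy of the annotated directory layout:

```python
import glob
import pandas as pd

# Hypothetical local copy of the artifact layout:
# artifacts_/@cleanse_data/data_/data_<int>.parquet
paths = sorted(glob.glob("artifacts_/@cleanse_data/data_/data_*.parquet"))

# Each file holds at most 1,000,000 rows; concatenate them into one frame.
cleansed = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)
```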

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: DataCleanser

Method Name: cleanse_data
Artifact Keys: cleanse_data
