DataCleanser
Versions
v1.0.0
Basic Information
Class Name: DataCleanser
Title: Replace Special Characters
Version: 1.0.0
Author: Jeff Robinson
Organization: OneStream
Creation Date: 2024-02-27
Default Routine Memory Capacity: 2 GB
Tags
Data Transformation, Data Analysis, Time Series, Interpretability, Data Preprocessing, Data Cleansing, Anomaly
Description
Short Description
Cleanse special characters from a dataset based on a defined schema.
Long Description
Users can target specific columns within their datasets to identify and replace special characters, ensuring data consistency and validity. For each column to be cleansed, the user defines a find-and-replace schema: the column name, the special character to find, and the replacement character. This allows the cleanse to be tailored column by column for any dataset.
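Conceptually, each schema entry pairs a column with a character to find and its replacement. The following is a minimal pandas sketch of the idea; the field names (column, find, replace) are illustrative assumptions, not the routine's actual DataCleanseDefinition fields.

```python
# Minimal sketch of a find-and-replace schema, NOT the routine's internals.
# The definition field names below are assumptions for illustration.
import pandas as pd

definitions = [
    {"column": "Store", "find": "&", "replace": "_"},
    {"column": "Dept", "find": "&", "replace": "_"},
]

df = pd.DataFrame({"Store": ["Store&Co"], "Dept": ["Dept&Electronics"]})

for d in definitions:
    # regex=False performs a literal replacement, so characters such as
    # "&" or "{" are not treated as regex metacharacters.
    df[d["column"]] = df[d["column"]].str.replace(
        d["find"], d["replace"], regex=False
    )

print(df)  # Store_Co | Dept_Electronics
```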
Use Cases
1. Replace Invalid Ampersand Sign
In time series forecasting, particularly for retail analytics and inventory management, the consistency and accuracy of entity identifiers such as store and department names are crucial. Discrepancies in these identifiers, often caused by typographical errors or special characters that analytical engines do not recognize, create significant challenges for data aggregation and trend analysis, and ultimately degrade forecast accuracy. Special characters in names (e.g., "Store&Co", "Dept&Electronics") may arise from manual data entry errors, inconsistent data collection methods, or the integration of disparate data sources.

The "DataCleanser" routine is a critical tool for users, data scientists, and consultants preparing datasets for time series forecasting models. By identifying and replacing unwanted special characters in pivotal text fields, it standardizes store and department names across the dataset. For instance, transforming "Store&Co" into "Store_Co" and "Dept&Electronics" into "Dept_Electronics" not only ensures uniformity but also prevents processing errors caused by unrecognized characters.

This preprocessing step is especially beneficial when consolidating data from multiple sources for comprehensive time series analyses. Clean, standardized names allow data records to be matched and aggregated accurately, ensuring that all relevant transactions are attributed to the correct entities. That uniformity is critical for reliable insights into sales trends, seasonal demand fluctuations, and inventory needs, and it directly affects the predictive models used for planning and decision-making.
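To see why this matters for aggregation, consider a hedged sketch in which the same store appears under two spellings; the data and column names below are hypothetical.

```python
# Hypothetical data showing how inconsistent identifiers fragment a rollup.
import pandas as pd

sales = pd.DataFrame({
    "store": ["Store&Co", "Store_Co", "Store_Co"],
    "units": [10, 12, 9],
})

# Before cleansing: the same store aggregates as two separate entities.
print(sales.groupby("store")["units"].sum())  # Store&Co: 10, Store_Co: 21

# After replacing "&" with "_", all rows roll up to one entity.
sales["store"] = sales["store"].str.replace("&", "_", regex=False)
print(sales.groupby("store")["units"].sum())  # Store_Co: 31
```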
2. Replace Brackets
In data analysis, accuracy and standardization are of utmost importance. Discrepancies, often arising from unrecognized special characters, cause processing errors, hinder comparative analysis, and ultimately compromise the accuracy of financial insights. Special characters such as curly brackets ({}) may be introduced for formatting purposes (e.g., to flag dormant investments), through manual data entry errors, inconsistent data formatting practices, or the consolidation of data from diverse sources (e.g., mergers and legacy systems).

The "DataCleanser" routine is a critical tool for users, data scientists, and consultants organizing datasets for analysis or further consumption. It enables users to locate and eliminate unwanted characters from key string fields, for instance transforming "{PennyBlack}" into "PennyBlack" and "{Appetizers" into "Appetizers". A user provides the column (field) name, the special character to find, and the replacement character for as many find-and-replace pairs as needed. In the "{PennyBlack}" example, the user would specify the column name (e.g., "Store") and replace "{" with an empty string, then specify "Store" again and replace "}" with an empty string, as sketched below.

This preprocessing step is particularly important before ingesting data into SensibleAI Forecast, where unrecognized characters cause errors and prevent downstream features from working correctly. The routine pairs well with anomaly routines that have identified special characters or other data discrepancies.
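A minimal sketch of that chained bracket removal, assuming a simple pandas implementation (the definition field names are illustrative):

```python
# Two chained definitions on the same column, one per bracket character.
import pandas as pd

df = pd.DataFrame({"Store": ["{PennyBlack}", "{Appetizers"]})

definitions = [
    {"column": "Store", "find": "{", "replace": ""},
    {"column": "Store", "find": "}", "replace": ""},
]

for d in definitions:
    df[d["column"]] = df[d["column"]].str.replace(
        d["find"], d["replace"], regex=False
    )

print(df["Store"].tolist())  # ['PennyBlack', 'Appetizers']
```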
Routine Methods
1. Cleanse Data (Method)
- Method: cleanse_data
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: Memory usage scales with dataset size. Expected performance:

| Dataset Size | Memory Allocated | Approximate Runtime |
|---|---|---|
| 2M rows | 10 GB | 24 seconds |
| 5M rows | 15 GB | 60 seconds |
| 10M rows | 20 GB | 100 seconds |

- Outputs Dynamic Artifacts: No
- Short Description: Run a routine to find and replace special characters by column.
- Detailed Description: Multiple replacements can be input in key-value pair format, as sketched below.
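Rendered concretely, a set of replacement definitions in key-value pair form might look like the following; the field names are assumptions standing in for the actual DataCleanseDefinition schema.

```python
# Illustrative key-value pair format for multiple replacements.
# Field names are assumed for illustration, not the confirmed schema.
data_cleanse_definitions = [
    {"column": "Store", "find": "{", "replace": ""},   # strip opening brace
    {"column": "Store", "find": "}", "replace": ""},   # strip closing brace
    {"column": "Dept",  "find": "&", "replace": "_"},  # standardize ampersand
]
```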
Inputs:
(All inputs may be subject to additional validation constraints at runtime.)
- Source Connection (Required): The connection information for the source data.
  - Name: data_connection
  - Type: Must be an instance of Tabular Connection
  - Nested Model: Tabular Connection
    - Connection (Required): The connection type to use to access the source data.
      - Name: tabular_connection
      - Type: Must be one of the following:
        - SQL Server Connection
          - Database Resource (Required): The name of the database resource to connect to.
            - Name: database_resource
            - Type: str
          - Database Name (Required): The name of the database to connect to.
            - Name: database_name
            - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
            - Type: str
          - Table Name (Required): The name of the table to use.
            - Name: table_name
            - Type: str
        - MetaFileSystem Connection
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
          - File Path (Required): The full file path to the file to ingest.
            - Name: file_path
            - Type: str
        - Partitioned MetaFileSystem Connection
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
          - File Type (Required): The type of files to read from the directory.
            - Name: file_type
            - Type: FileExtensions_
          - Directory Path (Required): The full directory path containing partitioned tabular files.
            - Name: directory_path
            - Type: str
- Find & Replace Special Character (Required): List of operation columns, special characters, and replacement characters.
  - Name: data_cleanse_definitions
  - Type: list[DataCleanseDefinition]
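Putting the nested inputs together, a composed payload for the SQL Server connection variant might be sketched as follows. The exact nesting the framework expects may differ, and every resource, database, and table name below is a placeholder; the MetaFileSystem variants would instead supply connection_key with file_path, or connection_key with file_type and directory_path.

```python
# Hypothetical composed input payload; all names are placeholders.
inputs = {
    "data_connection": {                       # Source Connection (Tabular Connection)
        "tabular_connection": {                # Connection: SQL Server Connection
            "database_resource": "sql_prod",   # Database Resource (placeholder)
            "database_name": "analytics",      # Database Name (placeholder)
            "table_name": "retail_sales",      # Table Name (placeholder)
        }
    },
    "data_cleanse_definitions": [              # Find & Replace Special Character
        {"column": "Store", "find": "&", "replace": "_"},
    ],
}
```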
Artifacts:
- Cleansed Data: Find & Replace based on user input
  - Qualified Key Annotation: cleanse_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@cleanse_data/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
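To work with the output downstream, the partitioned parquet files can be reassembled into a single table. A hedged sketch, assuming the artifact directory has been materialized to local storage at the annotated path:

```python
# Reassemble the cleanse_data artifact from its parquet partitions.
# The local path below is an assumption standing in for wherever the
# framework materializes artifacts_/@cleanse_data/data_/ on your storage.
from pathlib import Path
import pandas as pd

artifact_dir = Path("artifacts_/@cleanse_data/data_")  # assumed local path
parts = sorted(artifact_dir.glob("data_*.parquet"))

# Each partition holds at most 1,000,000 rows; concatenate to reassemble.
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
```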
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: DataCleanser
| Method Name | Artifact Keys |
|---|---|
| cleanse_data | cleanse_data |