DataCleanser
Versions
v1.0.0
Basic Information
Class Name: DataCleanser
Title: Replace Special Characters
Version: 1.0.0
Author: Jeff Robinson
Organization: OneStream
Creation Date: 2024-02-27
Default Routine Memory Capacity: 2 GB
Tags
Data Transformation, Data Analysis, Time Series, Interpretability, Data Preprocessing, Data Cleansing, Anomaly
Description
Short Description
Cleanse special characters from a dataset based on a defined schema.
Long Description
Users can target specific columns within their datasets to identify and replace special characters, ensuring data consistency and validity. For each column to be cleansed, the user defines a find-and-replace schema: the column name, the special character to find, and the replacement character. This allows the cleanse to be tailored column by column for any dataset.
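Conceptually, each schema entry pairs a column with a character to find and its replacement. The following is a minimal pandas sketch of the idea; the field names (column, find, replace) are illustrative assumptions, not the routine's actual DataCleanseDefinition fields.

```python
# Minimal sketch of a find-and-replace schema, NOT the routine's internals.
# The definition field names below are assumptions for illustration.
import pandas as pd

definitions = [
    {"column": "Store", "find": "&", "replace": "_"},
    {"column": "Dept", "find": "&", "replace": "_"},
]

df = pd.DataFrame({"Store": ["Store&Co"], "Dept": ["Dept&Electronics"]})

for d in definitions:
    # regex=False performs a literal replacement, so characters such as
    # "&" or "{" are not treated as regex metacharacters.
    df[d["column"]] = df[d["column"]].str.replace(
        d["find"], d["replace"], regex=False
    )

print(df)  # Store_Co | Dept_Electronics
```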
Use Cases
1. Replace Invalid Ampersand Sign
In time series forecasting, particularly for retail analytics and inventory management, the consistency and accuracy of entity identifiers such as store and department names are crucial. Discrepancies in these identifiers, often caused by typographical errors or special characters that analytical engines do not recognize, create significant challenges for data aggregation and trend analysis, and ultimately degrade forecast accuracy. Special characters in names (e.g., "Store&Co", "Dept&Electronics") may arise from manual data entry errors, inconsistent data collection methods, or the integration of disparate data sources.

The "DataCleanser" routine is a critical tool for users, data scientists, and consultants preparing datasets for time series forecasting models. By identifying and replacing unwanted special characters in pivotal text fields, it standardizes store and department names across the dataset. For instance, transforming "Store&Co" into "Store_Co" and "Dept&Electronics" into "Dept_Electronics" not only ensures uniformity but also prevents processing errors caused by unrecognized characters.

This preprocessing step is especially beneficial when consolidating data from multiple sources for comprehensive time series analyses. Clean, standardized names allow data records to be matched and aggregated accurately, ensuring that all relevant transactions are attributed to the correct entities. That uniformity is critical for reliable insights into sales trends, seasonal demand fluctuations, and inventory needs, and it directly affects the predictive models used for planning and decision-making.
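To see why this matters for aggregation, consider a hedged sketch in which the same store appears under two spellings; the data and column names below are hypothetical.

```python
# Hypothetical data showing how inconsistent identifiers fragment a rollup.
import pandas as pd

sales = pd.DataFrame({
    "store": ["Store&Co", "Store_Co", "Store_Co"],
    "units": [10, 12, 9],
})

# Before cleansing: the same store aggregates as two separate entities.
print(sales.groupby("store")["units"].sum())  # Store&Co: 10, Store_Co: 21

# After replacing "&" with "_", all rows roll up to one entity.
sales["store"] = sales["store"].str.replace("&", "_", regex=False)
print(sales.groupby("store")["units"].sum())  # Store_Co: 31
```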
2. Replace Brackets
In data analysis, accuracy and standardization are of utmost importance. Discrepancies, often arising from unrecognized special characters, cause processing errors, hinder comparative analysis, and ultimately compromise the accuracy of financial insights. Special characters such as curly brackets ({}) may be introduced for formatting purposes (e.g., to flag dormant investments), through manual data entry errors, inconsistent data formatting practices, or the consolidation of data from diverse sources (e.g., mergers and legacy systems).

The "DataCleanser" routine is a critical tool for users, data scientists, and consultants organizing datasets for analysis or further consumption. It enables users to locate and eliminate unwanted characters from key string fields, for instance transforming "{PennyBlack}" into "PennyBlack" and "{Appetizers" into "Appetizers". A user provides the column (field) name, the special character to find, and the replacement character for as many find-and-replace pairs as needed. In the "{PennyBlack}" example, the user would specify the column name (e.g., "Store") and replace "{" with an empty string, then specify "Store" again and replace "}" with an empty string, as sketched below.

This preprocessing step is particularly important before ingesting data into SensibleAI Forecast, where unrecognized characters cause errors and prevent downstream features from working correctly. The routine pairs well with anomaly routines that have identified special characters or other data discrepancies.
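A minimal sketch of that chained bracket removal, assuming a simple pandas implementation (the definition field names are illustrative):

```python
# Two chained definitions on the same column, one per bracket character.
import pandas as pd

df = pd.DataFrame({"Store": ["{PennyBlack}", "{Appetizers"]})

definitions = [
    {"column": "Store", "find": "{", "replace": ""},
    {"column": "Store", "find": "}", "replace": ""},
]

for d in definitions:
    df[d["column"]] = df[d["column"]].str.replace(
        d["find"], d["replace"], regex=False
    )

print(df["Store"].tolist())  # ['PennyBlack', 'Appetizers']
```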
Routine Methods
1. Cleanse Data (Method)
- Method: cleanse_data
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: Memory usage scales with dataset size. Expected performance:

| Dataset Size | Memory Allocated | Approximate Runtime |
|---|---|---|
| 2M rows | 10 GB | 24 seconds |
| 5M rows | 15 GB | 60 seconds |
| 10M rows | 20 GB | 100 seconds |

- Outputs Dynamic Artifacts: No
- Short Description: Run a routine to find and replace special characters by column.
- Detailed Description: Multiple replacements can be input in key-value pair format, as sketched below.
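Rendered concretely, a set of replacement definitions in key-value pair form might look like the following; the field names are assumptions standing in for the actual DataCleanseDefinition schema.

```python
# Illustrative key-value pair format for multiple replacements.
# Field names are assumed for illustration, not the confirmed schema.
data_cleanse_definitions = [
    {"column": "Store", "find": "{", "replace": ""},   # strip opening brace
    {"column": "Store", "find": "}", "replace": ""},   # strip closing brace
    {"column": "Dept",  "find": "&", "replace": "_"},  # standardize ampersand
]
```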
Inputs:
(All inputs may be subject to additional validation constraints at runtime.)
- Source Connection (Required): The connection information for the source data.
  - Name: data_connection
  - Type: Must be an instance of Tabular Connection
  - Nested Model: Tabular Connection
    - Connection (Required): The connection type to use to access the source data.
      - Name: tabular_connection
      - Type: Must be one of the following:
        - SQL Server Connection
          - Database Resource (Required): The name of the database resource to connect to.
            - Name: database_resource
            - Type: str
          - Database Name (Required): The name of the database to connect to.
            - Name: database_name
            - Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
            - Type: str
          - Table Name (Required): The name of the table to use.
            - Name: table_name
            - Type: str
        - MetaFileSystem Connection
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
          - File Path (Required): The full file path to the file to ingest.
            - Name: file_path
            - Type: str
        - Partitioned MetaFileSystem Connection
          - Connection Key (Required): The MetaFileSystem connection key.
            - Name: connection_key
            - Type: MetaFileSystemConnectionKey
          - File Type (Required): The type of files to read from the directory.
            - Name: file_type
            - Type: FileExtensions_
          - Directory Path (Required): The full directory path containing partitioned tabular files.
            - Name: directory_path
            - Type: str
- Find & Replace Special Character (Required): List of operation columns, special characters, and replacement characters.
  - Name: data_cleanse_definitions
  - Type: list[DataCleanseDefinition]
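Putting the nested inputs together, a composed payload for the SQL Server connection variant might be sketched as follows. The exact nesting the framework expects may differ, and every resource, database, and table name below is a placeholder; the MetaFileSystem variants would instead supply connection_key with file_path, or connection_key with file_type and directory_path.

```python
# Hypothetical composed input payload; all names are placeholders.
inputs = {
    "data_connection": {                       # Source Connection (Tabular Connection)
        "tabular_connection": {                # Connection: SQL Server Connection
            "database_resource": "sql_prod",   # Database Resource (placeholder)
            "database_name": "analytics",      # Database Name (placeholder)
            "table_name": "retail_sales",      # Table Name (placeholder)
        }
    },
    "data_cleanse_definitions": [              # Find & Replace Special Character
        {"column": "Store", "find": "&", "replace": "_"},
    ],
}
```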
Artifacts:
- Cleansed Data: Find & Replace based on user input
  - Qualified Key Annotation: cleanse_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations: artifacts_/@cleanse_data/data_/data_<int>.parquet - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
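To work with the output downstream, the partitioned parquet files can be reassembled into a single table. A hedged sketch, assuming the artifact directory has been materialized to local storage at the annotated path:

```python
# Reassemble the cleanse_data artifact from its parquet partitions.
# The local path below is an assumption standing in for wherever the
# framework materializes artifacts_/@cleanse_data/data_/ on your storage.
from pathlib import Path
import pandas as pd

artifact_dir = Path("artifacts_/@cleanse_data/data_")  # assumed local path
parts = sorted(artifact_dir.glob("data_*.parquet"))

# Each partition holds at most 1,000,000 rows; concatenate to reassemble.
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
```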
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: DataCleanser
| Method Name | Artifact Keys |
|---|---|
| cleanse_data | cleanse_data |