SourceDataAnalysis
Versions
v1.0.0
Basic Information
Class Name: SourceDataAnalysis
Title: Source Data Analysis
Version: 1.0.0
Author: Chris Bahr
Organization: OneStream
Creation Date: 2024-08-06
Default Routine Memory Capacity: 2.0 GB
Tags
Data Analysis, Data Visualization, Metrics
Description
Short Description
A generic tabular data analysis routine
Long Description
A routine that takes in a tabular dataset and generates an insightful HTML report. This routine allows users to glean insights from any tabular dataset, time series or not, with a single input.
Use Cases
1. Data Pipeline Monitoring
Consultants building robust data pipelines often need to integrate data from multiple sources, perform various transformations, and continuously monitor the quality and structure of the data throughout the process. Using this Source Data Analysis routine, consultants can inspect data at each stage of the pipeline. The routine allows them to generate summary statistics, visualize data distributions, and detect anomalies or inconsistencies as the data moves through the pipeline. By incorporating checkpoints within the pipeline, consultants can ensure that transformations are correctly applied and that the data remains clean and consistent. This capability is crucial for maintaining the integrity of the data, facilitating smooth downstream analysis, and ultimately delivering accurate and reliable insights to clients. It also aids in debugging and optimizing the pipeline, saving valuable time and resources.
2. Exploratory Tabular Data Analysis
In any data-driven project, ensuring the integrity and usefulness of tabular datasets is paramount. This routine is designed to handle a variety of tabular datasets, regardless of their source or format. Analysts can use it to perform a comprehensive assessment of the dataset at hand. The routine facilitates exploratory data analysis (EDA), allowing analysts to generate summary statistics, identify patterns, and visualize data distributions through various plots and charts. By applying it, analysts can ensure that the tabular dataset is thoroughly understood and ready for in-depth analysis, thereby enhancing the accuracy and reliability of their findings. This capability is essential across industries such as finance, healthcare, and marketing, where data-driven insights drive strategic decision-making.
Routine Methods
1. Source Data Analysis (Method)
- Method: source_data_analysis
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: This method has been tested with a dataset with 10 columns and 1M rows, and completed in roughly 1 minute with 4 GB of memory. Additionally, a dataset with 10 columns and 10M rows completed in roughly 4 minutes with 100 GB of memory.
- Outputs Dynamic Artifacts: No
- Short Description: Create an HTML report to help users better understand a dataset.
- Detailed Description: This routine helps users better understand any tabular dataset. It is meant to quickly generate an HTML report providing high-level statistics about the dataset, such as the number of columns (variables), rows (observations), missing values, and duplicate rows. Users can explore the report to view information about each column, and also view a sample of the first or last ten rows of the dataset.
- Inputs:
  - Required Input
    - Source Connection Option: The connection type to use to access the source data.
      - Name: data_connection
      - Tooltip:
        - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        - Required Input
          - Connection: The connection type to use to access the source data.
            - Name: tabular_connection
            - Tooltip:
              - Validation Constraints: This input may be subject to other validation constraints at runtime.
            - Type: Must be one of the following
              - SQL Server Connection
                - Required Input
                  - Database Resource: The name of the database resource to connect to.
                    - Name: database_resource
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Database Name: The name of the database to connect to.
                    - Name: database_name
                    - Tooltip:
                      - Detail: Note: If you don’t see the database name that you are looking for in this list, it is recommended that you first move the data to be used within a database that is available within this list.
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Table Name: The name of the table to use.
                    - Name: table_name
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
              - MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Path: The full file path to the file to ingest.
                    - Name: file_path
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
              - Partitioned MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Type: The type of files to read from the directory.
                    - Name: file_type
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: FileExtensions_
                  - Directory Path: The full directory path containing partitioned tabular files.
                    - Name: directory_path
                    - Tooltip:
                      - Validation Constraints: This input may be subject to other validation constraints at runtime.
                    - Type: str
- Artifacts:
  - Data Analysis Report: An HTML report presenting analysis of a tabular dataset
    - Qualified Key Annotation: report_content
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@report_content/data_/html_content.html: The HTML content.
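The overview statistics named in the Detailed Description (column and row counts, missing values, duplicate rows, and a sample of the first or last ten rows) can be illustrated with a minimal sketch. This is not the routine's implementation, which this page does not show. The `summarize` function, the sample rows, and the result key names are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"id": 1, "region": "East", "amount": 10.5},
    {"id": 2, "region": None, "amount": 7.0},
    {"id": 2, "region": None, "amount": 7.0},
    {"id": 3, "region": "West", "amount": None},
]


def summarize(rows, sample_size=10):
    """Compute overview statistics for a list of row dictionaries."""
    columns = list(rows[0].keys()) if rows else []
    # Count missing (None) values per column.
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    # A row counts as a duplicate when an identical row appeared earlier.
    counts = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {
        "n_columns": len(columns),
        "n_rows": len(rows),
        "missing_by_column": missing,
        "n_missing": sum(missing.values()),
        "n_duplicate_rows": duplicates,
        "head": rows[:sample_size],   # first rows, up to sample_size
        "tail": rows[-sample_size:],  # last rows, up to sample_size
    }


summary = summarize(rows)
```

A real report would likely also include per-column detail and plots; the sketch covers only the dataset-level counts.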
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: SourceDataAnalysis
| Method Name | Artifact Keys |
|---|---|
| source_data_analysis | report_content |
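The inputs for `source_data_analysis` nest a connection-specific set of fields inside `tabular_connection`, which in turn sits inside `data_connection`. The sketch below mirrors that nesting as a plain Python dictionary and checks that the required fields for a chosen connection type are present. The field names come from the input listing on this page; the dictionary layout, the `type` key, the example values, and the `missing_fields` helper are hypothetical and do not represent the product's actual API or payload format.

```python
# Required field names per connection type, taken from the documented inputs.
REQUIRED_FIELDS = {
    "SQL Server Connection": (
        "database_resource", "database_name", "table_name",
    ),
    "MetaFileSystem Connection": (
        "connection_key", "file_path",
    ),
    "Partitioned MetaFileSystem Connection": (
        "connection_key", "file_type", "directory_path",
    ),
}


def missing_fields(connection_type, values):
    """Return the required field names that are absent or empty in `values`."""
    if connection_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown connection type: {connection_type!r}")
    return [
        name for name in REQUIRED_FIELDS[connection_type]
        if not values.get(name)
    ]


# Hypothetical inputs, nested the way the documentation nests them:
# data_connection -> tabular_connection -> connection-specific fields.
inputs = {
    "data_connection": {
        "tabular_connection": {
            "type": "SQL Server Connection",  # hypothetical discriminator key
            "database_resource": "example_resource",
            "database_name": "example_db",
            "table_name": "",  # left empty to show the check
        }
    }
}

conn = inputs["data_connection"]["tabular_connection"]
problems = missing_fields(conn["type"], conn)
```

Here `problems` lists `table_name`, since that field was left empty. Note that the page also says inputs may be subject to other validation constraints at runtime, which a check like this does not cover.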