PrincipalComponentAnalysisRoutine
Versions
v1.0.0
Basic Information
Class Name: PrincipalComponentAnalysisRoutine
Title: Principal Component Analysis
Version: 1.0.0
Author: Josh Liu
Organization: OneStream
Creation Date: 2024-07-15
Default Routine Memory Capacity: 2.0 GB
Tags
ML, Time Series, Data Transformation, Dimensionality Reduction, Feature Generation
Description
Short Description
The Principal Component Analysis algorithm for dimensionality reduction.
Long Description
Principal Component Analysis (PCA) is a statistical method used for dimensionality reduction, data compression, and feature extraction. PCA projects the data onto a new set of orthogonal axes, called principal components, ordered by how much of the data's variance they capture. Transforming the data onto the leading principal components reduces its dimensionality while retaining as much of the original variability as possible.
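As a minimal illustration of the transformation described above (not the routine's internal implementation), the equivalent operation in scikit-learn looks like the following; the synthetic data is a placeholder:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic placeholder data: 500 observations of 12 correlated features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 12)) @ rng.normal(size=(12, 12))

# Project the data onto the 3 axes (principal components) that capture
# the greatest variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # shape (500, 3)

# Fraction of the total variance retained by each principal component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```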
Use Cases
1. Anomaly Detection in Fraudulent Transactions
In time series data, anomalies are data points that deviate significantly from the normal pattern of the data. Detecting fraudulent transactions involves identifying unusual patterns in transaction volumes or values. PCA can reduce the dimensionality of the data to the principal components that capture the majority of the variability. After this transformation, data points that are poorly represented in the reduced-dimensional space, for example those with a large reconstruction error, can be flagged as anomalies (see the sketch below).
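A minimal sketch of this anomaly-detection idea, assuming a hypothetical numeric matrix of transaction features; the data, component count, and threshold are placeholders and are not part of the routine:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction features: amount, hour of day, merchant count, ...
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))

X_scaled = StandardScaler().fit_transform(X)

# Keep the components that explain most of the variance.
pca = PCA(n_components=3).fit(X_scaled)
X_proj = pca.transform(X_scaled)

# Reconstruction error: how poorly each point is represented
# in the reduced-dimensional space.
X_recon = pca.inverse_transform(X_proj)
errors = np.linalg.norm(X_scaled - X_recon, axis=1)

# Flag the most poorly reconstructed points as candidate anomalies.
threshold = np.quantile(errors, 0.99)
anomalies = np.where(errors > threshold)[0]
print(anomalies)
```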
2. Forecasting Electricity Demand
Forecasting electricity demand involves data with many features, such as temperature, day of the week, time of day, and historical usage patterns. Each independent variable may contribute to the overall pattern in a different way. PCA identifies the components that explain the majority of the variance in the data, which reduces the complexity of the forecasting model. The principal components derived from PCA can then be used as input features for forecasting models such as ARIMA, neural networks, and other regression models (see the sketch below).
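A minimal sketch of using principal components as forecasting features, with a hypothetical demand dataset and a plain linear regression standing in for the forecasting model; every name and value below is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical hourly demand data with several explanatory features.
rng = np.random.default_rng(1)
n = 24 * 365
hours = np.arange(n)
features = pd.DataFrame({
    "temperature": rng.normal(15.0, 8.0, n),
    "humidity": rng.uniform(20.0, 90.0, n),
    "hour_of_day": hours % 24,
    "day_of_week": (hours // 24) % 7,
    "lagged_demand": rng.normal(1000.0, 200.0, n),
})
demand = 900 + 5 * features["temperature"] + rng.normal(0, 50, n)

# Compress the correlated explanatory features into a few components.
X_scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=3).fit_transform(X_scaled)

# Use the components as regressors in a simple forecasting model.
model = LinearRegression().fit(components, demand)
print(model.score(components, demand))  # in-sample R^2
```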
Routine Methods
1. Run Pca (Method)
- Method: run_pca
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: This method has been tested with datasets of the following sizes.

  | Targets | Rows | Memory Allocated | Completion Time |
  |---|---|---|---|
  | 5K | 550K | 2 GB | 1 minute |
  | 10K | 1.1M | 2 GB | 1 minute |
  | 15K | 7.5M | 5 GB | 1 minute |
  | 40K | 29M | 10 GB | 2 minutes |

- Outputs Dynamic Artifacts: No
- Short Description: Main method for the Principal Component Analysis routine.
- Detailed Description: This method deseasonalizes the data, scales the data, and runs Principal Component Analysis. An illustrative open-source sketch of a comparable pipeline is shown after the Inputs list below.
Inputs:

- Source Data Definition: The source data definition. (Required)
  - Name: source_data_definition
  - Type: Must be an instance of Time Series Source Data
  - Nested Model: Time Series Source Data
  - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Connection: The connection to the source data. (Required)
    - Name: data_connection
    - Type: Must be an instance of Tabular Connection
    - Nested Model: Tabular Connection
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Connection: The connection type to use to access the source data. (Required)
      - Name: tabular_connection
      - Type: Must be one of the following: SQL Server Connection, MetaFileSystem Connection, Partitioned MetaFileSystem Connection
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - SQL Server Connection
        - Database Resource: The name of the database resource to connect to. (Required)
          - Name: database_resource
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - Database Name: The name of the database to connect to.
          - Name: database_name
          - Detail: Note: If you do not see the database name you are looking for in this list, it is recommended that you first move the data into a database that is available in this list.
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - Table Name: The name of the table to use.
          - Name: table_name
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - MetaFileSystem Connection
        - Connection Key: The MetaFileSystem connection key. (Required)
          - Name: connection_key
          - Type: MetaFileSystemConnectionKey
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - File Path: The full file path to the file to ingest.
          - Name: file_path
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Partitioned MetaFileSystem Connection
        - Connection Key: The MetaFileSystem connection key. (Required)
          - Name: connection_key
          - Type: MetaFileSystemConnectionKey
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - File Type: The type of files to read from the directory.
          - Name: file_type
          - Type: FileExtensions_
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - Directory Path: The full directory path containing partitioned tabular files.
          - Name: directory_path
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Dimension Columns: The columns to use as dimensions. (Required)
      - Name: dimension_columns
      - Type: list[str]
      - Validation Constraints:
        - The input must have a minimum length of 1.
        - This input may be subject to other validation constraints at runtime.
    - Date Column: The column to use as the date.
      - Name: date_column
      - Detail: The date column must be in a DateTime-readable format.
      - Type: str
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Value Column: The column to use as the value.
      - Name: value_column
      - Detail: The value column must be a numeric (int, float, double, decimal, etc.) column.
      - Type: str
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Number of Principal Components: Choose whether to specify a single number of components or a range of values. (Required)
    - Name: n_components_option
    - Detail: 2-3 principal components are recommended if the goal is data visualization. The number of principal components must be less than the number of dimensions in the dataset.
    - Type: Must be one of the following: Single, Range
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Single
      - Single Number of Components: The number of components to decompose the dataset. (Required)
        - Name: n_components
        - Detail: The number of principal components must be less than the total number of dimensions in the data.
        - Type: int
        - Validation Constraints:
          - The input must be greater than 0.
          - This input may be subject to other validation constraints at runtime.
    - Range
      - Lower Bound Number of Components: The lowest number of components to decompose the dataset. (Required)
        - Name: n_components_start
        - Detail: The lower bound number of components must be greater than or equal to 1 and less than the upper bound number of components.
        - Type: int
        - Validation Constraints:
          - The input must be greater than 0.
          - This input may be subject to other validation constraints at runtime.
      - Upper Bound Number of Components: The highest number of components to decompose the dataset.
        - Name: n_components_end
        - Detail: The upper bound number of components must be less than or equal to the number of dimensions in the dataset.
        - Type: int
        - Validation Constraints:
          - The input must be greater than 1.
          - This input may be subject to other validation constraints at runtime.
  - Data Stationarity: Specify if the time series data is considered stationary.
    - Name: stationarity_data_state
    - Detail: A time series dataset is considered stationary if its statistical properties, such as the mean or variance, do not change over time; the statistical properties of a non-stationary time series change over time.
    - Type: bool
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Perform Seasonal Decomposition: Specify if the time series data should be seasonally decomposed.
    - Name: seasonal_decomposition
    - Detail: Seasonal decomposition stationarizes the dataset by removing its seasonal component (deseasonalizing). Many statistical and machine learning models assume the data is stationary.
    - Type: bool
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Data Scaling Method: Specify the type of scaling or standardization used to transform the dataset.
    - Name: scale_standardize_data
    - Detail: Data scaling/standardization is recommended when the differences in scale between dimension values are large.
    - Type: ScaleType_
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
- advanced_pca_parameters: Specify optional advanced PCA settings. (Optional)
  - Name: set_advanced_settings
  - Detail: If set to False, the advanced settings use their default values. For more information on the advanced PCA settings, see the scikit-learn PCA documentation. A sketch using these parameters appears after this Inputs list.
  - Type: Must be an instance of PCA Advanced Parameters
  - Nested Model: PCA Advanced Parameters
  - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Whiten: Whitening removes some information from the transformed signal but can improve the predictive accuracy of downstream estimators by making their data respect some hard-wired assumptions. (Required)
    - Name: whiten
    - Detail: Multiplies the component vectors by the square root of n_samples and then divides by the singular values.
    - Type: bool
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Svd Solver: Select the algorithm that will be used to perform singular value decomposition.
    - Name: svd_solver
    - Detail: For more information on svd_solver, see the scikit-learn PCA documentation.
    - Type: SvdSolverEnum_
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Tolerance: Tolerance for singular values computed by svd_solver.
    - Name: tol
    - Detail: Should only be changed if svd_solver is set to arpack.
    - Type: float
    - Validation Constraints:
      - The input must be greater than or equal to 0.
      - This input may be subject to other validation constraints at runtime.
  - Number of Iterations: Number of iterations for the power method computed by svd_solver. Input must be a positive integer or "auto".
    - Name: iterated_power
    - Detail: Should only be changed if svd_solver is set to randomized.
    - Type: int | str
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Number of Over Samples: Number of additional random vectors to sample for proper conditioning.
    - Name: n_oversamples
    - Detail: Should only be changed if svd_solver is set to randomized.
    - Type: int
    - Validation Constraints:
      - The input must be greater than 0.
      - This input may be subject to other validation constraints at runtime.
  - Power Iteration Normalizer: Power iteration normalizer for the randomized svd_solver.
    - Name: power_iter_norm
    - Detail: Should only be changed if svd_solver is set to randomized.
    - Type: PowerIterNormEnum_
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Random State: Pass an integer for reproducible results. (Optional)
    - Name: random_state
    - Detail: Should only be changed if svd_solver is set to arpack or randomized.
    - Type: Optional[int]
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
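For illustration only, the sketch below strings together an open-source pipeline comparable to what the Detailed Description outlines (deseasonalize, scale, then PCA) and shows how the advanced parameters above map onto scikit-learn's PCA constructor. The column names, seasonal period, and parameter values are assumptions; this is not the routine's actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical long-format time series: one value per (date, dimension).
dates = pd.date_range("2022-01-01", periods=365, freq="D")
dims = [f"series_{i}" for i in range(6)]
rng = np.random.default_rng(7)
weekly_pattern = 10 * np.sin(2 * np.pi * np.arange(365) / 7)
frames = []
for d in dims:
    values = 100 + weekly_pattern + rng.normal(0, 3, 365)
    frames.append(pd.DataFrame({"date": dates, "dimension": d, "value": values}))
long_df = pd.concat(frames)

# Pivot so each dimension becomes a column (rows are timepoints).
wide = long_df.pivot(index="date", columns="dimension", values="value")

# 1. Deseasonalize each column (additive decomposition, weekly period assumed).
deseasonalized = wide.apply(
    lambda col: col - seasonal_decompose(col, model="additive", period=7).seasonal
)

# 2. Scale / standardize the deseasonalized data.
scaled = StandardScaler().fit_transform(deseasonalized)

# 3. Run PCA; the keyword arguments correspond to the advanced parameters above.
#    n_oversamples and power_iteration_normalizer require scikit-learn >= 1.1.
pca = PCA(
    n_components=3,                      # Number of Principal Components
    whiten=False,                        # Whiten
    svd_solver="randomized",             # Svd Solver
    tol=0.0,                             # Tolerance (used by arpack)
    iterated_power="auto",               # Number of Iterations
    n_oversamples=10,                    # Number of Over Samples
    power_iteration_normalizer="auto",   # Power Iteration Normalizer
    random_state=0,                      # Random State
)
principal_components = pca.fit_transform(scaled)

# Cumulative explained variance can guide the choice within a range of components.
print(np.cumsum(pca.explained_variance_ratio_))
```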
Artifacts:

- Preprocessed Dataframe: The state of the data after preprocessing. This is the data that is input into the PCA model.
  - Qualified Key Annotation: preprocessed_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@preprocessed_data/data_/data_<int>.parquet: A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
- Principal Components Data: Dataset containing the timepoints, the value column, and the principal component columns.
  - Qualified Key Annotation: principal_component_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@principal_component_data/data_/data_<int>.parquet: A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
- Components Dataframe: The directions of maximum variance in the data.
  - Qualified Key Annotation: components_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@components_data/data_/data_<int>.parquet: A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
- PCA Report: A comprehensive PDF report on the dataset, along with the HTML content used to generate the PDF.
  - Qualified Key Annotation: pca_report
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@pca_report/data_/document.pdf: A PDF variant of the HTML file. Note that any interactivity found in the HTML is lost in the PDF variant.
    - artifacts_/@pca_report/data_/html_content.html: The HTML content.
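For illustration, a sketch of consuming the partitioned parquet artifacts described above once they have been copied locally. The local directory path is an assumption; retrieving artifacts from the routine's artifact store is outside the scope of this sketch:

```python
from pathlib import Path

import pandas as pd

# Placeholder: a local copy of the principal_component_data artifact directory.
artifact_dir = Path("artifacts_/@principal_component_data/data_")

# Each data_<int>.parquet file holds at most 1,000,000 rows; concatenate them
# in partition order to rebuild the full principal-components dataset.
parts = sorted(artifact_dir.glob("data_*.parquet"))
principal_components = pd.concat(
    (pd.read_parquet(p) for p in parts), ignore_index=True
)
print(principal_components.shape)
```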
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: PrincipalComponentAnalysisRoutine
| Method Name | Artifact Keys |
|---|---|
| run_pca | preprocessed_data, principal_component_data, components_data, pca_report |