PrincipalComponentAnalysisRoutine
Versions
v1.0.0
Basic Information
Class Name: PrincipalComponentAnalysisRoutine
Title: Principal Component Analysis
Version: 1.0.0
Author: Josh Liu
Organization: OneStream
Creation Date: 2024-07-15
Default Routine Memory Capacity: 2.0 GB
Tags
ML, Time Series, Data Transformation, Dimensionality Reduction, Feature Generation
Description
Short Description
The Principal Component Analysis algorithm for dimensionality reduction.
Long Description
Principal Component Analysis (PCA) is a statistical method used for dimensionality reduction, data compression, and feature extraction. PCA projects the data onto a new set of orthogonal axes, called principal components, ordered by how much of the data's variance they capture. Transforming the data onto the leading principal components reduces its dimensionality while retaining as much of the original variability as possible.
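As a minimal illustration of the transformation described above (not the routine's internal implementation), the equivalent operation in scikit-learn looks like the following; the synthetic data is a placeholder:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic placeholder data: 500 observations of 12 correlated features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 12)) @ rng.normal(size=(12, 12))

# Project the data onto the 3 axes (principal components) that capture
# the greatest variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # shape (500, 3)

# Fraction of the total variance retained by each principal component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```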
Use Cases
1. Anomaly Detection in Fraudulent Transactions
In time series data, anomalies are data points that deviate significantly from the normal pattern of the data. Detecting fraudulent transactions involves identifying unusual patterns in transaction volumes or values. PCA can reduce the dimensionality of the data to the principal components that capture the majority of the variability. After this transformation, data points that are poorly represented in the reduced-dimensional space, for example those with a large reconstruction error, can be flagged as anomalies (see the sketch below).
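A minimal sketch of this anomaly-detection idea, assuming a hypothetical numeric matrix of transaction features; the data, component count, and threshold are placeholders and are not part of the routine:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction features: amount, hour of day, merchant count, ...
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))

X_scaled = StandardScaler().fit_transform(X)

# Keep the components that explain most of the variance.
pca = PCA(n_components=3).fit(X_scaled)
X_proj = pca.transform(X_scaled)

# Reconstruction error: how poorly each point is represented
# in the reduced-dimensional space.
X_recon = pca.inverse_transform(X_proj)
errors = np.linalg.norm(X_scaled - X_recon, axis=1)

# Flag the most poorly reconstructed points as candidate anomalies.
threshold = np.quantile(errors, 0.99)
anomalies = np.where(errors > threshold)[0]
print(anomalies)
```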
2. Forecasting Electricity Demand
Forecasting electricity demand involves data with many features, such as temperature, day of the week, time of day, and historical usage patterns. Each independent variable may contribute to the overall pattern in a different way. PCA identifies the components that explain the majority of the variance in the data, which reduces the complexity of the forecasting model. The principal components derived from PCA can then be used as input features for forecasting models such as ARIMA, neural networks, and other regression models (see the sketch below).
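A minimal sketch of using principal components as forecasting features, with a hypothetical demand dataset and a plain linear regression standing in for the forecasting model; every name and value below is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical hourly demand data with several explanatory features.
rng = np.random.default_rng(1)
n = 24 * 365
hours = np.arange(n)
features = pd.DataFrame({
    "temperature": rng.normal(15.0, 8.0, n),
    "humidity": rng.uniform(20.0, 90.0, n),
    "hour_of_day": hours % 24,
    "day_of_week": (hours // 24) % 7,
    "lagged_demand": rng.normal(1000.0, 200.0, n),
})
demand = 900 + 5 * features["temperature"] + rng.normal(0, 50, n)

# Compress the correlated explanatory features into a few components.
X_scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=3).fit_transform(X_scaled)

# Use the components as regressors in a simple forecasting model.
model = LinearRegression().fit(components, demand)
print(model.score(components, demand))  # in-sample R^2
```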
Routine Methods
1. Run Pca (Method)
- Method: run_pca
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: This method has been tested with datasets of the following sizes.

  | Targets | Rows | Memory Allocated | Completion Time |
  |---|---|---|---|
  | 5K | 550K | 2 GB | 1 minute |
  | 10K | 1.1M | 2 GB | 1 minute |
  | 15K | 7.5M | 5 GB | 1 minute |
  | 40K | 29M | 10 GB | 2 minutes |

- Outputs Dynamic Artifacts: No
- Short Description: Main method for the Principal Component Analysis routine.
- Detailed Description: This method deseasonalizes the data, scales the data, and runs Principal Component Analysis. An illustrative open-source sketch of a comparable pipeline is shown after the Inputs list below.
Inputs:

- Source Data Definition: The source data definition. (Required)
  - Name: source_data_definition
  - Type: Must be an instance of Time Series Source Data
  - Nested Model: Time Series Source Data
  - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Connection: The connection to the source data. (Required)
    - Name: data_connection
    - Type: Must be an instance of Tabular Connection
    - Nested Model: Tabular Connection
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Connection: The connection type to use to access the source data. (Required)
      - Name: tabular_connection
      - Type: Must be one of the following: SQL Server Connection, MetaFileSystem Connection, Partitioned MetaFileSystem Connection
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - SQL Server Connection
        - Database Resource: The name of the database resource to connect to. (Required)
          - Name: database_resource
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - Database Name: The name of the database to connect to.
          - Name: database_name
          - Detail: Note: If you do not see the database name you are looking for in this list, it is recommended that you first move the data into a database that is available in this list.
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - Table Name: The name of the table to use.
          - Name: table_name
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - MetaFileSystem Connection
        - Connection Key: The MetaFileSystem connection key. (Required)
          - Name: connection_key
          - Type: MetaFileSystemConnectionKey
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - File Path: The full file path to the file to ingest.
          - Name: file_path
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
      - Partitioned MetaFileSystem Connection
        - Connection Key: The MetaFileSystem connection key. (Required)
          - Name: connection_key
          - Type: MetaFileSystemConnectionKey
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - File Type: The type of files to read from the directory.
          - Name: file_type
          - Type: FileExtensions_
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
        - Directory Path: The full directory path containing partitioned tabular files.
          - Name: directory_path
          - Type: str
          - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Dimension Columns: The columns to use as dimensions. (Required)
      - Name: dimension_columns
      - Type: list[str]
      - Validation Constraints:
        - The input must have a minimum length of 1.
        - This input may be subject to other validation constraints at runtime.
    - Date Column: The column to use as the date.
      - Name: date_column
      - Detail: The date column must be in a DateTime-readable format.
      - Type: str
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Value Column: The column to use as the value.
      - Name: value_column
      - Detail: The value column must be a numeric (int, float, double, decimal, etc.) column.
      - Type: str
      - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Number of Principal Components: Choose whether to specify a single number of components or a range of values. (Required)
    - Name: n_components_option
    - Detail: 2-3 principal components are recommended if the goal is data visualization. The number of principal components must be less than the number of dimensions in the dataset.
    - Type: Must be one of the following: Single, Range
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
    - Single
      - Single Number of Components: The number of components to decompose the dataset. (Required)
        - Name: n_components
        - Detail: The number of principal components must be less than the total number of dimensions in the data.
        - Type: int
        - Validation Constraints:
          - The input must be greater than 0.
          - This input may be subject to other validation constraints at runtime.
    - Range
      - Lower Bound Number of Components: The lowest number of components to decompose the dataset. (Required)
        - Name: n_components_start
        - Detail: The lower bound number of components must be greater than or equal to 1 and less than the upper bound number of components.
        - Type: int
        - Validation Constraints:
          - The input must be greater than 0.
          - This input may be subject to other validation constraints at runtime.
      - Upper Bound Number of Components: The highest number of components to decompose the dataset.
        - Name: n_components_end
        - Detail: The upper bound number of components must be less than or equal to the number of dimensions in the dataset.
        - Type: int
        - Validation Constraints:
          - The input must be greater than 1.
          - This input may be subject to other validation constraints at runtime.
  - Data Stationarity: Specify if the time series data is considered stationary.
    - Name: stationarity_data_state
    - Detail: A time series dataset is considered stationary if its statistical properties, such as the mean or variance, do not change over time; the statistical properties of a non-stationary time series change over time.
    - Type: bool
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Perform Seasonal Decomposition: Specify if the time series data should be seasonally decomposed.
    - Name: seasonal_decomposition
    - Detail: Seasonal decomposition stationarizes the dataset by removing its seasonal component (deseasonalizing). Many statistical and machine learning models assume the data is stationary.
    - Type: bool
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Data Scaling Method: Specify the type of scaling or standardization used to transform the dataset.
    - Name: scale_standardize_data
    - Detail: Data scaling/standardization is recommended when the differences in scale between dimension values are large.
    - Type: ScaleType_
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
- advanced_pca_parameters: Specify optional advanced PCA settings. (Optional)
  - Name: set_advanced_settings
  - Detail: If set to False, the advanced settings use their default values. For more information on the advanced PCA settings, see the scikit-learn PCA documentation. A sketch using these parameters appears after this Inputs list.
  - Type: Must be an instance of PCA Advanced Parameters
  - Nested Model: PCA Advanced Parameters
  - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Whiten: Whitening removes some information from the transformed signal but can improve the predictive accuracy of downstream estimators by making their data respect some hard-wired assumptions. (Required)
    - Name: whiten
    - Detail: Multiplies the component vectors by the square root of n_samples and then divides by the singular values.
    - Type: bool
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Svd Solver: Select the algorithm that will be used to perform singular value decomposition.
    - Name: svd_solver
    - Detail: For more information on svd_solver, see the scikit-learn PCA documentation.
    - Type: SvdSolverEnum_
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Tolerance: Tolerance for singular values computed by svd_solver.
    - Name: tol
    - Detail: Should only be changed if svd_solver is set to arpack.
    - Type: float
    - Validation Constraints:
      - The input must be greater than or equal to 0.
      - This input may be subject to other validation constraints at runtime.
  - Number of Iterations: Number of iterations for the power method computed by svd_solver. Input must be a positive integer or "auto".
    - Name: iterated_power
    - Detail: Should only be changed if svd_solver is set to randomized.
    - Type: int | str
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Number of Over Samples: Number of additional random vectors to sample for proper conditioning.
    - Name: n_oversamples
    - Detail: Should only be changed if svd_solver is set to randomized.
    - Type: int
    - Validation Constraints:
      - The input must be greater than 0.
      - This input may be subject to other validation constraints at runtime.
  - Power Iteration Normalizer: Power iteration normalizer for the randomized svd_solver.
    - Name: power_iter_norm
    - Detail: Should only be changed if svd_solver is set to randomized.
    - Type: PowerIterNormEnum_
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
  - Random State: Pass an integer for reproducible results. (Optional)
    - Name: random_state
    - Detail: Should only be changed if svd_solver is set to arpack or randomized.
    - Type: Optional[int]
    - Validation Constraints: This input may be subject to other validation constraints at runtime.
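For illustration only, the sketch below strings together an open-source pipeline comparable to what the Detailed Description outlines (deseasonalize, scale, then PCA) and shows how the advanced parameters above map onto scikit-learn's PCA constructor. The column names, seasonal period, and parameter values are assumptions; this is not the routine's actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical long-format time series: one value per (date, dimension).
dates = pd.date_range("2022-01-01", periods=365, freq="D")
dims = [f"series_{i}" for i in range(6)]
rng = np.random.default_rng(7)
weekly_pattern = 10 * np.sin(2 * np.pi * np.arange(365) / 7)
frames = []
for d in dims:
    values = 100 + weekly_pattern + rng.normal(0, 3, 365)
    frames.append(pd.DataFrame({"date": dates, "dimension": d, "value": values}))
long_df = pd.concat(frames)

# Pivot so each dimension becomes a column (rows are timepoints).
wide = long_df.pivot(index="date", columns="dimension", values="value")

# 1. Deseasonalize each column (additive decomposition, weekly period assumed).
deseasonalized = wide.apply(
    lambda col: col - seasonal_decompose(col, model="additive", period=7).seasonal
)

# 2. Scale / standardize the deseasonalized data.
scaled = StandardScaler().fit_transform(deseasonalized)

# 3. Run PCA; the keyword arguments correspond to the advanced parameters above.
#    n_oversamples and power_iteration_normalizer require scikit-learn >= 1.1.
pca = PCA(
    n_components=3,                      # Number of Principal Components
    whiten=False,                        # Whiten
    svd_solver="randomized",             # Svd Solver
    tol=0.0,                             # Tolerance (used by arpack)
    iterated_power="auto",               # Number of Iterations
    n_oversamples=10,                    # Number of Over Samples
    power_iteration_normalizer="auto",   # Power Iteration Normalizer
    random_state=0,                      # Random State
)
principal_components = pca.fit_transform(scaled)

# Cumulative explained variance can guide the choice within a range of components.
print(np.cumsum(pca.explained_variance_ratio_))
```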
Artifacts:

- Preprocessed Dataframe: The state of the data after preprocessing. This is the data that is input into the PCA model.
  - Qualified Key Annotation: preprocessed_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@preprocessed_data/data_/data_<int>.parquet: A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
- Principal Components Data: Dataset containing the timepoints, the value column, and the principal component columns.
  - Qualified Key Annotation: principal_component_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@principal_component_data/data_/data_<int>.parquet: A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
- Components Dataframe: The directions of maximum variance in the data.
  - Qualified Key Annotation: components_data
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@components_data/data_/data_<int>.parquet: A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
- PCA Report: A comprehensive PDF report on the dataset, along with the HTML content used to generate the PDF.
  - Qualified Key Annotation: pca_report
  - Aggregate Artifact: False
  - In-Memory Json Accessible: False
  - File Annotations:
    - artifacts_/@pca_report/data_/document.pdf: A PDF variant of the HTML file. Note that any interactivity found in the HTML is lost in the PDF variant.
    - artifacts_/@pca_report/data_/html_content.html: The HTML content.
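For illustration, a sketch of consuming the partitioned parquet artifacts described above once they have been copied locally. The local directory path is an assumption; retrieving artifacts from the routine's artifact store is outside the scope of this sketch:

```python
from pathlib import Path

import pandas as pd

# Placeholder: a local copy of the principal_component_data artifact directory.
artifact_dir = Path("artifacts_/@principal_component_data/data_")

# Each data_<int>.parquet file holds at most 1,000,000 rows; concatenate them
# in partition order to rebuild the full principal-components dataset.
parts = sorted(artifact_dir.glob("data_*.parquet"))
principal_components = pd.concat(
    (pd.read_parquet(p) for p in parts), ignore_index=True
)
print(principal_components.shape)
```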
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: PrincipalComponentAnalysisRoutine
| Method Name | Artifact Keys |
|---|---|
| run_pca | preprocessed_data, principal_component_data, components_data, pca_report |