TextEmbeddings

Versions

v1.0.0

Basic Information

Class Name: TextEmbeddings

Title: Text Embeddings

Version: 1.0.0

Author: Conor Hogan

Organization: OneStream

Creation Date: 2024-12-18

Default Routine Memory Capacity: 2 GB

Description

Short Description

Embed strings into a vector format that can be used for cosine similarity analysis.

Long Description

This prebuilt routine vectorizes the strings you provide so that they can be used for cosine similarity analysis. The routine accepts strings as a single value, as a delimited list (the Text to Embed input specifies double pipe-delimited format), or as tabular data. You can also run a cosine similarity analysis between a column of previously embedded strings and a new input string.
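
The cosine similarity analysis mentioned above can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats; the function name `cosine_similarity` and the toy vectors are illustrative only and are not part of the routine's interface.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumption); real embeddings have many more dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 and unrelated (orthogonal) vectors score close to 0, which is why the score can serve as a ranking signal for text similarity.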

Use Cases

1. LLM Preparation

A developer wishes to create a repository of text for an LLM use case in which a user asks for the stored information most similar to a request. Representing text as embeddings makes it possible to measure the similarity between existing texts, and between existing and future texts. The developer could also group similar texts together in the repository, which would make data retrieval more efficient. This is particularly useful for large datasets: relevant information can be retrieved quickly while users supply only natural language requests.
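
As a rough illustration of grouping similar texts, the sketch below applies a simple greedy rule: each text joins the first group whose first member it resembles closely enough. The greedy rule, the `group_by_similarity` helper, the threshold value, and the toy vectors are all assumptions made for illustration; the routine itself does not perform grouping.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(
    vectors: Dict[str, Sequence[float]], threshold: float
) -> List[List[str]]:
    """Greedily group texts; the first member of a group is its representative."""
    groups: List[List[str]] = []
    for text, vec in vectors.items():
        for group in groups:
            representative = vectors[group[0]]
            if cosine_similarity(vec, representative) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([text])
    return groups


# Hypothetical embeddings; real vectors would come from the embed method's output.
toy_embeddings = {
    "invoice overdue": [0.9, 0.1, 0.0],
    "payment is late": [0.8, 0.2, 0.1],
    "holiday schedule": [0.0, 0.1, 0.9],
}
groups = group_by_similarity(toy_embeddings, threshold=0.9)
```

With these toy vectors the two payment-related texts end up in one group and the unrelated text in another. A production grouping step would more likely use a dedicated clustering method.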

2. Sentiment Analysis

A developer wishes to build a sentiment analysis model that ingests user reviews and summarizes the sentiment they contain. The text embedding routine can be used as a first step in analyzing the sentiment behind user reviews, social media posts, or customer feedback from any source that contains relevant text. Once this text is converted into embeddings, models can be trained to recognize the underlying emotions, classify them as positive, negative, or neutral, and even provide a natural language summary of that sentiment. This helps a developer gauge customer satisfaction and address issues promptly.

3. Similarity Scoring for Document Retrieval

A developer wishes to build an in-house repository of sensitive documents from a variety of business departments such as HR, legal, and technical training. Documents could be embedded using the embed method and stored in a database for later retrieval. The string matching method's cosine similarity scoring could then be used to retrieve the documents most similar to a user-supplied prompt. This would provide many of the benefits of an LLM while reducing the risk of data leakage, because all proprietary documentation stays in-house and secure.

Routine Methods

1. Embed (Method)

Method: embed
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: Run time and memory use depend heavily on the number of embeddings created, which equals the number of rows when the input is a dataset. For a large dataset of 3M rows whose text column holds several sentences per row, expect this method to take around 4-5 hours to complete. The method is memory intensive: that 3M-row example failed due to memory constraints with 50 GB of memory allocated, but completed with 115 GB allocated. A smaller dataset of 20K rows completes in about 2 minutes with 5 GB of memory allocated.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Run an embed routine to vectorize a user-provided list of strings.
- Detailed Description:
  - Provide the string(s) to be embedded in a vector format. The text is passed through a natural language processing model, which puts the strings in a format that allows them to be compared mathematically. The vector for a piece of text is intended to represent its definition, connotation, common usages, etc. Comparisons between these vectors can therefore be used to find the most similar pieces of text.
- Inputs:
  - Required Input
    - Input Type: Specify how you would like to provide the strings to embed.
      - Name: input_method
      - Tooltip:
        
        Detail:
        
        Click the dropdown arrow to select an input method.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be one of the following
        
        Tabular Data Reference
        
        Required Input
        
        Source Connection: The connection information for the source data.
        
        Name: data_connection
        
        Tooltip:
        
        Detail:
        
        Click on the drop-down box to specify your dataset source.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be an instance of Tabular Connection
        
        Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If you don’t see the database name you are looking for in this list, first move the data into a database that does appear in the list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Column to Embed: Specify the column that contains the strings you wish to embed.
        
        Name: column
        
        Tooltip:
        
        Detail:
        
        Click on the combo box to select a column you want to use for embedding.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Output Specifications: Specify if you would like to include all of the input table's original columns in your output, or only the source string(s) and vectorized output.
        
        Name: output_choice
        
        Tooltip:
        
        Detail:
        
        Select the format in which you would like your output.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: TableOutputType_
        
        Input String(s)
        
        Required Input
        
        Text to Embed: Input text to embed in double pipe-delimited format.
        
        Name: input_text
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
- Artifacts:
  - Strings and Embedded Vectors: Parquet file containing the requested data table which represents both the strings that were embedded as well as the resulting embedding.
    - Qualified Key Annotation: embedded_text
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@embedded_text/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
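
The input and output shapes described for this method can be sketched as follows. This is an illustration under stated assumptions, not the routine's implementation: `split_input_text` and `partition_row_counts` are hypothetical helpers, trimming whitespace and dropping empty items is an assumption, and filling each file to the limit before starting the next is only one way to satisfy the stated 1,000,000-row cap.

```python
from typing import List

MAX_ROWS_PER_FILE = 1_000_000  # upper bound stated in the file annotation


def split_input_text(input_text: str, delimiter: str = "||") -> List[str]:
    """Split a double pipe-delimited string into the individual strings to embed."""
    parts = (part.strip() for part in input_text.split(delimiter))
    # Assumption: surrounding whitespace and empty items are not meaningful.
    return [part for part in parts if part]


def partition_row_counts(
    total_rows: int, max_rows: int = MAX_ROWS_PER_FILE
) -> List[int]:
    """Return one row count per output file so that no file exceeds max_rows."""
    if total_rows < 0 or max_rows <= 0:
        raise ValueError("total_rows must be >= 0 and max_rows must be > 0")
    full_files, remainder = divmod(total_rows, max_rows)
    return [max_rows] * full_files + ([remainder] if remainder else [])


strings = split_input_text("net revenue || operating expense||cash flow")
counts = partition_row_counts(3_250_000)
```

Under these assumptions, a 3.25M-row result would span four parquet partitions, which is useful to keep in mind when reading the artifact back.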

2. String Matching (Method)

Method: string_matching
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: For a dataset with 20K rows, this method completes in about 2 minutes with 5 GB of memory allocated. For a dataset with 300K rows, expect it to take around 30 minutes with 50 GB of memory allocated. Significantly larger datasets are prone to timeouts, so break very large datasets into smaller chunks before running string matching comparisons.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Run a string_matching routine to provide the top N closest matches by cosine similarity.
- Detailed Description:
  - Inputs will be the user-provided string and the user-provided tabular dataset column which contains embedded string vector data. The new string will be embedded into a vector format. That vector will then be compared to all the provided vector data using cosine similarity scoring. The routine will then output the starting data table with similarity scores added, trimmed and ordered to just the top N most similar strings.
- Inputs:
  - Required Input
    - Source Connection: The connection information for the source data.
      - Name: data_connection
      - Tooltip:
        
        Detail:
        
        Click on the drop-down box to specify your dataset source.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        
        Required Input
        
        Connection: The connection type to use to access the source data.
        
        Name: tabular_connection
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: Must be one of the following
        
        SQL Server Connection
        
        Required Input
        
        Database Resource: The name of the database resource to connect to.
        
        Name: database_resource
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Database Name: The name of the database to connect to.
        
        Name: database_name
        
        Tooltip:
        
        Detail:
        
        Note: If you don’t see the database name you are looking for in this list, first move the data into a database that does appear in the list.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Table Name: The name of the table to use.
        
        Name: table_name
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Path: The full file path to the file to ingest.
        
        Name: file_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
        
        Partitioned MetaFileSystem Connection
        
        Required Input
        
        Connection Key: The MetaFileSystem connection key.
        
        Name: connection_key
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: MetaFileSystemConnectionKey
        
        File Type: The type of files to read from the directory.
        
        Name: file_type
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: FileExtensions_
        
        Directory Path: The full directory path containing partitioned tabular files.
        
        Name: directory_path
        
        Tooltip:
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
        
        Type: str
    - Column with Embeddings: Specify the column that contains the embeddings you wish to compare to.
      - Name: column
      - Tooltip:
        
        Detail:
        
        Click on the combo box to select the column that contains the embeddings to compare against.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: str
    - String to Compare: Input the text you'd like to compare to the provided embeddings.
      - Name: string_to_compare
      - Tooltip:
        
        Detail:
        
        Click on the text box to input the text.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: str
    - Number of Top Similarities to Return: Select the number of results you'd like returned by the routine (sorted by similarity scores).
      - Name: num_similarities_to_return
      - Tooltip:
        
        Detail:
        
        Click on the combo box to select the number of results you'd like returned.
        
        Validation Constraints:
        
        This input may be subject to other validation constraints at runtime.
      - Type: int
- Artifacts:
  - Strings and Embedded Vectors: Parquet file containing the requested data table which represents both the strings that were embedded as well as the resulting embedding.
    - Qualified Key Annotation: embedded_text
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@embedded_text/data_/data_<int>.parquet
        
        A partitioned set of parquet files where each file will have no more than 1000000 rows.
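
The top N scoring described for this method, together with the advice to break very large datasets into chunks, can be sketched as below. This is a minimal illustration, not the routine's implementation: it assumes the query string has already been embedded, that stored embeddings are plain lists of floats, and that `top_n_matches` and `top_n_chunked` are hypothetical helper names.

```python
import heapq
import math
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[str, Sequence[float]]  # (source string, stored embedding)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_matches(
    query_vec: Sequence[float], rows: Iterable[Row], n: int
) -> List[Tuple[float, str]]:
    """Score every row against the query and keep the n highest scores."""
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in rows)
    return heapq.nlargest(n, scored)


def top_n_chunked(
    query_vec: Sequence[float], chunks: Iterable[List[Row]], n: int
) -> List[Tuple[float, str]]:
    """Process one chunk at a time, carrying forward only the best n results."""
    best: List[Tuple[float, str]] = []
    for chunk in chunks:
        best = heapq.nlargest(n, best + top_n_matches(query_vec, chunk, n))
    return best


# Toy 2-dimensional data (assumption); the query vector stands in for an embedded string.
rows: List[Row] = [
    ("cash flow", [1.0, 0.0]),
    ("net income", [0.8, 0.6]),
    ("headcount", [0.0, 1.0]),
    ("revenue", [0.6, 0.8]),
]
query_vec = [1.0, 0.0]
single_pass = top_n_matches(query_vec, rows, n=2)
chunked = top_n_chunked(query_vec, [rows[:2], rows[2:]], n=2)
```

Because only the best n results survive each chunk, the chunked pass returns the same matches as a single pass while holding just one chunk in memory at a time.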

Interface Definitions

No interface definitions found for this routine

Developer Docs

Routine Typename: TextEmbeddings

Method Name	Artifact Keys
`embed`	embedded_text
`string_matching`	embedded_text
