TextEmbeddings
Versions
v1.0.0
Basic Information
Class Name: TextEmbeddings
Title: Text Embeddings
Version: 1.0.0
Author: Conor Hogan
Organization: OneStream
Creation Date: 2024-12-18
Default Routine Memory Capacity: 2 GB
Tags
Embeddings, Text Classification, Classification, LLM, Natural Language Processing, Pattern Recognition, Reinforcement Learning
Description
Short Description
Embed strings into a vector format that can be used for cosine similarity analysis.
Long Description
This prebuilt routine vectorizes the provided strings so that they can be used for cosine similarity analysis. The routine accepts strings as a single input, as a delimited list (the Text to Embed input expects double pipe-delimited text), or as a column of tabular data. You can also run a cosine similarity analysis between a column of vectorized strings and an input value.
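For readers unfamiliar with the metric, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch of that standard formula, not the routine's internal code; the three-dimensional vectors are hypothetical stand-ins for real embeddings, which have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
invoice = [0.9, 0.1, 0.0]
receipt = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(invoice, receipt))  # close to 1: similar direction
print(cosine_similarity(invoice, weather))  # close to 0: nearly unrelated
```

Scores near 1 indicate vectors pointing in almost the same direction, which is what "most similar" means throughout this page.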
Use Cases
1. LLM Preparation
A developer wants to build a text repository for an LLM use case in which users ask for the information most similar to their request. Representing text as embeddings makes it possible to measure the similarity between stored texts, and between stored text and future text. The developer can also group similar texts together in the repository, which makes data retrieval more efficient. This is particularly useful for large datasets: relevant information can be retrieved quickly while users supply only natural language requests.
2. Sentiment Analysis
A developer wants to build a sentiment analysis model that ingests user reviews and summarizes the sentiment they contain. The text embedding routine can support analysis of the sentiment behind user reviews, social media posts, or customer feedback from any source of relevant text. Once this text is converted into embeddings, models can be trained to recognize the underlying emotions, classify them as positive, negative, or neutral, and even provide a natural language summary of that sentiment. This helps a developer gauge customer satisfaction and address issues promptly.
3. Similarity Scoring for Document Retrieval
A developer wants to build an in-house repository of sensitive documents from business departments such as HR, legal, and technical training. Documents can be embedded using the first of these routines (Embed) and stored in a database for later retrieval. Cosine similarity scoring can then be used to retrieve the documents most similar to a user-supplied prompt. This provides many of the benefits of an LLM while avoiding data leakage, because all proprietary documentation stays in-house and secure.
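The retrieval use cases above compare one query against many stored vectors. One common way to make such repeated lookups cheaper is to scale every stored vector to unit length once, so that each later cosine similarity reduces to a plain dot product. This is a general technique and an assumption here, not something the routine is documented to do; the store contents and names below are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so later comparisons need only a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical document store: names and toy vectors are illustrative only.
raw_store = {
    "hr_policy": [3.0, 4.0, 0.0],
    "legal_contract": [0.0, 1.0, 1.0],
}
# Normalize once, at storage time.
unit_store = {name: normalize(vec) for name, vec in raw_store.items()}

# At query time, normalize the query and take dot products.
query = normalize([6.0, 8.0, 0.0])
scores = {name: dot(query, vec) for name, vec in unit_store.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because both sides have length 1, the dot product equals the cosine similarity, and the square roots are computed once per document rather than once per comparison.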
Routine Methods
1. Embed (Method)
- Method: embed
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: Runtime and memory use are driven primarily by the number of embeddings created, which equals the number of rows when the input is a dataset. For a large dataset of 3M rows with several sentences per row in the text column, expect this method to take around 4-5 hours to complete. The method is memory intensive: that 3M-row example failed due to memory constraints when 50GB of memory was allocated, but completed when 115GB was allocated. For a smaller dataset of 20K rows, the method completes in about 2 minutes with 5GB of memory allocated.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Run an embed routine to vectorize a user-provided list of strings.
- Detailed Description:
  - Provide the string(s) to embed in vector format. These pieces of text are passed through a natural language processing model, which puts them in a format that allows them to be compared mathematically. The vector for a piece of text is intended to represent its definition, connotation, common usages, and so on, so comparisons between these vectors can be used to find the most similar piece of text.
- Inputs:
  - Required Input
    - Input Type: Specify how you would like to provide the strings to embed.
      - Name: input_method
      - Tooltip:
        - Detail:
          - Click the dropdown arrow to select an input method.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be one of the following
        - Tabular Data Reference
          - Required Input
            - Source Connection: The connection information for the source data.
              - Name: data_connection
              - Tooltip:
                - Detail:
                  - Click on the drop-down box to specify your dataset source.
                - Validation Constraints:
                  - This input may be subject to other validation constraints at runtime.
              - Type: Must be an instance of Tabular Connection
              - Nested Model: Tabular Connection
                - Required Input
                  - Connection: The connection type to use to access the source data.
                    - Name: tabular_connection
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: Must be one of the following
                      - SQL Server Connection
                        - Required Input
                          - Database Resource: The name of the database resource to connect to.
                            - Name: database_resource
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: str
                          - Database Name: The name of the database to connect to.
                            - Name: database_name
                            - Tooltip:
                              - Detail:
                                - Note: If the database name you are looking for is not in this list, first move the data into a database that is available in this list.
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: str
                          - Table Name: The name of the table to use.
                            - Name: table_name
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: str
                      - MetaFileSystem Connection
                        - Required Input
                          - Connection Key: The MetaFileSystem connection key.
                            - Name: connection_key
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: MetaFileSystemConnectionKey
                          - File Path: The full file path to the file to ingest.
                            - Name: file_path
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: str
                      - Partitioned MetaFileSystem Connection
                        - Required Input
                          - Connection Key: The MetaFileSystem connection key.
                            - Name: connection_key
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: MetaFileSystemConnectionKey
                          - File Type: The type of files to read from the directory.
                            - Name: file_type
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: FileExtensions_
                          - Directory Path: The full directory path containing partitioned tabular files.
                            - Name: directory_path
                            - Tooltip:
                              - Validation Constraints:
                                - This input may be subject to other validation constraints at runtime.
                            - Type: str
            - Column to Embed: Specify the column that contains the strings you wish to embed.
              - Name: column
              - Tooltip:
                - Detail:
                  - Click on the combo box to select a column you want to use for embedding.
                - Validation Constraints:
                  - This input may be subject to other validation constraints at runtime.
              - Type: str
            - Output Specifications: Specify whether to include all of the input table's original columns in your output, or only the source string(s) and vectorized output.
              - Name: output_choice
              - Tooltip:
                - Detail:
                  - Select the format in which you would like your output.
                - Validation Constraints:
                  - This input may be subject to other validation constraints at runtime.
              - Type: TableOutputType_
        - Input String(s)
          - Required Input
            - Text to Embed: Input text to embed in double pipe-delimited format.
              - Name: input_text
              - Tooltip:
                - Validation Constraints:
                  - This input may be subject to other validation constraints at runtime.
              - Type: str
- Artifacts:
  - Strings and Embedded Vectors: Parquet file containing the requested data table, which holds both the strings that were embedded and the resulting embeddings.
    - Qualified Key Annotation: embedded_text
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@embedded_text/data_/data_<int>.parquet
        - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
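The Text to Embed input above takes several strings in a single double pipe-delimited value. The sketch below shows one way such a value could be assembled and split, assuming the delimiter is the literal two-character sequence `||`. The helper names are hypothetical, and the whitespace trimming and dropping of empty entries are assumptions; the routine's own parsing rules are not documented here.

```python
DELIMITER = "||"  # assumption: "double pipe" means two consecutive pipe characters

def join_double_pipe(strings):
    """Build a single input value from a list of strings."""
    for s in strings:
        if DELIMITER in s:
            # A string containing the delimiter would be split apart later.
            raise ValueError(f"string contains the delimiter: {s!r}")
    return DELIMITER.join(strings)

def split_double_pipe(input_text):
    """Split text on '||', trimming whitespace and dropping empty entries."""
    parts = (piece.strip() for piece in input_text.split(DELIMITER))
    return [piece for piece in parts if piece]

value = join_double_pipe(["late delivery", "great support", "refund requested"])
print(value)
print(split_double_pipe("late delivery || great support||refund requested"))
```

Checking for the delimiter before joining avoids silently producing more strings than intended when a source string itself contains `||`.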
2. String Matching (Method)
- Method: string_matching
- Type: Method
- Memory Capacity: 2.0 GB
- Allow In-Memory Execution: No
- Read Only: Yes
- Method Limits: For a dataset with 20K rows, this method completes in about 2 minutes with 5GB of memory allocated. For a dataset with 300K rows, expect it to take around 30 minutes with 50GB of memory allocated. Significantly larger datasets are prone to timeouts, so break very large datasets into smaller chunks when performing string matching comparisons.
- Outputs Dynamic Artifacts: No
- Short Description:
  - Run a string_matching routine to return the top N closest matches by cosine similarity.
- Detailed Description:
  - The inputs are a user-provided string and a column of a user-provided tabular dataset that contains embedded string vector data. The new string is embedded into vector format, and that vector is compared to all of the provided vector data using cosine similarity scoring. The routine outputs the starting data table with similarity scores added, ordered by score and trimmed to the top N most similar strings.
- Inputs:
  - Required Input
    - Source Connection: The connection information for the source data.
      - Name: data_connection
      - Tooltip:
        - Detail:
          - Click on the drop-down box to specify your dataset source.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: Must be an instance of Tabular Connection
      - Nested Model: Tabular Connection
        - Required Input
          - Connection: The connection type to use to access the source data.
            - Name: tabular_connection
            - Tooltip:
              - Validation Constraints:
                - This input may be subject to other validation constraints at runtime.
            - Type: Must be one of the following
              - SQL Server Connection
                - Required Input
                  - Database Resource: The name of the database resource to connect to.
                    - Name: database_resource
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Database Name: The name of the database to connect to.
                    - Name: database_name
                    - Tooltip:
                      - Detail:
                        - Note: If the database name you are looking for is not in this list, first move the data into a database that is available in this list.
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
                  - Table Name: The name of the table to use.
                    - Name: table_name
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
              - MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Path: The full file path to the file to ingest.
                    - Name: file_path
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
              - Partitioned MetaFileSystem Connection
                - Required Input
                  - Connection Key: The MetaFileSystem connection key.
                    - Name: connection_key
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: MetaFileSystemConnectionKey
                  - File Type: The type of files to read from the directory.
                    - Name: file_type
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: FileExtensions_
                  - Directory Path: The full directory path containing partitioned tabular files.
                    - Name: directory_path
                    - Tooltip:
                      - Validation Constraints:
                        - This input may be subject to other validation constraints at runtime.
                    - Type: str
    - Column with Embeddings: Specify the column that contains the embeddings you wish to compare to.
      - Name: column
      - Tooltip:
        - Detail:
          - Click on the combo box to select a column you want to use for embedding.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: str
    - String to Compare: Input the text you'd like to compare to the provided embeddings.
      - Name: string_to_compare
      - Tooltip:
        - Detail:
          - Click on the text box to input the text.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: str
    - Number of Top Similarities to Return: Select the number of results you'd like returned by the routine (sorted by similarity scores).
      - Name: num_similarities_to_return
      - Tooltip:
        - Detail:
          - Click on the combo box to select the number of results you'd like returned.
        - Validation Constraints:
          - This input may be subject to other validation constraints at runtime.
      - Type: int
- Artifacts:
  - Strings and Embedded Vectors: Parquet file containing the requested data table, which holds both the strings that were embedded and the resulting embeddings.
    - Qualified Key Annotation: embedded_text
    - Aggregate Artifact: False
    - In-Memory Json Accessible: False
    - File Annotations:
      - artifacts_/@embedded_text/data_/data_<int>.parquet
        - A partitioned set of parquet files where each file will have no more than 1,000,000 rows.
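The String Matching description (score every stored vector against the query, then keep the top N) and the advice to split very large datasets into chunks can be sketched together. The code below is an illustration under stated assumptions, not the routine's implementation: the query is passed in as an already-embedded vector, the rows are hypothetical `(text, vector)` pairs with toy two-dimensional vectors, and all function names are invented for this example.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def top_n_matches(query_vec, rows, n):
    """Score (text, vector) rows against query_vec and keep the n best."""
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in rows)
    return heapq.nlargest(n, scored)  # sorted from highest to lowest score

def top_n_over_chunks(query_vec, chunks, n):
    """Run top_n_matches on each chunk, then merge the per-chunk winners."""
    winners = []
    for chunk in chunks:
        winners.extend(top_n_matches(query_vec, chunk, n))
    return heapq.nlargest(n, winners)

# Hypothetical rows: toy 2-dimensional vectors stand in for real embeddings.
rows = [
    ("reset my password", [0.9, 0.1]),
    ("forgot login details", [0.8, 0.3]),
    ("quarterly revenue report", [0.1, 0.9]),
    ("annual budget forecast", [0.2, 0.8]),
]
query_vec = [1.0, 0.0]
chunks = [rows[:2], rows[2:]]

for score, text in top_n_over_chunks(query_vec, chunks, n=2):
    print(f"{score:.3f}  {text}")
```

Keeping the top N from each chunk and then taking the top N of those winners gives the same result as scoring the whole dataset at once, because any row in the global top N must also be in the top N of its own chunk.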
Interface Definitions
No interface definitions found for this routine
Developer Docs
Routine Typename: TextEmbeddings
| Method Name | Artifact Keys |
|---|---|
| embed | embedded_text |
| string_matching | embedded_text |
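Both methods write to the `embedded_text` artifact key, whose File Annotations describe a partitioned set of parquet files capped at 1,000,000 rows each. The sketch below works out the smallest number of files a given row count requires and lists matching paths from the documented pattern. The helper names are hypothetical, and starting the `<int>` index at 0 is an assumption; the routine may number or fill its partitions differently.

```python
import math

MAX_ROWS_PER_FILE = 1_000_000  # per-file limit stated in the File Annotations
PATH_TEMPLATE = "artifacts_/@embedded_text/data_/data_{index}.parquet"

def min_partition_count(total_rows):
    """Smallest number of files that can hold total_rows under the per-file limit."""
    return math.ceil(total_rows / MAX_ROWS_PER_FILE)

def candidate_paths(total_rows, first_index=0):
    """List paths for the minimum file count; first_index=0 is an assumption."""
    count = min_partition_count(total_rows)
    return [PATH_TEMPLATE.format(index=first_index + i) for i in range(count)]

print(min_partition_count(20_000))     # the 20K-row example fits in one file
print(min_partition_count(3_000_000))  # the 3M-row example needs at least three
for path in candidate_paths(2_500_000):
    print(path)
```

The count is a lower bound: the annotation only guarantees that no file exceeds the limit, so an actual run could produce more, smaller files.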