How Linghui Luo’s research helps ensure code is checked and ready to deploy.
A quick guide to Amazon’s papers at Interspeech 2023
Speech recognition predominates, but Amazon’s research takes in data representation, dialogue management, question answering, and more.
Build ML features at scale with Amazon SageMaker Feature Store using data from Amazon Redshift
Amazon Redshift is the most popular cloud data warehouse, used by tens of thousands of customers to analyze exabytes of data every day. Many practitioners extend these Redshift datasets at scale for machine learning (ML) using Amazon SageMaker, a fully managed ML service. Typical requirements are to develop features offline, either with code or in a low-code/no-code way, to store feature data derived from Amazon Redshift, and to do all of this at scale in a production environment.
In this post, we show you three options to prepare Redshift source data at scale in SageMaker, including loading data from Amazon Redshift, performing feature engineering, and ingesting features into Amazon SageMaker Feature Store:
- Option A – Use an AWS Glue interactive session on Amazon SageMaker Studio (in a dev environment) and an AWS Glue job (in a prod environment) with Spark
- Option B – Use an Amazon SageMaker Processing job with a Redshift dataset definition, or use SageMaker Feature Processing in SageMaker Feature Store, which runs SageMaker training jobs
- Option C – Use Amazon SageMaker Data Wrangler in a low-code/no-code way
If you’re an AWS Glue user and would like to do the process interactively, consider option A. If you’re familiar with SageMaker and writing Spark code, option B could be your choice. If you want to do the process in a low-code/no-code way, you can follow option C.
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price-performance at any scale.
SageMaker Studio is the first fully integrated development environment (IDE) for ML. It provides a single web-based visual interface where you can perform all ML development steps, including preparing data and building, training, and deploying models.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue enables you to seamlessly collect, transform, cleanse, and prepare data for storage in your data lakes and data pipelines using a variety of capabilities, including built-in transforms.
Solution overview
The following diagram illustrates the solution architecture for each option.
Prerequisites
To continue with the examples in this post, you need to create the required AWS resources. To do this, we provide an AWS CloudFormation template to create a stack that contains the resources. When you create the stack, AWS creates a number of resources in your account:
- A SageMaker domain, which includes an associated Amazon Elastic File System (Amazon EFS) volume
- A list of authorized users and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations
- A Redshift cluster
- A Redshift secret
- An AWS Glue connection for Amazon Redshift
- An AWS Lambda function to set up required resources, execution roles and policies
Make sure that you don’t already have two SageMaker Studio domains in the Region where you’re running the CloudFormation template. This is the maximum allowed number of domains in each supported Region.
Deploy the CloudFormation template
Complete the following steps to deploy the CloudFormation template:
- Save the CloudFormation template sm-redshift-demo-vpc-cfn-v1.yaml locally.
- On the AWS CloudFormation console, choose Create stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template file.
- Choose Choose File and navigate to the location on your computer where the CloudFormation template was downloaded and choose the file.
- Enter a stack name, such as Demo-Redshift.
- On the Configure stack options page, leave everything as default and choose Next.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Create stack.
You should see a new CloudFormation stack with the name Demo-Redshift being created. Wait for the status of the stack to be CREATE_COMPLETE (approximately 7 minutes) before moving on. You can navigate to the stack’s Resources tab to check what AWS resources were created.
Launch SageMaker Studio
Complete the following steps to launch your SageMaker Studio domain:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose the domain you created as part of the CloudFormation stack (SageMakerDemoDomain).
- Choose Launch and Studio.
This page can take 1–2 minutes to load when you access SageMaker Studio for the first time, after which you’ll be redirected to a Home tab.
Download the GitHub repository
Complete the following steps to download the GitHub repo:
- In the SageMaker notebook, on the File menu, choose New and Terminal.
- In the terminal, enter the following command:
You can now see the amazon-sagemaker-featurestore-redshift-integration folder in the navigation pane of SageMaker Studio.
Set up batch ingestion with the Spark connector
Complete the following steps to set up batch ingestion:
- In SageMaker Studio, open the notebook 1-uploadJar.ipynb under amazon-sagemaker-featurestore-redshift-integration.
- If you are prompted to choose a kernel, choose Data Science as the image and Python 3 as the kernel, then choose Select.
- For the following notebooks, choose the same image and kernel except the AWS Glue Interactive Sessions notebook (4a).
- Run the cells by pressing Shift+Enter in each of the cells.
While the code runs, an asterisk (*) appears between the square brackets. When the code has finished running, the * is replaced with a number. The same applies to all other notebooks.
Set up the schema and load data to Amazon Redshift
The next step is to set up the schema and load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift. To do so, run the notebook 2-loadredshiftdata.ipynb.
Create feature stores in SageMaker Feature Store
To create your feature stores, run the notebook 3-createFeatureStore.ipynb.
Perform feature engineering and ingest features into SageMaker Feature Store
In this section, we present the steps for all three options to perform feature engineering and ingest processed features into SageMaker Feature Store.
Option A: Use SageMaker Studio with a serverless AWS Glue interactive session
Complete the following steps for option A:
- In SageMaker Studio, open the notebook 4a-glue-int-session.ipynb.
- If you are prompted to choose a kernel, choose SparkAnalytics 2.0 as the image and Glue Python [PySpark and Ray] as the kernel, then choose Select.
The environment preparation process may take some time to complete.
Option B: Use a SageMaker Processing job with Spark
In this option, we use a SageMaker Processing job with a Spark script to load the original dataset from Amazon Redshift, perform feature engineering, and ingest the data into SageMaker Feature Store. To do so, open the notebook 4b-processing-rs-to-fs.ipynb in your SageMaker Studio environment.
Here we use RedshiftDatasetDefinition to retrieve the dataset from the Redshift cluster. RedshiftDatasetDefinition is one type of input of the processing job, which provides a simple interface for practitioners to configure Redshift connection-related parameters such as identifier, database, table, query string, and more. You can easily establish your Redshift connection using RedshiftDatasetDefinition without maintaining a connection full time. We also use the SageMaker Feature Store Spark connector library in the processing job to connect to SageMaker Feature Store in a distributed environment. With this Spark connector, you can easily ingest data to the feature group’s online and offline store from a Spark DataFrame. Also, this connector contains the functionality to automatically load feature definitions to help with creating feature groups. Above all, this solution offers you a native Spark way to implement an end-to-end data pipeline from Amazon Redshift to SageMaker. You can perform any feature engineering in a Spark context and ingest final features into SageMaker Feature Store in just one Spark project.
To use the SageMaker Feature Store Spark connector, we extend a pre-built SageMaker Spark container with sagemaker-feature-store-pyspark installed. In the Spark script, use the system executable to run pip install, install this library in your local environment, and get the local path of the JAR file dependency. In the processing job API, pass this path to the submit_jars parameter so that the JAR is made available to the nodes of the Spark cluster that the processing job creates.
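The install-and-locate step described above can be sketched as follows. This is a minimal illustration, not the notebook's script: the helper names are hypothetical, and it assumes the installed package exposes `feature_store_pyspark.classpath_jars()` for returning the local JAR paths.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the pip command that runs under the current Python executable."""
    return [sys.executable, "-m", "pip", "install", package]


def install_and_locate_jars(package="sagemaker-feature-store-pyspark"):
    """Install the connector, then return the local paths of its JAR files.

    Assumption: the package provides feature_store_pyspark.classpath_jars().
    """
    subprocess.check_call(pip_install_command(package))
    import feature_store_pyspark  # importable only after the install above

    return feature_store_pyspark.classpath_jars()
```

The returned paths are what you would pass to the submit_jars parameter of the processing job.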
In the Spark script for the processing job, we first read the original dataset files from Amazon S3, which serves as intermediate storage for the dataset unloaded from Amazon Redshift. Then we perform feature engineering with Spark and use feature_store_pyspark to ingest data into the offline feature store.
For the processing job, we provide a ProcessingInput with a redshift_dataset_definition. Here we build a structure according to the interface, providing Redshift connection-related configurations. You can use query_string to filter your dataset by SQL and unload it to Amazon S3. See the following code:
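As a rough sketch of what such an input can look like, the following helper builds one input entry in the shape that the CreateProcessingJob API accepts for a Redshift dataset definition. The helper name, the local path, and every value passed in the usage lines (cluster identifier, database, user, role ARN, bucket, and query) are placeholders, not the values used in the notebook.

```python
def redshift_processing_input(cluster_id, database, db_user, role_arn,
                              query_string, output_s3_uri,
                              local_path="/opt/ml/processing/input/rdd"):
    """Build one ProcessingInputs entry backed by a Redshift query."""
    return {
        "InputName": "redshift_dataset_definition",
        "AppManaged": True,
        "DatasetDefinition": {
            "LocalPath": local_path,
            "DataDistributionType": "FullyReplicated",
            "InputMode": "File",
            "RedshiftDatasetDefinition": {
                "ClusterId": cluster_id,
                "Database": database,
                "DbUser": db_user,
                # The SQL filter; the result is unloaded to OutputS3Uri.
                "QueryString": query_string,
                "ClusterRoleArn": role_arn,
                "OutputS3Uri": output_s3_uri,
                "OutputFormat": "PARQUET",
            },
        },
    }


# Placeholder values for illustration only.
example_input = redshift_processing_input(
    cluster_id="my-redshift-cluster",
    database="dev",
    db_user="awsuser",
    role_arn="arn:aws:iam::123456789012:role/my-redshift-role",
    query_string="SELECT * FROM users",
    output_s3_uri="s3://my-bucket/redshift-unload/users/",
)
```

With the SageMaker Python SDK, the same fields map onto the RedshiftDatasetDefinition, DatasetDefinition, and ProcessingInput classes.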
You need to wait 6–7 minutes for each processing job, including the jobs for the USER, PLACE, and RATING datasets.
For more details about SageMaker Processing jobs, refer to Process data.
For a SageMaker native solution for feature processing from Amazon Redshift, you can also use Feature Processing in SageMaker Feature Store, which manages the underlying infrastructure, including provisioning the compute environments and creating and maintaining SageMaker pipelines to load and ingest data. You only need to focus on your feature processor definitions, which include the transformation functions, the Amazon Redshift source, and the SageMaker Feature Store sink. The scheduling, job management, and other workloads in production are managed by SageMaker. Feature Processor pipelines are SageMaker pipelines, so the standard monitoring mechanisms and integrations are available.
Option C: Use SageMaker Data Wrangler
SageMaker Data Wrangler allows you to import data from various data sources including Amazon Redshift for a low-code/no-code way to prepare, transform, and featurize your data. After you finish data preparation, you can use SageMaker Data Wrangler to export features to SageMaker Feature Store.
Some AWS Identity and Access Management (IAM) settings are required to allow SageMaker Data Wrangler to connect to Amazon Redshift. First, create an IAM role (for example, redshift-s3-dw-connect) that includes an Amazon S3 access policy. For this post, we attached the AmazonS3FullAccess policy to the IAM role. If you need to restrict access to a specific S3 bucket, you can define that in the Amazon S3 access policy. We attached the IAM role to the Redshift cluster that we created earlier. Next, create a policy for SageMaker to access Amazon Redshift by getting its cluster credentials, and attach the policy to the SageMaker IAM role. The policy looks like the following code:
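A policy of this kind can be sketched as below. This is an illustrative assumption about its shape rather than the exact policy from the post: it grants redshift:GetClusterCredentials on one database user and one database, and the Region, account ID, cluster, user, and database names are placeholders.

```python
import json


def redshift_credentials_policy(region, account_id, cluster_id, db_user, db_name):
    """Build an IAM policy document that allows fetching cluster credentials."""
    prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    f"{prefix}:dbuser:{cluster_id}/{db_user}",
                    f"{prefix}:dbname:{cluster_id}/{db_name}",
                ],
            }
        ],
    }


# Placeholder identifiers for illustration only.
policy = redshift_credentials_policy(
    "us-east-1", "123456789012", "my-redshift-cluster", "awsuser", "dev"
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```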
After this setup, SageMaker Data Wrangler allows you to query Amazon Redshift and output the results into an S3 bucket. For instructions to connect to a Redshift cluster and query and import data from Amazon Redshift to SageMaker Data Wrangler, refer to Import data from Amazon Redshift.
SageMaker Data Wrangler offers a selection of over 300 pre-built data transformations for common use cases such as deleting duplicate rows, imputing missing data, one-hot encoding, and handling time series data. You can also add custom transformations in pandas or PySpark. In our example, we applied some transformations such as drop column, data type enforcement, and ordinal encoding to the data.
When your data flow is complete, you can export it to SageMaker Feature Store. At this point, you need to create a feature group: give the feature group a name, select both online and offline storage, provide the name of an S3 bucket to use for the offline store, and provide a role that has SageMaker Feature Store access. Finally, you can create a job, which creates a SageMaker Processing job that runs the SageMaker Data Wrangler flow to ingest features from the Redshift data source to your feature group.
Here is one end-to-end data flow in the scenario of PLACE feature engineering.
Use SageMaker Feature Store for model training and prediction
To use SageMaker Feature Store for model training and prediction, open the notebook 5-classification-using-feature-groups.ipynb.
After the Redshift data is transformed into features and ingested into SageMaker Feature Store, the features are available for search and discovery across teams of data scientists responsible for many independent ML models and use cases. These teams can use the features for modeling without having to rebuild or rerun feature engineering pipelines. Feature groups are managed and scaled independently, and can be reused and joined together regardless of the upstream data source.
The next step is to build ML models using features selected from one or multiple feature groups. You decide which feature groups to use for your models. There are two options to create an ML dataset from feature groups, both utilizing the SageMaker Python SDK:
- Use the SageMaker Feature Store DatasetBuilder API – The SageMaker Feature Store DatasetBuilder API allows data scientists to create ML datasets from one or more feature groups in the offline store. You can use the API to create a dataset from a single or multiple feature groups, and output it as a CSV file or a pandas DataFrame. See the following example code:
- Run SQL queries using the athena_query function in the FeatureGroup API – Another option is to use the auto-built AWS Glue Data Catalog for the FeatureGroup API. The FeatureGroup API includes an athena_query function that creates an AthenaQuery instance to run user-defined SQL query strings. Then you run the Athena query and organize the query result into a pandas DataFrame. This option allows you to specify more complicated SQL queries to extract information from a feature group. See the following example code:
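The second option can be sketched with a small helper. It assumes it is handed a SageMaker Python SDK FeatureGroup object, whose athena_query() result exposes table_name, run(), wait(), and as_dataframe(). The helper name, the SQL template, and the S3 location in the usage comment are hypothetical.

```python
def query_feature_group(feature_group, sql_template, output_location):
    """Run a SQL query against a feature group's offline store via Athena.

    sql_template must contain a {table} placeholder, which is filled with
    the Glue table name that backs the feature group.
    """
    query = feature_group.athena_query()
    sql = sql_template.format(table=query.table_name)
    query.run(query_string=sql, output_location=output_location)
    query.wait()  # block until Athena has finished the query
    return query.as_dataframe()


# Usage, assuming place_feature_group is an existing FeatureGroup:
# df = query_feature_group(
#     place_feature_group,
#     'SELECT * FROM "{table}" LIMIT 100',
#     "s3://my-bucket/athena-results/",
# )
```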
Next, we can merge the queried data from different feature groups into our final dataset for model training and testing. For this post, we use batch transform for model inference. Batch transform allows you to run model inference on data in bulk in Amazon S3, and its inference result is stored in Amazon S3 as well. For details on model training and inference, refer to the notebook 5-classification-using-feature-groups.ipynb.
Run a join query on prediction results in Amazon Redshift
Lastly, we query the inference result and join it with original user profiles in Amazon Redshift. To do this, we use Amazon Redshift Spectrum to join batch prediction results in Amazon S3 with the original Redshift data. For details, refer to the notebook 6-read-results-in-redshift.ipynb.
Clean up
In this section, we provide the steps to clean up the resources created as part of this post to avoid ongoing charges.
Shut down SageMaker Apps
Complete the following steps to shut down your resources:
- In SageMaker Studio, on the File menu, choose Shut Down.
- In the Shutdown confirmation dialog, choose Shutdown All to proceed.
- After you get the “Server stopped” message, you can close this tab.
Delete the apps
Complete the following steps to delete your apps:
- On the SageMaker console, in the navigation pane, choose Domains.
- On the Domains page, choose SageMakerDemoDomain.
- On the domain details page, under User profiles, choose the user sagemakerdemouser.
- In the Apps section, in the Action column, choose Delete app for any active apps.
- Ensure that the Status column says Deleted for all the apps.
Delete the EFS storage volume associated with your SageMaker domain
Locate your EFS volume on the SageMaker console and delete it. For instructions, refer to Manage Your Amazon EFS Storage Volume in SageMaker Studio.
Delete default S3 buckets for SageMaker
Delete the default S3 buckets (sagemaker-<region-code>-<acct-id>) for SageMaker if you are not using SageMaker in that Region.
Delete the CloudFormation stack
Delete the CloudFormation stack in your AWS account so as to clean up all related resources.
Conclusion
In this post, we demonstrated an end-to-end data and ML flow from a Redshift data warehouse to SageMaker. You can easily use AWS native integration of purpose-built engines to go through the data journey seamlessly. Check out the AWS Blog for more practices about building ML features from a modern data warehouse.
About the Authors
Akhilesh Dube, a Senior Analytics Solutions Architect at AWS, possesses more than two decades of expertise in working with databases and analytics products. His primary role involves collaborating with enterprise clients to design robust data analytics solutions while offering comprehensive technical guidance on a wide range of AWS Analytics and AI/ML services.
Ren Guo is a Senior Data Specialist Solutions Architect in the domains of generative AI, analytics, and traditional AI/ML at AWS, Greater China Region.
Sherry Ding is a Senior AI/ML Specialist Solutions Architect. She has extensive experience in machine learning with a PhD degree in Computer Science. She mainly works with Public Sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.
Unlocking efficiency: Harnessing the power of Selective Execution in Amazon SageMaker Pipelines
MLOps is a key discipline that often oversees the path to productionizing machine learning (ML) models. It’s natural to focus on a single model that you want to train and deploy. However, in reality, you’ll likely work with dozens or even hundreds of models, and the process may involve multiple complex steps. Therefore, it’s important to have the infrastructure in place to track, train, deploy, and monitor models with varying complexities at scale. This is where MLOps tooling comes in. MLOps tooling helps you repeatably and reliably build and simplify these processes into a workflow that is tailored for ML.
Amazon SageMaker Pipelines, a feature of Amazon SageMaker, is a purpose-built workflow orchestration service for ML that helps you automate end-to-end ML workflows at scale. It simplifies the development and maintenance of ML models by providing a centralized platform to orchestrate tasks such as data preparation, model training, tuning and validation. SageMaker Pipelines can help you streamline workflow management, accelerate experimentation and retrain models more easily.
In this post, we spotlight an exciting new feature of SageMaker Pipelines known as Selective Execution. This new feature empowers you to selectively run specific portions of your ML workflow, resulting in significant time and compute resource savings by limiting the run to pipeline steps in scope and eliminating the need to run steps out of scope. Furthermore, we explore various use cases where the advantages of utilizing Selective Execution become evident, further solidifying its value proposition.
Solution overview
SageMaker Pipelines continues to innovate its developer experience with the release of Selective Execution. ML builders now have the ability to choose specific steps to run within a pipeline, eliminating the need to rerun the entire pipeline. This feature enables you to rerun specific sections of the pipeline while modifying the runtime parameters associated with the selected steps.
It’s important to note that the selected steps may rely on the results of non-selected steps. In such cases, the outputs of these non-selected steps are reused from a reference run of the current pipeline version. This means that the reference run must have already completed. The default reference run is the latest run of the current pipeline version, but you can also choose to use a different run of the current pipeline version as a reference.
The overall state of the reference run must be Succeeded, Failed, or Stopped. It cannot still be running when Selective Execution attempts to use its outputs. When using Selective Execution, you can choose any number of steps to run, as long as they form a contiguous portion of the pipeline.
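The state requirement on the reference run can be checked before starting a selective run. The sketch below assumes execution summaries in the shape returned by the list-executions call (each with PipelineExecutionStatus and PipelineExecutionArn keys), ordered newest first. The helper name and the sample ARNs are hypothetical.

```python
# Terminal states in which a run can serve as a reference run.
ELIGIBLE_STATUSES = {"Succeeded", "Failed", "Stopped"}


def pick_reference_execution(execution_summaries):
    """Return the ARN of the first eligible execution, or None if there is none."""
    for summary in execution_summaries:
        if summary["PipelineExecutionStatus"] in ELIGIBLE_STATUSES:
            return summary["PipelineExecutionArn"]
    return None


# Hypothetical summaries: the newest run is still executing, so it is skipped.
summaries = [
    {"PipelineExecutionStatus": "Executing", "PipelineExecutionArn": "arn:example:run-3"},
    {"PipelineExecutionStatus": "Failed", "PipelineExecutionArn": "arn:example:run-2"},
    {"PipelineExecutionStatus": "Succeeded", "PipelineExecutionArn": "arn:example:run-1"},
]
reference_arn = pick_reference_execution(summaries)
```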
The following diagram illustrates the pipeline behavior with a full run.
The following diagram illustrates the pipeline behavior using Selective Execution.
In the following sections, we show how to use Selective Execution for various scenarios, including complex workflows in pipeline Directed Acyclic Graphs (DAGs).
Prerequisites
To start experimenting with Selective Execution, we need to first set up the following components of your SageMaker environment:
- SageMaker Python SDK – Ensure that you have an updated SageMaker Python SDK installed in your Python environment. You can run one of the following commands from your notebook or terminal to install or upgrade the SageMaker Python SDK to version 2.162.0 or higher. The version specifier is quoted so that the shell does not treat > as an output redirection:
python3 -m pip install "sagemaker>=2.162.0"
or
pip3 install "sagemaker>=2.162.0"
- Access to SageMaker Studio (optional) – Amazon SageMaker Studio can be helpful for visualizing pipeline runs and interacting with preexisting pipeline ARNs visually. If you don’t have access to SageMaker Studio or are using on-demand notebooks or other IDEs, you can still follow this post and interact with your pipeline ARNs using the Python SDK.
The sample code for a full end-to-end walkthrough is available in the GitHub repo.
Setup
With version 2.162.0 of the SageMaker Python SDK (sagemaker>=2.162.0), we introduced the SelectiveExecutionConfig class as part of the sagemaker.workflow.selective_execution_config module. The Selective Execution feature relies on a pipeline execution ARN whose run has previously been marked as Succeeded, Failed, or Stopped. The following code snippet demonstrates how to import the SelectiveExecutionConfig class, retrieve the reference pipeline execution ARN, and gather the associated pipeline steps and runtime parameters governing the pipeline run:
import boto3
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.selective_execution_config import SelectiveExecutionConfig
sm_client = boto3.client('sagemaker')
# reference the name of your sample pipeline
pipeline_name = "AbalonePipeline"
# keep only pipeline executions that previously succeeded
pipeline_executions = [
    _exec
    for _exec in Pipeline(name=pipeline_name).list_executions()['PipelineExecutionSummaries']
    if _exec['PipelineExecutionStatus'] == "Succeeded"
]
# get the last successful execution
latest_pipeline_arn = pipeline_executions[0]['PipelineExecutionArn']
print(latest_pipeline_arn)
>>> arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/x62pbar3gs6h
# list all steps of your sample pipeline
execution_steps = sm_client.list_pipeline_execution_steps(
    PipelineExecutionArn=latest_pipeline_arn
)['PipelineExecutionSteps']
print(execution_steps)
>>>
[{'StepName': 'Abalone-Preprocess',
'StartTime': datetime.datetime(2023, 6, 27, 4, 41, 30, 519000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 41, 30, 986000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:123123123123:processing-job/pipelines-fvsmu7m7ki3q-Abalone-Preprocess-d68CecvHLU'}},
'SelectiveExecutionResult': {'SourcePipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/ksm2mjwut6oz'}},
{'StepName': 'Abalone-Train',
'StartTime': datetime.datetime(2023, 6, 27, 4, 41, 31, 320000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 43, 58, 224000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'TrainingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:123123123123:training-job/pipelines-x62pbar3gs6h-Abalone-Train-PKhAc1Q6lx'}}},
{'StepName': 'Abalone-Evaluate',
'StartTime': datetime.datetime(2023, 6, 27, 4, 43, 59, 40000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 57, 43, 76000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:123123123123:processing-job/pipelines-x62pbar3gs6h-Abalone-Evaluate-vmkZDKDwhk'}}},
{'StepName': 'Abalone-MSECheck',
'StartTime': datetime.datetime(2023, 6, 27, 4, 57, 43, 821000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 57, 44, 124000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'Condition': {'Outcome': 'True'}}}]
# list all configurable pipeline parameters
# params can be altered during selective execution
parameters = sm_client.list_pipeline_parameters_for_execution(
    PipelineExecutionArn=latest_pipeline_arn
)['PipelineParameters']
print(parameters)
>>>
[{'Name': 'XGBNumRounds', 'Value': '120'},
{'Name': 'XGBSubSample', 'Value': '0.9'},
{'Name': 'XGBGamma', 'Value': '2'},
{'Name': 'TrainingInstanceCount', 'Value': '1'},
{'Name': 'XGBMinChildWeight', 'Value': '4'},
{'Name': 'XGBETA', 'Value': '0.25'},
{'Name': 'ApprovalStatus', 'Value': 'PendingManualApproval'},
{'Name': 'ProcessingInstanceCount', 'Value': '1'},
{'Name': 'ProcessingInstanceType', 'Value': 'ml.t3.medium'},
{'Name': 'MseThreshold', 'Value': '6'},
{'Name': 'ModelPath',
'Value': 's3://sagemaker-us-east-1-123123123123/Abalone/models/'},
{'Name': 'XGBMaxDepth', 'Value': '12'},
{'Name': 'TrainingInstanceType', 'Value': 'ml.c5.xlarge'},
{'Name': 'InputData',
'Value': 's3://sagemaker-us-east-1-123123123123/sample-dataset/abalone/abalone.csv'}]
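The parameter list above can be turned into the name-to-value mapping that a pipeline start call takes, with a few values changed. The following is a minimal sketch: the helper name is hypothetical, and it assumes the Name/Value list shape printed above.

```python
def override_parameters(parameters, **overrides):
    """Merge overrides into the parameter list of a previous execution.

    Raises KeyError for names the reference execution did not define,
    which catches typos before a run is started.
    """
    merged = {p["Name"]: p["Value"] for p in parameters}
    unknown = set(overrides) - set(merged)
    if unknown:
        raise KeyError(f"Unknown pipeline parameters: {sorted(unknown)}")
    merged.update(overrides)
    return merged


# A shortened copy of the parameter list shown above.
previous = [
    {"Name": "XGBNumRounds", "Value": "120"},
    {"Name": "XGBMaxDepth", "Value": "12"},
]
new_parameters = override_parameters(previous, XGBMaxDepth=8)
```

Here new_parameters keeps XGBNumRounds from the reference execution and replaces only XGBMaxDepth.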
Use cases
In this section, we present a few scenarios where Selective Execution can potentially save time and resources. We use a typical pipeline flow, which includes steps such as data extraction, training, evaluation, model registration and deployment, as a reference to demonstrate the advantages of Selective Execution.
SageMaker Pipelines allows you to define runtime parameters for your pipeline run using pipeline parameters. When a new run is triggered, it typically runs the entire pipeline from start to finish. However, if step caching is enabled, SageMaker Pipelines will attempt to find a previous run of the current pipeline step with the same attribute values. If a match is found, SageMaker Pipelines will use the outputs from the previous run instead of recomputing the step. Note that even with step caching enabled, SageMaker Pipelines will still run the entire workflow to the end by default.
With the release of the Selective Execution feature, you can now rerun an entire pipeline workflow or selectively run a subset of steps using a prior pipeline ARN. This can be done even without step caching enabled. The following use cases illustrate the various ways you can use Selective Execution.
Use case 1: Run a single step
Data scientists often focus on the training stage of an MLOps pipeline and don’t want to worry about the preprocessing or deployment steps. Selective Execution allows data scientists to focus on just the training step and modify training parameters or hyperparameters on the fly to improve the model. This can save time and reduce cost because compute resources are only utilized for running user-selected pipeline steps. See the following code:
# load the existing pipeline by name (pipeline_name is defined in the Setup section)
pipeline = Pipeline(name=pipeline_name)
# select a reference pipeline execution ARN and the subset of steps to run
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/9e3ljoql7s0n",
    selected_steps=["Abalone-Train"]
)
# start execution of pipeline subset
select_execution = pipeline.start(
    selective_execution_config=selective_execution_config,
    parameters={
        "XGBNumRounds": 120,
        "XGBSubSample": 0.9,
        "XGBGamma": 2,
        "XGBMinChildWeight": 4,
        "XGBETA": 0.25,
        "XGBMaxDepth": 12
    }
)
The following figures illustrate the pipeline with one step in process and then complete.
Use case 2: Run multiple contiguous pipeline steps
Continuing with the previous use case, a data scientist wants to train a new model and evaluate its performance against a golden test dataset. This evaluation is crucial to ensure that the model meets rigorous guidelines for user acceptance testing (UAT) or production deployment. However, the data scientist doesn’t want to run the entire pipeline workflow or deploy the model. They can use Selective Execution to focus solely on the training and evaluation steps, saving time and resources while still getting the validation results they need:
# select a reference pipeline execution ARN and the subset of steps to run
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/9e3ljoql7s0n",
    selected_steps=["Abalone-Train", "Abalone-Evaluate"]
)
# start execution of pipeline subset
select_execution = pipeline.start(
    selective_execution_config=selective_execution_config,
    parameters={
        "ProcessingInstanceType": "ml.t3.medium",
        "XGBNumRounds": 120,
        "XGBSubSample": 0.9,
        "XGBGamma": 2,
        "XGBMinChildWeight": 4,
        "XGBETA": 0.25,
        "XGBMaxDepth": 12
    }
)
Use case 3: Update and rerun failed pipeline steps
You can use Selective Execution to rerun failed steps within a pipeline or resume the run of a pipeline from a failed step onwards. This can be useful for troubleshooting and debugging failed steps because it allows developers to focus on the specific issues that need to be addressed. This can lead to more efficient problem-solving and faster iteration times. The following example illustrates how you can choose to rerun just the failed step of a pipeline.
# select a previously failed pipeline execution ARN
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/fvsmu7m7ki3q",
    selected_steps=["Abalone-Evaluate"]
)
# start execution of failed pipeline subset
select_execution = pipeline.start(
    selective_execution_config=selective_execution_config
)
Alternatively, a data scientist can resume a pipeline from a failed step to the end of the workflow by specifying the failed step and all the steps that follow it in the SelectiveExecutionConfig.
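Building that step list by hand is error-prone, so it can help to derive it. The sketch below assumes a purely linear pipeline whose step order you already know (for example, from the pipeline definition), plus step records in the shape returned by list_pipeline_execution_steps. The helper name is hypothetical, and a pipeline with branches would need a proper graph traversal instead.

```python
def steps_from_first_failure(ordered_step_names, execution_steps):
    """Return the first failed step and every step after it, in pipeline order.

    ordered_step_names: all step names of a linear pipeline, first to last.
    execution_steps: dicts with StepName and StepStatus from a previous run.
    """
    status = {s["StepName"]: s["StepStatus"] for s in execution_steps}
    for index, name in enumerate(ordered_step_names):
        if status.get(name) == "Failed":
            return ordered_step_names[index:]
    return []  # nothing failed, so there is nothing to resume


# Hypothetical previous run in which the evaluation step failed.
order = ["Abalone-Preprocess", "Abalone-Train", "Abalone-Evaluate", "Abalone-MSECheck"]
previous_steps = [
    {"StepName": "Abalone-Preprocess", "StepStatus": "Succeeded"},
    {"StepName": "Abalone-Train", "StepStatus": "Succeeded"},
    {"StepName": "Abalone-Evaluate", "StepStatus": "Failed"},
]
steps_to_rerun = steps_from_first_failure(order, previous_steps)
```

The resulting list can be passed as selected_steps to SelectiveExecutionConfig.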
Use case 4: Pipeline coverage
In some pipelines, certain branches are less frequently run than others. For example, there might be a branch that only runs when a specific condition fails. It’s important to test these branches thoroughly to ensure that they work as expected when a failure does occur. By testing these less frequently run branches, developers can verify that their pipeline is robust and that error-handling mechanisms effectively maintain the desired workflow and produce reliable results.
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/9e3ljoql7s0n",
    selected_steps=["Abalone-Train", "Abalone-Evaluate", "Abalone-MSECheck", "Abalone-FailNotify"]
)
Conclusion
In this post, we discussed the Selective Execution feature of SageMaker Pipelines, which empowers you to selectively run specific steps of your ML workflows. This capability leads to significant time and computational resource savings. We provided some sample code in the GitHub repo that demonstrates how to use Selective Execution and presented various scenarios where it can be advantageous for users. If you would like to learn more about Selective Execution, refer to our Developer Guide and API Reference Guide.
To explore the available steps within the SageMaker Pipelines workflow in more detail, refer to Amazon SageMaker Model Building Pipeline and SageMaker Workflows. Additionally, you can find more examples showcasing different use cases and implementation approaches using SageMaker Pipelines in the AWS SageMaker Examples GitHub repository. These resources can further enhance your understanding and help you take advantage of the full potential of SageMaker Pipelines and Selective Execution in your current and future ML projects.
About the Authors
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.
Akhil Numarsu is a Sr. Product Manager-Technical focused on helping teams accelerate ML outcomes through efficient tools and services in the cloud. He enjoys playing table tennis and is a sports fan.
Nishant Krishnamoorthy is a Sr. Software Development Engineer with Amazon Stores. He holds a master's degree in Computer Science and currently focuses on accelerating ML adoption in different orgs within Amazon by building and operationalizing ML solutions on SageMaker.
Repairing interrupted questions makes voice agents more accessible
Learning to represent truncated sentences with semantic graphs improves models’ ability to infer missing content.
Train self-supervised vision transformers on overhead imagery with Amazon SageMaker
This is a guest blog post co-written with Ben Veasey, Jeremy Anderson, Jordan Knight, and June Li from Travelers.
Satellite and aerial images provide insight into a wide range of problems, including precision agriculture, insurance risk assessment, urban development, and disaster response. Training machine learning (ML) models to interpret this data, however, is bottlenecked by costly and time-consuming human annotation efforts. One way to overcome this challenge is through self-supervised learning (SSL). By training on large amounts of unlabeled image data, self-supervised models learn image representations that can be transferred to downstream tasks, such as image classification or segmentation. This approach produces image representations that generalize well to unseen data and reduces the amount of labeled data required to build performant downstream models.
In this post, we demonstrate how to train self-supervised vision transformers on overhead imagery using Amazon SageMaker. Travelers collaborated with the Amazon Machine Learning Solutions Lab (now known as the Generative AI Innovation Center) to develop this framework to support and enhance aerial imagery model use cases. Our solution is based on the DINO algorithm and uses the SageMaker distributed data parallel library (SMDDP) to split the data over multiple GPU instances. When pre-training is complete, the DINO image representations can be transferred to a variety of downstream tasks. This initiative led to improved model performance within the Travelers Data & Analytics space.
Overview of solution
The two-step process for pre-training vision transformers and transferring them to supervised downstream tasks is shown in the following diagram.
In the following sections, we provide a walkthrough of the solution using satellite images from the BigEarthNet-S2 dataset. We build on the code provided in the DINO repository.
Prerequisites
Before getting started, you need access to a SageMaker notebook instance and an Amazon Simple Storage Service (Amazon S3) bucket.
Prepare the BigEarthNet-S2 dataset
BigEarthNet-S2 is a benchmark archive that contains 590,325 multispectral images collected by the Sentinel-2 satellite. The images document the land cover, or physical surface features, of ten European countries between June 2017 and May 2018. The types of land cover in each image, such as pastures or forests, are annotated according to 19 labels. The following are a few example RGB images and their labels.
The first step in our workflow is to prepare the BigEarthNet-S2 dataset for DINO training and evaluation. We start by downloading the dataset from the terminal of our SageMaker notebook instance:
The dataset has a size of about 109 GB. Each image is stored in its own folder and contains 12 spectral channels. Two bands with 60m spatial resolution (60-meter pixel height/width) are designed to identify aerosols (B01) and water vapor (B09); the Sentinel-2 cirrus cloud band (B10) is not included in the archive. Six bands with 20m spatial resolution are used to identify vegetation (B05, B06, B07, B8A) and distinguish between snow, ice, and clouds (B11, B12). Four bands with 10m spatial resolution capture visible and near-infrared light (B02, B03, B04, B08). Additionally, each folder contains a JSON file with the image metadata. A detailed description of the data is provided in the BigEarthNet Guide.
To perform statistical analyses of the data and load images during DINO training, we process the individual metadata files into a common geopandas Parquet file. This can be done using the BigEarthNet Common and the BigEarthNet GDF Builder helper packages:
The resulting metadata file contains the recommended image set, which excludes 71,042 images that are fully covered by seasonal snow, clouds, and cloud shadows. It also contains information on the acquisition date, location, land cover, and train, validation, and test split for each image.
We store the BigEarthNet-S2 images and metadata file in an S3 bucket. Because we use true color images during DINO training, we only upload the red (B04), green (B03), and blue (B02) bands:
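The band selection itself can be sketched in a few lines of plain Python. This is a minimal illustration, not the upload code used for this post: it assumes that files are named `<patch>_<band>.tif`, and the folder name, file listing, and helper is_rgb_band are hypothetical.

```python
from pathlib import PurePosixPath

# Band codes for true color imagery, as named in the text above.
RGB_BANDS = ("B02", "B03", "B04")


def is_rgb_band(path):
    """True if the file name ends with one of the RGB band suffixes.

    Assumes files are named <patch>_<band>.tif, which is an assumption
    about the on-disk layout made for this sketch.
    """
    p = PurePosixPath(path)
    return p.suffix == ".tif" and p.stem.rsplit("_", 1)[-1] in RGB_BANDS


# Hypothetical file listing for a single image folder.
files = [
    "patch_0_45/patch_0_45_B01.tif",
    "patch_0_45/patch_0_45_B02.tif",
    "patch_0_45/patch_0_45_B03.tif",
    "patch_0_45/patch_0_45_B04.tif",
    "patch_0_45/patch_0_45_B8A.tif",
    "patch_0_45/patch_0_45_labels_metadata.json",
]

# Only these files would be copied to the S3 bucket.
to_upload = [f for f in files if is_rgb_band(f)]
```

Filtering before the upload, rather than after, keeps the unused bands out of Amazon S3 altogether.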
The dataset is approximately 48 GB in size and has the following structure:
Train DINO models with SageMaker
Now that our dataset has been uploaded to Amazon S3, we move to train DINO models on BigEarthNet-S2. As shown in the following figure, the DINO algorithm passes different global and local crops of an input image to student and teacher networks. The student network is taught to match the output of the teacher network by minimizing the cross-entropy loss. The student and teacher weights are connected by an exponential moving average (EMA).
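To make the two ingredients named above concrete, the following is a minimal sketch of the cross-entropy between teacher and student outputs and of the EMA weight update, written in plain Python on lists of floats. It is an illustration only, not the DINO implementation: it omits the centering of teacher outputs and the multi-crop loop, and the temperature and momentum values are illustrative assumptions.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def dino_cross_entropy(student_logits, teacher_logits,
                       student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(p_teacher, p_student) for one pair of crops."""
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


# The loss is small when the student agrees with the teacher and large otherwise.
loss_match = dino_cross_entropy([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
loss_mismatch = dino_cross_entropy([0.0, 2.0, 0.0], [2.0, 0.0, 0.0])
```

Only the student is updated by gradient descent on this loss; the teacher weights change solely through the EMA step, which is why the two networks stay close without collapsing into each other.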
We make two modifications to the original DINO code. First, we create a custom PyTorch dataset class to load the BigEarthNet-S2 images. The code was initially written to process ImageNet data and expects images to be stored by class. BigEarthNet-S2, however, is a multi-label dataset where each image resides in its own subfolder. Our dataset class loads each image using the file path stored in the metadata:
This dataset class is called in main_dino.py during training. Although the code includes a function to one-hot encode the land cover labels, these labels are not used by the DINO algorithm.
The second change we make to the DINO code is to add support for SMDDP. We add the following code to the init_distributed_mode function in the utils.py file:
With these adjustments, we are ready to train DINO models on BigEarthNet-S2 using SageMaker. To train on multiple GPUs or instances, we create a SageMaker PyTorch Estimator that ingests the DINO training script, the image and metadata file paths, and the training hyperparameters:
This code specifies that we will train a small vision transformer model (21 million parameters) with a patch size of 16 for 100 epochs. It is best practice to create a new checkpoint_s3_uri for each training job in order to reduce the initial data download time. Because we are using SMDDP, we must train on an ml.p3.16xlarge, ml.p3dn.24xlarge, or ml.p4d.24xlarge instance. This is because SMDDP is only enabled for the largest multi-GPU instances. To train on smaller instance types without SMDDP, you will need to remove the distribution and debugger_hook_config arguments from the estimator.
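In script mode, the entries of the estimator's hyperparameters dictionary reach the training script as command-line arguments. The following sketch shows that hand-off for the three settings described above. The key names mirror flags of the DINO training script but should be checked against the version of main_dino.py you use; to_cli_args and parse_args are hypothetical helpers written for this illustration.

```python
import argparse

# Values taken from the description above: a small vision transformer,
# patch size 16, 100 epochs.
hyperparameters = {"arch": "vit_small", "patch_size": 16, "epochs": 100}


def to_cli_args(params):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    args = []
    for key, value in params.items():
        args += [f"--{key}", str(value)]
    return args


def parse_args(argv):
    """Stand-in for the training script's own argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--arch", type=str, default="vit_small")
    parser.add_argument("--patch_size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=100)
    return parser.parse_args(argv)


# What the training script would see after parsing its arguments.
args = parse_args(to_cli_args(hyperparameters))
```

Because values arrive as strings, the parser's type arguments are what restore patch_size and epochs to integers inside the script.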
After we have created the SageMaker PyTorch Estimator, we launch the training job by calling the fit method. We specify the input training data using the Amazon S3 URIs for the BigEarthNet-S2 metadata and images:
SageMaker spins up the instance, copies the training script and dependencies, and begins DINO training. We can monitor the progress of the training job from our Jupyter notebook using the following commands:
We can also monitor instance metrics and view log files on the SageMaker console under Training jobs. In the following figures, we plot the GPU utilization and loss function for a DINO model trained on an ml.p3.16xlarge instance with a batch size of 128.
During training, the GPU utilization is 83% of the ml.p3.16xlarge capacity (8 NVIDIA Tesla V100 GPUs) and the VRAM usage is 85%. The loss function steadily decreases with each epoch, indicating that the outputs of the student and teacher networks are becoming more similar. In total, training takes about 11 hours.
Transfer learning to downstream tasks
Our trained DINO model can be transferred to downstream tasks like image classification or segmentation. In this section, we use the pre-trained DINO features to predict the land cover classes for images in the BigEarthNet-S2 dataset. As depicted in the following diagram, we train a multi-label linear classifier on top of frozen DINO features. In this example, the input image is associated with arable land and pasture land covers.
Most of the code for the linear classifier is already in place in the original DINO repository. We make a few adjustments for our specific task. As before, we use the custom BigEarthNet dataset to load images during training and evaluation. The labels for the images are one-hot encoded as 19-dimensional binary vectors. We use the binary cross-entropy for the loss function and compute the average precision to evaluate the performance of the model.
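The label encoding, loss, and metric mentioned above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code in eval_linear.py: the class list is a hypothetical four-class subset of the 19 labels, and the metric shown is per-class average precision on ranked scores, which may differ in averaging details from the one used for the reported results.

```python
import math

# Hypothetical subset of class names; the real task uses 19 labels.
CLASSES = ["Arable land", "Pastures", "Coniferous forest", "Inland waters"]


def multi_hot(labels, classes=CLASSES):
    """Encode a list of label names as a binary vector over `classes`."""
    present = set(labels)
    return [1.0 if name in present else 0.0 for name in classes]


def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def average_precision(scores, targets):
    """Average precision for one class: mean precision at each positive hit."""
    ranked = sorted(zip(scores, targets), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# An image annotated with two land cover classes.
target = multi_hot(["Arable land", "Pastures"])
```

Binary cross-entropy treats each of the classes as an independent yes/no decision, which is what allows a single image to carry several land cover labels at once.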
To train the classifier, we create a SageMaker PyTorch Estimator that runs the training script, eval_linear.py. The training hyperparameters include the details of the DINO model architecture and the file path for the model checkpoint:
We start the training job using the fit method, supplying the Amazon S3 locations of the BigEarthNet-S2 metadata and training images and the DINO model checkpoint:
When training is complete, we can perform inference on the BigEarthNet-S2 test set using SageMaker batch transform or SageMaker Processing. In the following table, we compare the average precision of the linear model on test set images using two different DINO image representations. The first model, ViT-S/16 (ImageNet), is the small vision transformer checkpoint included in the DINO repository that was pre-trained using front-facing images in the ImageNet dataset. The second model, ViT-S/16 (BigEarthNet-S2), is the model we produced by pre-training on overhead imagery.
Model | Average precision |
---|---|
ViT-S/16 (ImageNet) | 0.685 |
ViT-S/16 (BigEarthNet-S2) | 0.732 |
We find that the DINO model pre-trained on BigEarthNet-S2 transfers better to the land cover classification task than the DINO model pre-trained on ImageNet, resulting in a 6.7% increase in the average precision.
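The size of the gain can be checked directly from the table. The short calculation below uses the rounded table values, so it gives an absolute gain of 0.047 and a relative gain of about 6.9%, slightly above the quoted 6.7%; the small difference presumably comes from rounding in the table, which is an assumption on our part.

```python
# Average precision values as printed in the table above (rounded).
imagenet_ap = 0.685
bigearthnet_ap = 0.732

absolute_gain = bigearthnet_ap - imagenet_ap
relative_gain = absolute_gain / imagenet_ap

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```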
Clean up
After completing DINO training and transfer learning, we can clean up our resources to avoid incurring charges. We stop or delete our notebook instance and remove any unwanted data or model artifacts from Amazon S3.
Conclusion
This post demonstrated how to train DINO models on overhead imagery using SageMaker. We used SageMaker PyTorch Estimators and SMDDP in order to generate representations of BigEarthNet-S2 images without the need for explicit labels. We then transferred the DINO features to a downstream image classification task, which involved predicting the land cover class of BigEarthNet-S2 images. For this task, pre-training on satellite imagery yielded a 6.7% increase in average precision relative to pre-training on ImageNet.
You can use this solution as a template for training DINO models on large-scale, unlabeled aerial and satellite imagery datasets. To learn more about DINO and building models on SageMaker, check out the following resources:
- Emerging Properties in Self-Supervised Vision Transformers
- Use PyTorch with Amazon SageMaker
- SageMaker’s Data Parallelism Library
About the Authors
Ben Veasey is a Senior Associate Data Scientist at Travelers, working within the AI & Automation Accelerator team. With a deep understanding of innovative AI technologies, including computer vision, natural language processing, and generative AI, Ben is dedicated to accelerating the adoption of these technologies to optimize business processes and drive efficiency at Travelers.
Jeremy Anderson is a Director & Data Scientist at Travelers on the AI & Automation Accelerator team. He is interested in solving business problems with the latest AI and deep learning techniques including large language models, foundational imagery models, and generative AI. Prior to Travelers, Jeremy earned a PhD in Molecular Biophysics from the Johns Hopkins University and also studied evolutionary biochemistry. Outside of work you can find him running, woodworking, or rewilding his yard.
Jordan Knight is a Senior Data Scientist working for Travelers in the Business Insurance Analytics & Research Department. His passion is for solving challenging real-world computer vision problems and exploring new state-of-the-art methods to do so. He has a particular interest in the social impact of ML models and how we can continue to improve modeling processes to develop ML solutions that are equitable for all. Jordan graduated from MIT with a Master’s in Business Analytics. In his free time you can find him either rock climbing, hiking, or continuing to develop his somewhat rudimentary cooking skills.
June Li is a data scientist on the Business Insurance Artificial Intelligence team at Travelers, where she leads and coordinates work in the AI imagery portfolio. She is passionate about implementing innovative AI solutions that bring substantial value to the business partners and stakeholders. Her work has been integral in transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.
Sourav Bhabesh is a Senior Applied Scientist at the AWS Titan Labs, where he builds Foundational Model (FM) capabilities and features. He specializes in natural language processing (NLP) and is passionate about deep learning. Outside of work he enjoys reading books and traveling.
Laura Kulowski is an Applied Scientist at Amazon’s Generative AI Innovation Center, where she works closely with customers to build generative AI solutions. In her free time, Laura enjoys exploring new places by bike.
Andrew Ang is a Sr. Machine Learning Engineer at AWS. In addition to helping customers build AI/ML solutions, he enjoys water sports, squash and watching travel & food vlogs.
Mehdi Noori is an Applied Science Manager at the Generative AI Innovation Center. With a passion for bridging technology and innovation, he assists AWS customers in unlocking the potential of generative AI, turning potential challenges into opportunities for rapid experimentation and innovation by focusing on scalable, measurable, and impactful uses of advanced AI technologies, and streamlining the path to production.
How Thomson Reuters developed Open Arena, an enterprise-grade large language model playground, in under 6 weeks
This post is cowritten by Shirsha Ray Chaudhuri, Harpreet Singh Baath, Rashmi B Pawar, and Palvika Bansal from Thomson Reuters.
Thomson Reuters (TR), a global content and technology-driven company, has been using artificial intelligence (AI) and machine learning (ML) in its professional information products for decades. Thomson Reuters Labs, the company’s dedicated innovation team, has been integral to its pioneering work in AI and natural language processing (NLP). A key milestone was the launch of Westlaw Is Natural (WIN) in 1992. This technology was one of the first of its kind, using NLP for more efficient and natural legal research. Fast forward to 2023, and Thomson Reuters continues to define the future of professionals through rapid innovation, creative solutions, and powerful technology.
The introduction of generative AI provides another opportunity for Thomson Reuters to work with customers and once again advance how they do their work, helping professionals draw insights and automate workflows, enabling them to focus their time where it matters most. While Thomson Reuters pushes the boundaries of what generative AI and other technologies could do for the modern professional, how is it using the power of this technology for its own teams?
Thomson Reuters is highly focused on driving awareness and understanding of AI among colleagues in every team and every business area. Starting from foundational principles of what AI is and how ML works, it’s delivering a rolling program of company-wide AI awareness sessions, including webinars, training materials, and panel discussions. During these sessions, ideas on how AI could be used started to surface as colleagues considered how AI tools could help them with their day-to-day tasks as well as serve their customers.
In this post, we discuss how Thomson Reuters Labs created Open Arena, Thomson Reuters’s enterprise-wide large language model (LLM) playground that was developed in collaboration with AWS. The original concept came out of an AI/ML Hackathon supported by Simone Zucchet (AWS Solutions Architect) and Tim Precious (AWS Account Manager) and was developed into production using AWS services in under 6 weeks with support from AWS. AWS-managed services such as AWS Lambda, Amazon DynamoDB, and Amazon SageMaker, as well as the pre-built Hugging Face Deep Learning Containers (DLCs), contributed to the pace of innovation. Open Arena has helped unlock company-wide experimentation with generative AI in a safe and controlled environment.
Diving deeper, Open Arena is a web-based playground that allows users to experiment with a growing set of tools enabled with LLMs. This provides non-programmatic access for Thomson Reuters employees who don’t have a background in coding but want to explore the art of the possible with generative AI at TR. Open Arena has been developed to provide quick answers from several sets of corpora, with solutions that support customer support agents, answer questions from websites, summarize and verify points in a document, and much more. The capabilities of Open Arena continue to grow as the experiences from employees across Thomson Reuters spur new ideas and as new trends emerge in the field of generative AI. This is all facilitated by the modular serverless AWS architecture that underpins the solution.
Envisioning the Open Arena
Thomson Reuters’s objective was clear: to build a safe, secure, user-friendly platform—an “open arena”—as an enterprise-wide playground. Here, internal teams could not only explore and test the various LLMs developed in-house and those from the open-source community such as with the AWS and Hugging Face partnership, but also discover unique use cases by merging the capabilities of LLMs with Thomson Reuters’s extensive company data. This kind of platform would enhance the ability of teams to generate innovative solutions, improving the products and services that Thomson Reuters could offer its clients.
The envisioned Open Arena platform would serve the diverse teams within Thomson Reuters globally, providing them with a playground to freely interact with LLMs. The ability to have this interaction in a controlled environment would allow teams to uncover new applications and methodologies that might not have been apparent in a less direct engagement with these complex models.
Building the Open Arena
Building the Open Arena was a multi-faceted process. We aimed to harness the capabilities of AWS’s serverless and ML services to craft a solution that would seamlessly enable Thomson Reuters employees to experiment with the latest LLMs. We saw the potential of these services not only to provide scalability and manageability but also to ensure cost-effectiveness.
Solution overview
From creating a robust environment for model deployment and fine-tuning to ensuring meticulous data management and providing a seamless user experience, TR needed each aspect to integrate with several AWS services. Open Arena’s architecture was designed to be comprehensive yet intuitive, balancing complexity with ease of use. The following diagram illustrates this architecture.
SageMaker served as the backbone, facilitating model deployment as SageMaker endpoints and providing a robust environment for fine-tuning the models. We capitalized on the Hugging Face on SageMaker DLC offered by AWS to enhance our deployment process. In addition, we used the SageMaker Hugging Face Inference Toolkit and the Accelerate library to accelerate the inference process and effectively handle the demands of running complex and resource-intensive models. These comprehensive tools were instrumental in ensuring the fast and seamless deployment of our LLMs. Lambda functions, triggered by Amazon API Gateway, managed the APIs, ensuring meticulous preprocessing and postprocessing of the data.
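The request flow described above, in which a Lambda function preprocesses the request, calls the model endpoint, and postprocesses the output, can be sketched as follows. This is a hypothetical illustration and not the Open Arena code: the field names prompt and answer, the helper names, and the assumed model output shape (a list with a generated_text entry, as returned by Hugging Face text generation pipelines) are all assumptions. The model client is injected so the sketch runs without AWS access.

```python
import json


def preprocess(body):
    """Validate the request and build the model payload."""
    prompt = (body.get("prompt") or "").strip()
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    return {"inputs": prompt}


def postprocess(model_output):
    """Reduce the raw model output to the text returned to the front end."""
    return {"answer": model_output[0]["generated_text"].strip()}


def make_handler(invoke_model):
    """Build a Lambda-style handler around an injected model client."""
    def handler(event, context=None):
        try:
            payload = preprocess(json.loads(event.get("body") or "{}"))
        except ValueError as err:  # also covers malformed JSON bodies
            return {"statusCode": 400, "body": json.dumps({"error": str(err)})}
        result = postprocess(invoke_model(payload))
        return {"statusCode": 200, "body": json.dumps(result)}
    return handler


def fake_model(payload):
    """Stand-in for a call to a SageMaker endpoint, used for local testing."""
    return [{"generated_text": " echo: " + payload["inputs"]}]


handler = make_handler(fake_model)
response = handler({"body": json.dumps({"prompt": " hello "})})
```

In a deployed function, invoke_model would wrap a call to the SageMaker runtime for the relevant endpoint; keeping it as a parameter lets the same handler front different models.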
In our quest to deliver a seamless user experience, we adopted a secure API Gateway to connect the front end hosted in Amazon Simple Storage Service (Amazon S3) to the Lambda backend. We deployed the front end as a static site on an S3 bucket, ensuring user authentication with the help of Amazon CloudFront and our company’s single sign-on mechanism.
Open Arena has been designed to integrate seamlessly with multiple LLMs through REST APIs. This ensured that the platform was flexible enough to react and integrate quickly as new state-of-the-art models were developed and released in the fast-paced generative AI space. From its inception, Open Arena was architected to provide a safe and secure enterprise AI/ML playground, so Thomson Reuters employees can experiment with state-of-the-art LLMs as soon as they are released. Using Hugging Face models on SageMaker allowed the team to fine-tune models in a secure environment because all data is encrypted and doesn’t leave the virtual private cloud (VPC), ensuring that data remains private and confidential.
DynamoDB, our chosen NoSQL database service, efficiently stored and managed a wide variety of data, including user queries, responses, response times, and user data. To streamline the development and deployment process, we employed AWS CodeBuild and AWS CodePipeline for continuous integration and continuous delivery (CI/CD). Monitoring the infrastructure and ensuring its optimal functioning was made possible with Amazon CloudWatch, which provided custom dashboards and comprehensive logging capabilities.
Model development and integration
The heart of Open Arena is its diverse assortment of LLMs, which comprise both open-source and in-house developed models. These models have been fine-tuned to provide responses following specific user prompts.
We have experimented with different LLMs for different use cases in Open Arena, including Flan-T5-XL, Open Assistant, MPT, and Falcon, and have fine-tuned Flan-T5-XL on available open-source datasets using parameter-efficient fine-tuning. We used the bitsandbytes integration from Hugging Face to experiment with various quantization techniques. This allowed us to optimize our LLMs for enhanced performance and efficiency, paving the way for even greater innovation. When selecting a model as the backend for these use cases, we considered different aspects, such as how these models perform on NLP tasks that are relevant to Thomson Reuters. Furthermore, we needed to consider engineering aspects, such as the following:
- Increased efficiency when building applications with LLMs – Quickly integrating and deploying state-of-the-art LLMs into our applications and workloads that run on AWS, using familiar controls and integrations with the depth and breadth of AWS
- Secure customization – Ensuring that all data used to fine-tune LLMs remains encrypted and does not leave the VPC
- Flexibility – The ability to choose from a wide selection of AWS native and open-source LLMs to find the right model for our varied use cases
We’ve been asking questions such as: Is the higher cost of larger models justified by significant performance gains? Can these models handle long documents?
The following diagram illustrates our model architecture.
We have been evaluating these models on the preceding aspects on open-source legal datasets and Thomson Reuters internal datasets to assess them for specific use cases.
For content-based use cases (experiences that call for answers from a specific corpus), we have a retrieval augmented generation (RAG) pipeline in place, which fetches the content most relevant to the query. In such pipelines, documents are split into chunks, and embeddings are then created and stored in OpenSearch. To get the best-matching documents or chunks, we use a retrieve-and-re-rank approach based on bi-encoder and cross-encoder models. The best match retrieved is then passed to the LLM as input along with the query to generate the best response.
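The two-stage retrieval described above can be sketched in plain Python. This is a toy illustration, not the Open Arena pipeline: a bag-of-words vector stands in for the bi-encoder embedding, a term-coverage score stands in for the cross-encoder, an in-memory list stands in for the OpenSearch index, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy bag-of-words vector standing in for a bi-encoder embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rerank_score(query, passage):
    """Toy stand-in for a cross-encoder: fraction of query terms covered."""
    q_terms = set(embed(query))
    return len(q_terms & set(embed(passage))) / len(q_terms) if q_terms else 0.0


def retrieve(query, chunks, top_k=3, top_n=1):
    """Stage 1: vector similarity over all chunks. Stage 2: re-rank the top_k."""
    index = [(c, embed(c)) for c in chunks]  # plays the role of the vector index
    q_vec = embed(query)
    candidates = sorted(index, key=lambda item: cosine(q_vec, item[1]),
                        reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda item: rerank_score(query, item[0]),
                      reverse=True)
    return [c for c, _ in reranked[:top_n]]


def build_prompt(query, passages):
    """Combine the retrieved passages and the query into the LLM input."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The split into two stages reflects a cost trade-off: the first stage scores every chunk independently and cheaply, while the second stage scores each query and passage pair jointly and is therefore applied only to the short candidate list.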
The integration of Thomson Reuters’s internal content with the LLM experience has been instrumental in enabling users to extract more relevant and insightful results from these models. More importantly, it led to sparking ideas amongst every team for possibilities of adopting AI-enabled solutions in their business workflows.
Open Arena tiles: Facilitating user interaction
Open Arena adopts a user-friendly interface, designed with pre-set enabling tiles for each experience, as shown in the following screenshot. These tiles serve as pre-set interactions that cater to the specific requirements of the users.
For instance, the Experiment with Open Source LLM tile opens a chat-like interaction channel with open-source LLMs.
The Ask your Document tile allows users to upload documents and ask the LLMs specific questions about the content. The Experiment with Summarization tile enables users to distill large volumes of text into concise summaries, as shown in the following screenshot.
These tiles simplify the user consumption of AI-enabled work solutions and the navigation process within the platform, igniting creativity and fostering the discovery of innovative use cases.
The impact of the Open Arena
The launch of the Open Arena marked a significant milestone in Thomson Reuters’s journey towards fostering a culture of innovation and collaboration. The platform’s success was undeniable, with its benefits becoming rapidly evident across the company.
The Open Arena’s intuitive, chat-based design required no significant technical knowledge, making it accessible to different teams and different job roles across the globe. This ease of use boosted engagement levels, encouraging more users to explore the platform and unveiling innovative use cases.
In under a month, the Open Arena catered to over 1,000 monthly internal users from TR’s global footprint, averaging an interaction time of 5 minutes per user. With a goal to foster internal TR LLM experimentation and crowdsource creation of LLM use cases, Open Arena’s launch led to an influx of new use cases, effectively harnessing the power of LLMs combined with Thomson Reuters’s vast data resources.
Here’s what some of our users had to say about the Open Arena:
“Open Arena gives employees from all parts of the company a chance to experiment with LLMs in a practical, hands-on way. It’s one thing to read about AI tools, and another to use them yourself. This platform turbo-charges our AI learning efforts across Thomson Reuters.”
– Abby Pinto, Talent Development Solutions Lead, People Function
“OA (Open Arena) has enabled me to experiment with tricky news translation problems for the German Language Service of Reuters that conventional translation software can’t handle, and to do so in a safe environment where I can use our actual stories without fear of data leaks. The team behind OA has been incredibly responsive to suggestions for new features, which is the sort of service you can only dream of with other software.”
– Scot W. Stevenson, Senior Breaking News Correspondent for the German Language Service, Berlin, Germany
“When I used Open Arena, I got the idea to build a similar interface for our teams of customer support agents. This playground helped us reimagine the possibilities with GenAI.”
– Marcel Batista, Gerente de Servicos, Operations Customer Service & Support
“Open Arena powered by AWS serverless services, Amazon SageMaker, and Hugging Face helped us to quickly expose cutting-edge LLMs and generative AI tooling to our colleagues, which helped drive enterprise-wide innovation.”
– Shirsha Ray Chaudhuri, Director, Research Engineering, Thomson Reuters Labs
On a broader scale, the introduction of the Open Arena had a profound impact on the company. It not only increased AI awareness among employees but also stimulated a spirit of innovation and collaboration. The platform brought teams together to explore, experiment, and generate ideas, fostering an environment where groundbreaking concepts could be turned into reality.
Furthermore, the Open Arena has had a positive influence on Thomson Reuters AI services and products. The platform has served as a sandbox for AI, allowing teams to identify and refine AI applications before incorporating them into our offerings. Consequently, this has accelerated the development and enhancement of Thomson Reuters AI services, providing customers with solutions that are ever evolving and at the forefront of technological advancement.
Conclusion
In the fast-paced world of AI, it is crucial to continue advancing, and Thomson Reuters is committed to doing just that. The team behind the Open Arena is constantly working to add more features and enhance the platform’s capabilities, using AWS services like Amazon Bedrock and Amazon SageMaker JumpStart, ensuring that it remains a valuable resource for our teams. As we move forward, we aim to keep pace with the rapidly evolving landscape of generative AI and LLMs. AWS provides the services needed for TR to keep pace with the constantly evolving generative AI field.
In addition to the ongoing development of the Open Arena platform, we are actively working on productionizing the multitude of use cases generated by the platform. This will allow us to provide our customers with even more advanced and efficient AI solutions, tailored to their specific needs. Furthermore, we will continue to foster a culture of innovation and collaboration, enabling our teams to explore new ideas and applications for AI technology.
As we embark on this exciting journey, we are confident that the Open Arena will play a pivotal role in driving innovation and collaboration across Thomson Reuters. By staying at the forefront of AI advancements, we will ensure that our products and services continue to evolve and meet the ever-changing demands of our customers.
About the Authors
Shirsha Ray Chaudhuri (Director, Research Engineering) heads the ML Engineering team in Bangalore for Thomson Reuters Labs, where she is leading the development and deployment of well-architected solutions in AWS and other cloud platforms for ML projects that drive efficiency and value for AI-driven features in Thomson Reuters products, platforms, and business systems. She works with communities on AI for good, societal impact projects and in the tech for D&I space. She loves to network with people who are using AI and modern tech to build a world that is more inclusive and more digital and, together, a better tomorrow.
Harpreet Singh Baath is a Senior Cloud and DevOps Engineer at Thomson Reuters Labs, where he helps research engineers and scientists develop machine learning solutions on cloud platforms. With over 6 years of experience, Harpreet’s expertise spans across cloud architectures, automation, containerization, enabling DevOps practices, and cost optimization. He is passionate about efficiency and cost-effectiveness, ensuring that cloud resources are utilized optimally.
Rashmi B Pawar is a Machine Learning Engineer at Thomson Reuters. She possesses considerable experience in productionizing models, establishing inference, and creating training pipelines tailored for various machine learning applications. Furthermore, she has significant expertise in incorporating machine learning workflows into existing systems and products.
Palvika Bansal is an Associate Applied Research Scientist at Thomson Reuters. She has worked on projects across diverse sectors to solve business problems for customers using AI/ML. She is highly passionate about her work and enthusiastic about taking on new challenges. Outside of work, she enjoys traveling, cooking, and reading.
Simone Zucchet is a Senior Solutions Architect at AWS. With close to a decade’s experience as a Cloud Architect, Simone enjoys working on innovative projects that help transform the way organizations approach business problems. He helps support large enterprise customers at AWS and is part of the Machine Learning TFC. Outside of his professional life, he enjoys working on cars and photography.
Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning with a special focus on natural language processing, large language models, and generative AI. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps our customers be successful in their AI/ML journey on AWS and has worked with organizations in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. In his spare time, Heiko travels as much as possible.
João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.
Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in the UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS, with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking, and spending time with friends and family.
Your phone camera can autofocus. Why can’t your specs?
Startup Pixieray is working on a breakthrough in vision correction.
How Amazon Shopping uses Amazon Rekognition Content Moderation to review harmful images in product reviews
Customers are increasingly turning to product reviews to make informed decisions in their shopping journey, whether they’re purchasing everyday items like a kitchen towel or making major purchases like buying a car. These reviews have transformed into an essential source of information, enabling shoppers to access the opinions and experiences of other customers. As a result, product reviews have become a crucial aspect of any store, offering valuable feedback and insights to help inform purchase decisions.
Amazon has one of the largest stores with hundreds of millions of items available. In 2022, 125 million customers contributed nearly 1.5 billion reviews and ratings to Amazon stores, making online reviews at Amazon a solid source of feedback for customers. At the scale of product reviews submitted every month, it is essential to verify that these reviews align with Amazon Community Guidelines regarding acceptable language, words, videos, and images. This practice is in place to guarantee customers receive accurate information regarding the product, and to prevent reviews from including inappropriate language, offensive imagery, or any type of hate speech directed towards individuals or communities. By enforcing these guidelines, Amazon can maintain a safe and inclusive environment for all customers.
Content moderation automation allows Amazon to scale the process while keeping high accuracy. It’s a complex problem space with unique challenges that requires different techniques for text, images, and videos. Images are a relevant component of product reviews, often providing a more immediate impact on customers than text. With Amazon Rekognition Content Moderation, Amazon is able to automatically detect harmful images in product reviews with higher accuracy, reducing reliance on human reviewers to moderate such content. Rekognition Content Moderation has helped to improve the well-being of human moderators and achieve significant cost savings.
Moderation with self-hosted ML models
The Amazon Shopping team designed and implemented a moderation system that uses machine learning (ML) in conjunction with human-in-the-loop (HITL) review to ensure product reviews are about the customer experience with the product and don’t contain inappropriate or harmful content as per the community guidelines. The image moderation subsystem, as illustrated in the following diagram, used multiple self-hosted and self-trained computer vision models to detect images that violated Amazon guidelines. The decision handler determined the moderation action and provided reasons for its decision based on the ML models’ output, thereby deciding whether the image required further review by a human moderator or could be automatically approved or rejected.
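The routing performed by a decision handler of this kind can be sketched as follows. This is only an illustration under assumptions: the thresholds, label names, and the `Decision` type are hypothetical and are not taken from the Amazon Shopping system.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "approve", "reject", or "human_review"
    reason: str


def decide(scores, reject_at=0.9, approve_below=0.2):
    """Route an image based on per-category violation scores in [0, 1].

    The thresholds are hypothetical values chosen for illustration.
    """
    if not scores:
        return Decision("approve", "no violation scores reported")
    # Use the most likely violation category to drive the decision
    label, top = max(scores.items(), key=lambda kv: kv[1])
    if top >= reject_at:
        return Decision("reject", f"{label} score {top:.2f} is at or above {reject_at}")
    if top < approve_below:
        return Decision("approve", f"highest score {top:.2f} ({label}) is below {approve_below}")
    # Anything in between is ambiguous and goes to a human moderator
    return Decision("human_review", f"{label} score {top:.2f} is inconclusive")
```

Raising the automation rate then amounts to shrinking the band of scores that falls between the two thresholds, which is only safe when the underlying models are accurate enough.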
With these self-hosted ML models, the team started by automating decisions on 40% of the images received as part of the reviews and continuously worked on improving the solution through the years while facing several challenges:
- Ongoing efforts to improve automation rate – The team wanted to improve the accuracy of the ML algorithms in order to increase the automation rate. This required continuous investment in data labeling, data science, and MLOps for model training and deployment.
- System complexity – The complexity of the architecture required investment in MLOps to ensure the ML inference process scaled efficiently to meet the growing content submission traffic.
Replace self-hosted ML models with the Rekognition Content Moderation API
Amazon Rekognition is a managed artificial intelligence (AI) service that offers pre-trained models through an API interface for image and video moderation. It has been widely adopted by industries such as ecommerce, social media, gaming, online dating apps, and others to moderate user-generated content (UGC). This includes a range of content types, such as product reviews, user profiles, and social media post moderation.
Rekognition Content Moderation automates and streamlines image and video moderation workflows without requiring ML experience. Amazon Rekognition customers can process millions of images and videos, efficiently detecting inappropriate or unwanted content, with fully managed APIs and customizable moderation rules to keep users safe and the business compliant.
The team successfully migrated a subset of self-managed ML models in the image moderation system for nudity and not safe for work (NSFW) content detection to the Amazon Rekognition Detect Moderation API, taking advantage of the highly accurate and comprehensive pre-trained moderation models. With the high accuracy of Amazon Rekognition, the team has been able to automate more decisions, save costs, and simplify their system architecture.
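As a rough sketch of what a call to this API can look like, the following helper wraps the `detect_moderation_labels` operation of the Rekognition client from boto3. The function name, bucket, key, and the confidence threshold are assumptions made for illustration; the client is passed in so the helper can be exercised without AWS credentials.

```python
def moderate_image(client, bucket, key, min_confidence=60.0):
    """Return (name, parent_name, confidence) for each moderation label found.

    `client` is expected to be a Rekognition client, for example one created
    with boto3.client("rekognition"). The 60.0 threshold is an illustrative
    value, not a recommendation from the post.
    """
    response = client.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    return [
        (label["Name"], label.get("ParentName", ""), label["Confidence"])
        for label in response["ModerationLabels"]
    ]


# Example usage (requires AWS credentials and an image stored in S3):
#   import boto3
#   labels = moderate_image(boto3.client("rekognition"), "my-bucket", "reviews/img.jpg")
#   needs_attention = len(labels) > 0
```

An empty result means no moderation label met the confidence threshold, which is the case a pipeline like the one described here could approve automatically.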
Improved accuracy and expanded moderation categories
The implementation of the Amazon Rekognition image moderation API has resulted in higher accuracy for detection of inappropriate content. This implies that approximately 1 million additional images per year will be automatically moderated without the need for any human review.
Operational excellence
The Amazon Shopping team was able to simplify the system architecture, reducing the operational effort required to manage and maintain the system. This approach has saved them months of DevOps effort per year, which means they can now allocate their time to developing innovative features instead of spending it on operational tasks.
Cost reduction
The high accuracy from Rekognition Content Moderation has enabled the team to send fewer images for human review, including potentially inappropriate content. This has reduced the cost associated with human moderation and allowed moderators to focus their efforts on more high-value business tasks. Combined with the DevOps efficiency gains, the Amazon Shopping team achieved significant cost savings.
Conclusion
Migrating from self-hosted ML models to the Amazon Rekognition Moderation API for product review moderation can provide many benefits for businesses, including significant cost savings. By automating the moderation process, online stores can quickly and accurately moderate large volumes of product reviews, improving the customer experience by ensuring that inappropriate or spam content is quickly removed. Additionally, by using a managed service like the Amazon Rekognition Moderation API, companies can reduce the time and resources needed to develop and maintain their own models, which can be especially useful for businesses with limited technical resources. The API’s flexibility also allows online stores to customize their moderation rules and thresholds to fit their specific needs.
Learn more about content moderation on AWS and our content moderation ML use cases. Take the first step towards streamlining your content moderation operations with AWS.
About the Authors
Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.
Luca Agostino Rubino is a Principal Software Engineer in the Amazon Shopping team. He works on Community features like Customer Reviews and Q&As, focusing through the years on Content Moderation and on scaling and automation of Machine Learning solutions.
Lana Zhang is a Senior Solutions Architect at AWS WWSO AI Services team, specializing in AI and ML for Content Moderation, Computer Vision, Natural Language Processing and Generative AI. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, e-commerce, media, advertising & marketing.
Intelligent video and audio Q&A with multilingual support using LLMs on Amazon SageMaker
Digital assets are vital visual representations of products, services, culture, and brand identity for businesses in an increasingly digital world. Digital assets, together with recorded user behavior, can facilitate customer engagement by offering interactive and personalized experiences, allowing companies to connect with their target audience on a deeper level. Efficiently discovering and searching for specific content within digital assets is crucial for businesses to optimize workflows, streamline collaboration, and deliver relevant content to the right audience. According to a study, by 2021, videos already made up 81% of all consumer internet traffic. This observation comes as no surprise because video and audio are powerful mediums that offer more immersive experiences and naturally engage target audiences on a higher emotional level.
As companies accumulate large volumes of digital assets, it becomes more challenging to organize and manage them effectively to maximize their value. Traditionally, companies attach metadata, such as keywords, titles, and descriptions, to these digital assets to facilitate search and retrieval of relevant content. But this requires a well-designed digital asset management system and additional efforts to store these assets in the first place. In reality, most of the digital assets lack informative metadata that enables efficient content search. Additionally, you often need to do an analysis of different segments of the whole file and discover the concepts that are covered there. This is time consuming and requires a lot of manual effort.
Generative AI, particularly in the realm of natural language processing and understanding (NLP and NLU), has revolutionized the way we comprehend and analyze text, enabling us to gain deeper insights efficiently and at scale. The advancements in large language models (LLMs) have led to richer representations of texts, which provide better search capabilities for digital assets. Retrieval Augmented Generation (RAG), built on top of LLMs and advanced prompt techniques, is a popular approach to provide more accurate answers based on information hidden in the enterprise digital asset store. By taking advantage of embedding models of LLMs, and powerful indexers and retrievers, RAG can comprehend and process spoken or written queries and quickly find the most relevant information in the knowledge base. Previous studies have shown how RAG can be applied to provide a Q&A solution connecting with an enterprise’s private domain knowledge. However, among all types of digital assets, video and audio assets are the most common and important.
The RAG-based video/audio question answering solution can potentially solve business problems of locating training and reference materials that are in the form of non-text content. With limited tags or metadata associated with these assets, the solution lets users interact with the chatbot and get answers to their queries, which could be links to specific video training (“I need link to Amazon S3 data storage training”), links to documents (“I need link to learn about machine learning”), or questions that were covered in the videos (“Tell me how to create an S3 bucket”). The response from the chatbot will be able to directly answer the question and also include the links to the source videos with the specific timestamp of the contents that are most relevant to the user’s request.
In this post, we demonstrate how to use the power of RAG in building a Q&A solution for video and audio assets on Amazon SageMaker.
Solution overview
The following diagram illustrates the solution architecture.
The workflow mainly consists of the following stages:
- Convert video to text with a speech-to-text model, then align the text with the videos and organize it. We store the data in Amazon Simple Storage Service (Amazon S3).
- Enable intelligent video search using a RAG approach with LLMs and LangChain. Users can get answers generated by LLMs and relevant sources with timestamps.
- Build a multi-functional chatbot using LLMs with SageMaker, where the two aforementioned solutions are wrapped and deployed.
For a detailed implementation, refer to the GitHub repo.
Prerequisites
You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker processing and hosting instances. For preprocessing the video data, we use an ml.p3.2xlarge SageMaker processing instance. For hosting Falcon-40B, we use an ml.g5.12xlarge SageMaker hosting instance.
Convert video to text with a speech-to-text model and sentence embedding model
To be able to search through video or audio digital assets and provide contextual information from videos to LLMs, we need to convert all the media content to text and then follow the general approaches in NLP to process the text data. To make our solution more flexible to handle different scenarios, we provide the following options for this task:
- Amazon Transcribe and Amazon Translate – If each video and audio file only contains one language, we highly recommend that you choose Amazon Transcribe, which is an AWS managed service to transcribe audio and video files. If you need to translate them into the same language, Amazon Translate is another AWS managed service, which supports multilingual translation.
- Whisper – In real-world use cases, video data may include multiple languages, such as foreign language learning videos. Whisper is a multitasking speech recognition model that can perform multilingual speech recognition, speech translation, and language identification. You can use a Whisper model to detect and transcribe different languages on video data, and then translate all the different languages into one language. It’s important for most RAG solutions to run on the knowledge base with the same language. Even though OpenAI provides the Whisper API, for this post, we use the Whisper model from Hugging Face.
We run this task with an Amazon SageMaker Processing job on existing data. You can refer to data_preparation.ipynb for the details of how to run this task.
Convert video data to audio data
Because Amazon Transcribe can handle both video and audio data and the Whisper model can only accept audio data, to make both options work, we need to convert video data to audio data. In the following code, we use VideoFileClip from the library moviepy to run this job:
from moviepy.editor import VideoFileClip
video = VideoFileClip(video_path)
video.audio.write_audiofile(audio_path)
Transcribe audio data
When the audio data is ready, we can choose from our two transcribing options. You can choose the optimal option based on your own use case with the criteria we mentioned earlier.
Option 1: Amazon Transcribe and Amazon Translate
The first option is to use Amazon AI services, such as Amazon Transcribe and Amazon Translate, to get the transcriptions of the video and audio datasets. You can refer to the following GitHub example when choosing this option.
Option 2: Whisper
A Whisper model can handle audio data up to 30 seconds in duration. To handle large audio data, we adopt transformers.pipeline to run inference with Whisper. When searching relevant video clips or generating contents with RAG, timestamps for the relevant clips are important references. Therefore, we turn return_timestamps on to get outputs with timestamps. By setting the parameter language in generate_kwargs, all the different languages in one video file are transcribed and translated into the same language. stride_length_s is the length of stride on the left and right of each chunk. With this parameter, we can make the Whisper model see more context when doing inference on each chunk, which will lead to a more accurate result. See the following code:
from transformers import pipeline
import torch
target_language = "en"
whisper_model = "whisper-large-v2"
# Use the first GPU when one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model=f"openai/{whisper_model}",
    device=device
)
generate_kwargs = {"task":"transcribe", "language":f"<|{target_language}|>"}
# file_path is the path of the audio file to transcribe
prediction = pipe(
    file_path,
    return_timestamps=True,
    chunk_length_s=30,
    stride_length_s=(5),
    generate_kwargs=generate_kwargs
)
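To make the effect of chunk_length_s and stride_length_s more concrete, the following sketch computes overlapping windows over a long recording. It is a simplified illustration of the chunk-and-stride idea, not the library's actual implementation; the function name and the tuple layout are assumptions.

```python
def chunk_windows(total_s, chunk_s=30, stride_s=5):
    """Return (start, end, keep_start, keep_end) windows, all in whole seconds.

    Each window spans up to chunk_s seconds of audio. Neighbouring windows
    overlap by stride_s seconds on each side, and only the [keep_start,
    keep_end] part of a window contributes to the final transcript.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than twice stride_s")
    windows = []
    for start in range(0, total_s, step):
        end = min(start + chunk_s, total_s)
        is_last = start + chunk_s >= total_s
        # The first window has no left context; the last has no right context
        keep_start = start if start == 0 else start + stride_s
        keep_end = end if is_last else end - stride_s
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
    return windows


windows = chunk_windows(70)
```

With a 70-second file, the kept regions of the three windows tile the recording without gaps, while the model still sees 5 seconds of extra context on each inner boundary.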
The output of pipe is a dictionary with the items text and chunks. text contains the entire transcribed result, and chunks consists of chunks with the timestamp and corresponding transcribed result (see the following screenshot). We use data in chunks to do further processing.
As the preceding screenshot shows, a lot of sentences have been cut off and split into different chunks. To make the chunks more meaningful, we need to combine the sentences that were cut off and update the timestamps in the next step.
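For readers without access to the screenshot, the following sketch shows the general shape of such an output and how the chunks can be listed with their timestamps. The sample values are invented for illustration, and `format_chunks` is a hypothetical helper.

```python
# Hand-written sample in the same shape as the pipeline output (values are made up)
prediction = {
    "text": " Hello and welcome. Today we cover S3",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hello and welcome."},
        {"timestamp": (2.5, 5.0), "text": " Today we cover S3"},
    ],
}


def format_chunks(prediction):
    """Render each chunk as '[start - end] text' for quick inspection."""
    lines = []
    for chunk in prediction["chunks"]:
        start, end = chunk["timestamp"]
        # The end timestamp can be missing for a final, unfinished segment
        end_txt = "?" if end is None else f"{end:.1f}s"
        lines.append(f"[{start:.1f}s - {end_txt}]{chunk['text']}")
    return lines


for line in format_chunks(prediction):
    print(line)
```

Note that the second sample chunk does not end with a period, which is exactly the kind of cut-off sentence that the next step has to repair.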
Organize sentences
We use a very simple rule to combine sentences. When the chunk ends with a period (.), we don’t make any change; otherwise, we concatenate it with the next chunk. The following code snippet explains how we make this change:
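prev_chunk = None
new_chunks = []
for chunk in chunks:
    # Prepend any unfinished text carried over from the previous chunk(s)
    if prev_chunk:
        chunk['text'] = prev_chunk['text'] + chunk['text']
        chunk['timestamp'] = (prev_chunk['timestamp'][0], chunk['timestamp'][1])
    if not chunk['text'].endswith('.'):
        # The sentence is still incomplete, so carry it forward
        prev_chunk = chunk
    else:
        new_chunks.append(chunk)
        prev_chunk = None
# Keep a trailing fragment that never reached a period
if prev_chunk:
    new_chunks.append(prev_chunk)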
Compared to the original chunks produced by the audio-to-text conversion, we now get complete sentences where the originals were cut off.
Chunk sentences
The text content in documents is normally organized by paragraph. Each paragraph focuses on the same topic. Chunking by paragraph may help embed texts into more meaningful vectors, which may improve retrieval accuracy.
Unlike the normal text content in documents, transcriptions from the transcription model are not paragraphed. Even though there are some pauses in the audio files, they can’t always be used to split sentences into paragraphs. On the other hand, langchain provides the recursive chunking text splitter function RecursiveCharacterTextSplitter, which can keep all the semantically relevant content in the same chunk. Because we need to keep timestamps with chunks, we implement our own chunking process. Inspired by the post How to chunk text into paragraphs using python, we chunk sentences based on the similarity between the adjacent sentences with a sentence embedding approach. The basic idea is to take the sentences with the lowest similarity to adjacent sentences as the split points. We use all-MiniLM-L6-v2 for sentence embedding. You can refer to the original post for the explanation of this approach. We have made some minor changes to the original source code; refer to our source code for the implementation. The core part of this process is as follows:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Embed sentences
model_name = "all-minilm-l6-v2"
model = SentenceTransformer(model_name)
embeddings = model.encode(sentences_all)
# Create similarities matrix
similarities = cosine_similarity(embeddings)
# Let's apply our function (activate_similarities is defined in our source code). For long sentences, I recommend using 10 or more sentences
minmimas = activate_similarities(similarities, p_size=p_size, order=order)
# Sentence indexes at which a new paragraph starts
split_points = [each for each in minmimas[0]]
# Accumulate sentences into paragraphs and track each paragraph's timestamps
text = ''
para_chunks = []
para_timestamp = []
start_timestamp = 0
for num, each in enumerate(sentences_all):
    current_timestamp = timestamps_all[num]
    if text == '' and (start_timestamp == current_timestamp[1]):
        start_timestamp = current_timestamp[0]
    if num in split_points:
        # Close the current paragraph and start a new one with this sentence
        para_chunks.append(text)
        para_timestamp.append([start_timestamp, current_timestamp[1]])
        text = f'{each}. '
        start_timestamp = current_timestamp[1]
    else:
        text+=f'{each}. '
# Flush the last paragraph
if len(text):
    para_chunks.append(text)
    para_timestamp.append([start_timestamp, timestamps_all[-1][1]])
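The idea of splitting where adjacent sentences are least similar can be illustrated with a much simpler stand-in. The sketch below is not the activate_similarities function from the source code: it only compares each sentence with its immediate neighbor and treats local minima of that similarity as split points. The toy two-dimensional embeddings and the index convention are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_points(embeddings):
    """Return sentence indexes at which a new paragraph would start.

    sims[i] is the similarity between sentence i and sentence i + 1. A local
    minimum of sims marks a topic change, so the paragraph break is placed
    before sentence i + 1.
    """
    sims = [cosine(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]
    points = []
    for i in range(1, len(sims) - 1):
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]:
            points.append(i + 1)
    return points


# Toy embeddings: sentences 0-2 share one topic, sentences 3-5 another
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]]
toy_splits = split_points(toy_embeddings)
```

On the toy data, the only sharp drop in similarity is between sentences 2 and 3, so a single split point is found there. Real transcripts are noisier, which is why a smoothed similarity over a window of sentences is a more robust choice than this neighbor-only comparison.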
To evaluate the effectiveness of chunking with sentence embedding, we conducted qualitative comparisons between different chunking mechanisms. The assumption underlying such comparisons is that if the chunked texts are more semantically distinct, less irrelevant contextual information is retrieved for the Q&A, so the answer is more accurate and precise. At the same time, because less contextual information is sent to the LLM, the cost of inference is also lower, because charges increase with the number of tokens.
We projected the high-dimensional chunk embeddings onto the first two principal components of a PCA and visualized them. Compared to recursive chunking, the vectors representing chunks produced with sentence embedding are more scattered, meaning the chunks are more semantically separate. This means that when the vector of a query is close to the vector of one chunk, it is less likely to be close to other chunks, so the retrieval task is less likely to have to choose among multiple semantically similar chunks.
When the chunking process is complete, we attach timestamps to the file name of each chunk, save it as a single file, and then upload it to an S3 bucket.
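One way to carry timestamps in file names is sketched below. The naming pattern (`<video id>_<start>_<end>.txt`) and the helper names are assumptions for illustration, because the exact format used in the repository is not described here. The upload to Amazon S3 is left out so that the sketch runs locally.

```python
import os


def chunk_file_name(video_id, start_s, end_s):
    """Build a file name that encodes the chunk's start and end time in seconds."""
    return f"{video_id}_{start_s:.2f}_{end_s:.2f}.txt"


def parse_chunk_file_name(name):
    """Recover (video_id, start_s, end_s) from a name built by chunk_file_name."""
    stem = name[: -len(".txt")]
    # Split from the right so that underscores inside the video id are preserved
    video_id, start, end = stem.rsplit("_", 2)
    return video_id, float(start), float(end)


def save_chunks(out_dir, video_id, para_chunks, para_timestamp):
    """Write each paragraph chunk to its own text file and return the paths."""
    paths = []
    for text, (start_s, end_s) in zip(para_chunks, para_timestamp):
        path = os.path.join(out_dir, chunk_file_name(video_id, start_s, end_s))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths
```

Because the file name travels with the document as its source metadata, the timestamps can later be recovered from a retrieved chunk and turned into a link to the right position in the video.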
Enable intelligent video search using a RAG-based approach with LangChain
There are typically four approaches to build a RAG solution for Q&A with LangChain:
- Using the load_qa_chain functionality, which feeds all information to an LLM. This is not an ideal approach given the context window size and the volume of video and audio data.
- Using the RetrievalQA tool, which requires a text splitter, text embedding model, and vector store to process texts and retrieve relevant information.
- Using VectorstoreIndexCreator, which is a wrapper around all logic in the second approach. The text splitter, text embedding model, and vector store are configured together inside the function at one time.
- Using the ConversationalRetrievalChain tool, which further adds memory of chat history to the QA solution.
For this post, we use the second approach to explicitly customize and choose the best engineering practices. In the following sections, we describe each step in detail.
To search for the relevant content based on the user input queries, we use semantic search, which can better understand the intent behind the query and perform meaningful retrieval. We first use a pre-trained embedding model to embed all the transcribed text into a vector space. At search time, the query is also embedded into the same vector space and the closest embeddings from the source corpus are found. You can deploy the pre-trained embedding model as shown in Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart to create the embeddings for semantic search. In our post, we adopt similar ways to create an intelligent video search solution using a RAG-based approach with the open-source LangChain library. LangChain is an open-source framework for developing applications powered by language models. LangChain provides a generic interface for many different LLMs.
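The embed-then-find-nearest step can be illustrated without any model at all. In the sketch below, the three-dimensional vectors are hand-made stand-ins for real embeddings, the document names are hypothetical, and the brute-force scan stands in for a vector store.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query_vec, corpus, k=2):
    """Return the k (name, distance) pairs closest to the query vector."""
    scored = [(name, l2_distance(query_vec, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy corpus: each transcript chunk is represented by a made-up embedding
corpus = {
    "s3_intro": [0.9, 0.1, 0.0],
    "ml_basics": [0.1, 0.9, 0.0],
    "billing": [0.0, 0.1, 0.9],
}
# Made-up embedding of a query such as "how do I create a bucket"
query_vec = [0.8, 0.2, 0.0]
top_matches = search(query_vec, corpus)
```

A real deployment replaces the toy vectors with the output of the embedding endpoint and the linear scan with an index, but the ranking principle is the same.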
We first deploy an embedding model GPT-J 6B provided by Amazon SageMaker JumpStart and the language model Falcon-40B Instruct from Hugging Face to prepare for the solution. When the endpoints are ready, we follow steps similar to those described in Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart to create the LLM model and embedding model for LangChain.
The following code snippet shows how to create the LLM model using the langchain.llms.sagemaker_endpoint.SagemakerEndpoint class and transform the request and response payload for the LLM in the ContentHandler:
import json
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
parameters = {
    "max_new_tokens": 500,
}
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        # Remember the prompt length so it can be stripped from the generated text
        self.len_prompt = len(prompt)
        input_str = json.dumps({"inputs": prompt , "parameters": {**model_kwargs}})
        return input_str.encode("utf-8")
    def transform_output(self, output: bytes) -> str:
        # Read the endpoint response body and parse it as JSON
        response_json = output.read()
        res = json.loads(response_json)
        print(res)
        # Drop the echoed prompt and keep only the newly generated text
        ans = res[0]['generated_text'][self.len_prompt:]
        return ans
content_handler = ContentHandler()
sm_llm = SagemakerEndpoint(
    endpoint_name=_MODEL_CONFIG_["huggingface-falcon-40b"]["endpoint_name"],
    region_name=aws_region,
    model_kwargs=parameters,
    content_handler=content_handler,
)
When we use a SageMaker JumpStart embedding model, we need to customize the LangChain SageMaker endpoint embedding class and transform the model request and response to integrate with LangChain. Load the processed video transcripts using the LangChain document loader and create an index.
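We use the DirectoryLoader class in LangChain to load the text documents into the document loader:
from langchain.document_loaders import DirectoryLoader
# Recursively load every .txt transcript under the data directory
loader = DirectoryLoader("./data/demo-video-sagemaker-doc/", glob="**/*.txt")
documents = loader.load()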
Next, we use the embedding models to create the embeddings of the contents and store the embeddings in a FAISS vector store to create an index. We use this index to find relevant documents that are semantically similar to the input query. With the VectorstoreIndexCreator class, you can just write a few lines of code to achieve this task:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=500, chunk_overlap=0),
)
index = index_creator.from_loaders([loader])
Now we can use the index to search for relevant context and pass it to the LLM model to generate an accurate response:
index.query(question=question, llm=sm_llm)
Build a multi-functional chatbot with SageMaker
With the deployed LLM on SageMaker, we can build a multi-functional smart chatbot to show how these models can help your business build advanced AI-powered applications. In this example, the chatbot uses Streamlit to build the UI and the LangChain framework to chain together different components around LLMs. With the help of the text-to-text and speech-to-text LLMs deployed on SageMaker, this smart chatbot accepts both text files and audio files as input, so users can chat with the uploaded files and build further applications on top of this. The following diagram shows the architecture of the chatbot.
When a user uploads a text file to the chatbot, the chatbot puts the content into the LangChain memory component and the user can chat with the uploaded document. This part is inspired by the following GitHub example that builds a document chatbot with SageMaker. We also add an option to allow users to upload audio files. Then the chatbot automatically invokes the speech-to-text model hosted on the SageMaker endpoint to extract the text content from the uploaded audio file and add the text content to the LangChain memory. Lastly, we allow the user to select the option to use the knowledge base when answering questions. This is the RAG capability shown in the preceding diagram. We have defined the SageMaker endpoints that are deployed in the notebooks provided in the previous sections. Note that you need to pass the actual endpoint names that are shown in your account when running the Streamlit app. You can find the endpoint names on the SageMaker console under Inference and Endpoints.
Falcon_endpoint_name = os.getenv("falcon_ep_name", default="falcon-40b-instruct-12xl")
whisper_endpoint_name = os.getenv('wp_ep_name', default="whisper-large-v2")
embedding_endpoint_name = os.getenv('embed_ep_name', default="huggingface-textembedding-gpt-j-6b")
When the knowledge base option is not selected, we use the conversation chain, where we add the memory component using the ConversationBufferMemory provided by LangChain, so the bot can remember the current conversation history:
def load_chain():
    memory = ConversationBufferMemory(return_messages=True)
    chain = ConversationChain(llm=llm, memory=memory)
    return chain
chatchain = load_chain()
We use similar logic as shown in the earlier section for the RAG component and add the document retrieval function to the code. For demo purposes, we load the transcribed text stored in SageMaker Studio local storage as a document source. You can implement other RAG solutions using the vector databases based on your choice, such as Amazon OpenSearch Service, Amazon RDS, Amazon Kendra, and more.
When users use the knowledge base for the question, the following code snippet retrieves the relevant contents from the database and provides additional context for the LLM to answer the question. We used the specific method provided by FAISS, similarity_search_with_score, when searching for relevant documents. This is because it can also provide the metadata and similarity score of the retrieved source file. The returned distance score is L2 distance. Therefore, a lower score is better. This gives us more options to provide more context for the users, such as providing the exact timestamps of the source videos that are relevant to the input query. When the RAG option is selected by the user from the UI, the chatbot uses the load_qa_chain function provided by LangChain to provide the answers based on the input prompt.
docs = docsearch.similarity_search_with_score(user_input)
contexts = []
source = []  # file names of the documents kept as context
for doc, score in docs:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
    # Scores are L2 distances, so keep only sufficiently close matches
    if score <= 0.9:
        contexts.append(doc)
        source.append(doc.metadata['source'].split('/')[-1])
print(f"\n INPUT CONTEXT:{contexts}")
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.:\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=llm, prompt=PROMPT)
result = chain({"input_documents": contexts, "question": user_input},
               return_only_outputs=True)["output_text"]
# Show the source file names that contributed to the answer
if len(source) != 0:
    df = pd.DataFrame(source, columns=['knowledge source'])
    st.data_editor(df)
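The distance-threshold filter used above can be sketched in isolation. The example below uses plain Euclidean distance on made-up two-dimensional vectors purely for illustration; the function names and data are hypothetical, and the scale of the scores a real vector store returns depends on the index type and on whether the embeddings are normalized, so a cutoff such as 0.9 should be tuned against your own data.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_by_distance(
    query: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.9,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs with score <= threshold, closest first."""
    scored = [(name, l2_distance(query, vec)) for name, vec in docs]
    kept = [(name, score) for name, score in scored if score <= threshold]
    return sorted(kept, key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up embeddings; real ones have hundreds or thousands of dimensions.
    documents = [
        ("far.txt", (3.0, 4.0)),
        ("near.txt", (0.5, 0.0)),
        ("close.txt", (0.0, 0.75)),
    ]
    for name, score in filter_by_distance((0.0, 0.0), documents):
        print(f"{name}: {score}")
```

A lower threshold returns fewer but more relevant passages, while a higher one gives the LLM more context at the risk of including unrelated text.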
Run the chatbot app
Now we’re ready to run the Streamlit app. Open a terminal in SageMaker Studio and navigate to the cloned GitHub repository folder. Install the required Python packages specified in the requirements.txt file by running pip install -r requirements.txt.
Then run the following commands to set the environment variables to the names of the endpoints deployed in your account and start the app. When you run the chatbot.py file, it reads the endpoint names from these environment variables.
export falcon_ep_name=<the falcon endpoint name deployed in your account>
export wp_ep_name=<the whisper endpoint name deployed in your account>
export embed_ep_name=<the embedding endpoint name deployed in your account>
streamlit run app_chatbot/chatbot.py --server.port 6006 --server.maxUploadSize 6
To access the Streamlit UI, copy your SageMaker Studio URL and replace lab? with proxy/[PORT NUMBER]/. For this post, we specified the server port as 6006, so the URL should look like https://<domain ID>.studio.<region>.sagemaker.aws/jupyter/default/proxy/6006/.
Replace domain ID and region with the correct values for your account to access the UI.
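The URL rewrite can also be done programmatically. The following is a small sketch under the assumption that your Studio URL ends in /lab or /lab? as described above; the function name studio_proxy_url is hypothetical, and the domain ID and Region in the usage example are made-up placeholders.

```python
def studio_proxy_url(studio_url: str, port: int) -> str:
    """Replace the trailing lab segment of a Studio URL with proxy/<port>/."""
    base, separator, _ = studio_url.rpartition("/lab")
    if not separator:
        raise ValueError("Expected a SageMaker Studio URL containing '/lab'")
    return f"{base}/proxy/{port}/"


if __name__ == "__main__":
    # Placeholder domain ID and Region; substitute the values from your account.
    url = "https://d-abc123.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
    print(studio_proxy_url(url, 6006))
```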
Chat with your audio file
In the Conversation setup pane, choose Browse files to select local text or audio files to upload to the chatbot. If you select an audio file, the chatbot automatically invokes the speech-to-text SageMaker endpoint to process the file and presents the transcribed text in the console, as shown in the following screenshot. You can then keep asking questions about the audio file; the chatbot remembers the transcribed content and bases its responses on it.
Use the knowledge base for the Q&A
When you want to answer questions that require specific domain knowledge, select Use knowledge base. This lets the chatbot retrieve relevant information from the knowledge base built earlier (the vector database) and add it as context for answering the question. For example, when we ask the chatbot the question “what is the recommended way to first customize a foundation model?” without the knowledge base, it returns an answer similar to the following screenshot.
When we use the knowledge base to help answer this question, the chatbot returns a different response. In the demo video, we read the SageMaker document about how to customize a model in SageMaker JumpStart.
The output also provides the original video file name with the retrieved timestamp of the corresponding text. Users can go back to the original video file and locate the specific clips in the original videos.
This example chatbot demonstrates how businesses can use various types of digital assets to enhance their knowledge base and provide multi-functional assistance to their employees to improve productivity and efficiency. You can build the knowledge database from documents, audio and video datasets, and even image datasets to consolidate all the resources together. With SageMaker serving as an advanced ML platform, you can accelerate the path from project ideation to production using the breadth and depth of the SageMaker services that cover the whole ML lifecycle.
Clean up
To save costs, delete all the resources you deployed as part of the post. You can follow the provided notebook’s cleanup section to programmatically delete the resources, or you can delete any SageMaker endpoints you may have created via the SageMaker console.
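As a rough sketch of programmatic cleanup (separate from the cleanup section in the provided notebooks, which remains the reference), the following helper deletes an endpoint together with its endpoint configuration and any models the configuration references. The function name is hypothetical; it expects a SageMaker client such as the one returned by boto3.client("sagemaker") and uses the describe_endpoint, describe_endpoint_config, delete_endpoint, delete_endpoint_config, and delete_model operations.

```python
from typing import Any, List


def delete_endpoint_and_dependencies(sm_client: Any, endpoint_name: str) -> List[str]:
    """Delete an endpoint, its endpoint config, and the models the config references.

    Returns the names of the deleted resources in deletion order.
    """
    # Look up dependent resources before anything is deleted.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return [endpoint_name, config_name, *model_names]
```

To use it, create a client with boto3.client("sagemaker") and call the helper once for each of the three endpoint names deployed in your account. Deletion is irreversible, so confirm the names on the SageMaker console first.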
Conclusion
The advent of generative AI models powered by LLMs has revolutionized the way businesses acquire and apply insights from information. Within this context, digital assets, including video and audio content, play a pivotal role as visual representations of products, services, and brand identity. Efficiently searching and discovering specific content within these assets is vital for optimizing workflows, enhancing collaboration, and delivering tailored experiences to the intended audience. With the power of generative AI models on SageMaker, businesses can unlock the full potential of their video and audio resources. The integration of generative AI models empowers enterprises to build efficient and intelligent search solutions, enabling users to access relevant and contextual information from their digital assets, and thereby maximizing their value and fostering business success in the digital landscape.
For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS.
About the authors
Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.
Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.
Guang Yang is a Senior Applied Scientist at the Amazon Generative AI Innovation Center, where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art generative AI solutions.
Harjyot Malik is a Senior Program Manager at AWS based in Sydney, Australia. He works with the APJC Enterprise Support teams and helps them build and deliver strategies. He collaborates with business teams, delving into complex problems to unearth innovative solutions that in return drive efficiencies for the business. In his spare time, he loves to travel and explore new places.