Amazon AWS – Page 146

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 3: Processing and Data Wrangler jobs

May 30, 2023

by Deepali Rajale Amazon AWS

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we’ve helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In this post, we focus on data preprocessing using Amazon SageMaker Processing and Amazon SageMaker Data Wrangler jobs.

Data preprocessing holds a pivotal role in a data-centric AI approach. However, preparing raw data for ML training and evaluation is often a tedious and demanding task in terms of compute resources, time, and human effort. Data preparation commonly needs to be integrated from different sources and deal with missing or noisy values, outliers, and so on.

Furthermore, in addition to common extract, transform, and load (ETL) tasks, ML teams occasionally require more advanced capabilities like creating quick models to evaluate data and produce feature importance scores or post-training model evaluation as part of an MLOps pipeline.

SageMaker offers two features specifically designed to help with those issues: SageMaker Processing and Data Wrangler. SageMaker Processing enables you to easily run preprocessing, postprocessing, and model evaluation on a fully managed infrastructure. Data Wrangler reduces the time it takes to aggregate and prepare data by simplifying the process of data source integration and feature engineering using a single visual interface and a fully distributed data processing environment.

Both SageMaker features provide great flexibility with several options for I/O, storage, and computation. However, setting those options incorrectly may lead to unnecessary cost, especially when dealing with large datasets.

In this post, we analyze the pricing factors and provide cost optimization guidance for SageMaker Processing and Data Wrangler jobs.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

SageMaker Processing

SageMaker Processing is a managed solution to run data processing and model evaluation workloads. You can use it in data processing steps such as feature engineering, data validation, model evaluation, and model interpretation in ML workflows. With SageMaker Processing, you can bring your own custom processing scripts and choose to build a custom container or use a SageMaker managed container with common frameworks like scikit-learn, Lime, Spark and more.

SageMaker Processing charges you for the instance type you choose, based on the duration of use and provisioned storage that is attached to that instance. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker.

You can filter processing costs by applying a filter on the usage type. The names of these usage types are as follows:

REGION-Processing:instanceType (for example, USE1-Processing:ml.m5.large)
REGION-Processing:VolumeUsage.gp2 (for example, USE1-Processing:VolumeUsage.gp2)

To review your SageMaker Processing cost in Cost Explorer, start by filtering with SageMaker for Service, and for Usage type, you can select all processing instances running hours by entering the processing:ml prefix and selecting the list on the menu.

Avoid cost in processing and pipeline development

Before right-sizing and optimizing a SageMaker Processing job’s run duration, we check for high-level metrics about historic job runs. You can choose from two methods to do this.

First, you can access the Processing page on the SageMaker console.

Alternatively, you can use the list_processing_jobs API.

A Processing job status can be InProgress, Completed, Failed, Stopping, or Stopped.

A high number of failed jobs is common when developing new MLOps pipelines. However, you should always test and make every effort to validate jobs before launching them on SageMaker because there are charges for resources used. For that purpose, you can use SageMaker Processing in local mode. Local mode is a SageMaker SDK feature that allows you to create estimators, processors, and pipelines, and deploy them to your local development environment. This is a great way to test your scripts before running them in a SageMaker managed environment. Local mode is supported by SageMaker managed containers and the ones you supply yourself. To learn more about how to use local mode with Amazon SageMaker Pipelines, refer to Local Mode.

Optimize I/O-related cost

SageMaker Processing jobs offer access to three data sources as part of the managed processing input: Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift. For more information, refer to ProcessingS3Input, AthenaDatasetDefinition, and RedshiftDatasetDefinition, respectively.

Before looking into optimization, it’s important to note that although SageMaker Processing jobs support these data sources, they are not mandatory. In your processing code, you can implement any method for downloading the accessing data from any source (provided that the processing instance can access it).

To gain better insights into processing performance and detecting optimization opportunities, we recommend following logging best practices in your processing script. SageMaker publishes your processing logs to Amazon CloudWatch.

In the following example job log, we see that the script processing took 15 minutes (between Start custom script and End custom script).

However, on the SageMaker console, we see that the job took 4 additional minutes (almost 25% of the job’s total runtime).

This is due to the fact that in addition to the time our processing script took, SageMaker-managed data downloading and uploading also took time (4 minutes). If this proves to be a big part of the cost, consider alternate ways to speed up downloading time, such as using the Boto3 API with multiprocessing to download files concurrently, or using third-party libraries as WebDataset or s5cmd for faster download from Amazon S3. For more information, refer to Parallelizing S3 Workloads with s5cmd. Note that such methods might introduce charges in Amazon S3 due to data transfer.

Processing jobs also support Pipe mode. With this method, SageMaker streams input data from the source directly to your processing container into named pipes without using the ML storage volume, thereby eliminating the data download time and a smaller disk volume. However, this requires a more complicated programming model than simply reading from files on a disk.

As mentioned earlier, SageMaker Processing also supports Athena and Amazon Redshift as data sources. When setting up a Processing job with these sources, SageMaker automatically copies the data to Amazon S3, and the processing instance fetches the data from the Amazon S3 location. However, when the job is finished, there is no managed cleanup process and the data copied will still remain in Amazon S3 and might incur unwanted storage charges. Therefore, when using Athena and Amazon Redshift data sources, make sure to implement a cleanup procedure, such as a Lambda function that runs on a schedule or in a Lambda Step as part of a SageMaker pipeline.

Like downloading, uploading processing artifacts can also be an opportunity for optimization. When a Processing job’s output is configured using the ProcessingS3Output parameter, you can specify which S3UploadMode to use. The S3UploadMode parameter default value is EndOfJob, which will get SageMaker to upload the results after the job completes. However, if your Processing job produces multiple files, you can set S3UploadMode to Continuous, thereby enabling the upload of artifacts simultaneously as processing continues, and decreasing the job runtime.

Right-size processing job instances

Choosing the right instance type and size is a major factor in optimizing the cost of SageMaker Processing jobs. You can right-size an instance by migrating to a different version within the same instance family or by migrating to another instance family. When migrating within the same instance family, you only need to consider CPU/GPU and memory. For more information and general guidance on choosing the right processing resources, refer to Ensure efficient compute resources on Amazon SageMaker.

To fine-tune instance selection, we start by analyzing Processing job metrics in CloudWatch. For more information, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

CloudWatch collects raw data from SageMaker and processes it into readable, near-real-time metrics. Although these statistics are kept for 15 months, the CloudWatch console limits the search to metrics that were updated in the last 2 weeks (this ensures that only current jobs are shown). Processing jobs metrics can be found in the /aws/sagemaker/ProcessingJobs namespace and the metrics collected are CPUUtilization, MemoryUtilization, GPUUtilization, GPUMemoryUtilization, and DiskUtilization.

The following screenshot shows an example in CloudWatch of the Processing job we saw earlier.

In this example, we see the averaged CPU and memory values (which is the default in CloudWatch): the average CPU usage is 0.04%, memory 1.84%, and disk usage 13.7%. In order to right-size, always consider the maximum CPU and memory usage (in this example, the maximum CPU utilization was 98% in the first 3 minutes). As a general rule, if your maximum CPU and memory usage is consistently less than 40%, you can safely cut the machine in half. For example, if you were using an ml.c5.4xlarge instance, you could move to an ml.c5.2xlarge, which could reduce your cost by 50%.

Data Wrangler jobs

Data Wrangler is a feature of Amazon SageMaker Studio that provides a repeatable and scalable solution for data exploration and processing. You use the Data Wrangler interface to interactively import, analyze, transform, and featurize your data. Those steps are captured in a recipe (a .flow file) that you can then use in a Data Wrangler job. This helps you reapply the same data transformations on your data and also scale to a distributed batch data processing job, either as part of an ML pipeline or independently.

For guidance on optimizing your Data Wrangler app in Studio, refer to Part 2 in this series.

In this section, we focus on optimizing Data Wrangler jobs.

Data Wrangler uses SageMaker Spark processing jobs with a Data Wrangler-managed container. This container runs the directions from the .flow file in the job. Like any processing jobs, Data Wrangler charges you for the instances you choose, based on the duration of use and provisioned storage that is attached to that instance.

In Cost Explorer, you can filter Data Wrangler jobs costs by applying a filter on the usage type. The names of these usage types are:

REGION-processing_DW:instanceType (for example, USE1-processing_DW:ml.m5.large)
REGION-processing_DW:VolumeUsage.gp2 (for example, USE1-processing_DW:VolumeUsage.gp2)

To view your Data Wrangler cost in Cost Explorer, filter the service to use SageMaker, and for Usage type, choose the processing_DW prefix and select the list on the menu. This will show you both instance usage (hours) and storage volume (GB) related costs. (If you want to see Studio Data Wrangler costs you can filter the usage type by the Studio_DW prefix.)

Right-size and schedule Data Wrangler job instances

At the moment, Data Wrangler supports only m5 instances with following instance sizes: ml.m5.4xlarge, ml.m5.12xlarge, and ml.m5.24xlarge. You can use the distributed job feature to fine-tune your job cost. For example, suppose you need to process a dataset that requires 350 GiB in RAM. The 4xlarge (128 GiB) and 12xlarge (256 GiB) might not be able to process and will lead you to use the m5.24xlarge instance (768 GiB). However, you could use two m5.12xlarge instances (2 * 256 GiB = 512 GiB) and reduce the cost by 40% or three m5.4xlarge instances (3 * 128 GiB = 384 GiB) and save 50% of the m5.24xlarge instance cost. You should note that these are estimates and that distributed processing might introduce some overhead that will affect the overall runtime.

When changing the instance type, make sure you update the Spark config accordingly. For example, if you have an initial ml.m5.4xlarge instance job configured with properties spark.driver.memory set to 2048 and spark.executor.memory set to 55742, and later scale up to ml.m5.12xlarge, those configuration values need to be increased, otherwise they will be the bottleneck in the processing job. You can update these variables in the Data Wrangler GUI or in a configuration file appended to the config path (see the following examples).

Another compelling feature in Data Wrangler is the ability to set a scheduled job. If you’re processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data (for examples, see Export to Amazon S3 or Export to Amazon SageMaker Feature Store). However, you should note that when you create a schedule, Data Wrangler creates an eventRule in EventBridge. This means you also be charged for the event rules that you create (as well as the instances used to run the processing job). For more information, see Amazon EventBridge pricing.

Conclusion

In this post, we provided guidance on cost analysis and best practices when preprocessing

data using SageMaker Processing and Data Wrangler jobs. Similar to preprocessing, there are many options and configuration settings in building, training, and running ML models that may lead to unnecessary costs. Therefore, as machine learning establishes itself as a powerful tool across industries, ML workloads needs to remain cost-effective.

SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline.

This robustness also provides continuous cost optimization opportunities without compromising performance or agility.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, and watching sunsets (at minimum once a day).

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 2: SageMaker notebooks and Studio

May 30, 2023

by Deepali Rajale Amazon AWS

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support offering. Since its introduction, we have helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. In this post, we focus on various ways to analyze SageMaker usage and identify cost optimization opportunities for SageMaker notebook instances and Amazon SageMaker Studio.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

SageMaker notebook instances

A SageMaker notebook instance is a fully managed compute instance running the Jupyter Notebook app. SageMaker manages creating the instance and related resources. Notebooks contain everything needed to run or recreate an ML workflow. You can use Jupyter notebooks in your notebook instance to prepare and process data, write code to train models, deploy models to SageMaker Hosting, and test or validate your models. SageMaker notebook instances’ cost is based on the instance-hours consumed while the notebook instance is running, as well as the cost of GB-month of provisioned storage, as outlined in Amazon SageMaker Pricing.

In Cost Explorer, you can filter notebook costs by applying a filter on Usage type. The names of these usage types are structured as follows:

REGION-Notebk:instanceType (for example, USE1-Notebk:ml.g4dn.8xlarge)
REGION-Notebk:VolumeUsage.gp2 (for example, USE2-Notebk:VolumeUsage.gp2)

Filtering by the usage type Notebk: will show you a list of notebook usage types in an account. As shown in the following screenshot, you can select Select All and choose Apply to display the cost breakdown of your notebook usage.

To see the cost breakdown of the selected notebook usage type by the number of usage hours, you need to de-select all the REGION-Notebk:VolumeUsage.gp2 usage types from the preceding list and choose Apply to apply the filter. The following screenshot shows the cost and usage graphs for the selected notebook usage types.

You can also apply additional filters such as account number, Amazon Elastic Compute Cloud (Amazon EC2) instance type, cost allocation tag, Region, and more. Changing the granularity to Daily gives you daily cost and usage charts based on the selected usage types and dimension, as shown in the following screenshot.

In the preceding example, the notebook instance of type ml.t2.medium in the USE2 Region is reporting a daily usage of 24 hours between the period of July 2 and September 26. Similarly, the notebook instance of type ml.t3.medium in the USE1 Region is reporting a daily usage of 24 hours between August 3 and September 26, and a daily usage of 48 hours between September 26 and December 31. Daily usage of 24 hours or more for multiple consecutive days could indicate that a notebook instance has been left running for multiple days but is not in active use. This type of pattern could benefit from applying cost control guardrails such as manual or auto-shutdown of notebook instances to prevent idle runtime.

Although Cost Explorer helps you understand cost and usage data at the granularity of the instance type, you can use AWS Cost and Usage Reports (AWS CUR) to get data at the granularity of a resource such as notebook ARN. You can build custom queries to look up AWS CUR data using standard SQL. You can also include cost-allocation tags in your query for an additional level of granularity. The following query returns notebook resource usage for the last 3 months from your AWS CUR data:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS notebook_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      {$table_name}
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Notebk%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Amazon Athena. For more information about using Athena, refer to Querying Cost and Usage Reports using Amazon Athena.

The result of the query shows that notebook dev-notebook running on an ml.t2.medium instance is reporting 24 hours of usage for multiple consecutive days. The instance rate is $0.0464/hour and the daily cost for running for 24 hours is $1.1136.

AWS CUR query results can help you identify patterns of notebooks running for consecutive days, which can be analyzed for cost optimization. More information and example queries can be found in the AWS CUR Query Library.

You can also feed AWS CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions on ingesting AWS CUR data into QuickSight, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.

Optimize notebook instance cost

SageMaker notebooks are suitable for ML model development, which includes interactive data exploration, script writing, prototyping of feature engineering, and modeling. Each of these tasks may have varying computing resource requirements. Estimating the right type of computing resources to serve various workloads is challenging, and may lead to over-provisioning of resources, resulting in increased cost.

For ML model development, the size of a SageMaker notebook instance depends on the amount of data you need to load in-memory for meaningful exploratory data analyses (EDA) and the amount of computation required. We recommend starting small with general-purpose instances (such as T or M families) and scaling up as needed. For example, ml.t2.medium is sufficient for most basic data processing, feature engineering, and EDA that deals with small datasets that can be held within 4 GB memory. If your model development involves heavy computational work (such as image processing), you can stop your smaller notebook instance and change the instance type to the desired larger instance, such as ml.c5.xlarge. You can switch back to the smaller instance when you no longer need a larger instance. This will help keep the compute costs down.

Consider the following best practices to help reduce the cost of your notebook instances.

CPU vs. GPU

Considering CPU vs. GPU notebook instances is important for instance right-sizing. CPUs are best at handling single, more complex calculations sequentially, whereas GPUs are better at handling multiple but simple calculations in parallel. For many use cases, a standard current generation instance type from an instance family such as M provides enough computing power, memory, and network performance for notebooks to perform well.

GPUs provide a great price/performance ratio if you take advantage of them effectively. For example, if you are training your deep learning model on a SageMaker notebook and your neural network is relatively big, performing a large number of calculations involving hundreds of thousands of parameters, then your model can take advantage of the accelerated compute and hardware parallelism offered by GPU instances such as P instance families. However, it’s recommended to use GPU instances only when you really need them because they’re expensive and GPU communication overhead might even degrade performance if your notebook doesn’t need them. We recommend using notebooks with instances that are smaller in compute for interactive building and leaving the heavy lifting to ephemeral training, tuning, and processing jobs with larger instances, including GPU-enabled instances. This way, you don’t keep a large instance (or a GPU) constantly running with your notebook. If you need accelerated computing in your notebook environment, you can stop your m* family notebook instance, switch to a GPU-enabled P* family instance, and start it again. Don’t forget to switch it back when you no longer need that extra boost in your development environment.

Restrict user access to specific instance types

Administrators can restrict users from creating notebooks that are too large through AWS Identity and Access Management (IAM) policies. For example, the following sample policy only allows users to create smaller t3 SageMaker notebook instances:

{
    "Action": [
        "sagemaker:CreateNotebookInstances"
    ],
    "Resource": [
        "*"
    ],
    "Effect": "Deny",
    "Sid": "BlockLargeNotebookInstances",
    "Condition": {
        "ForAnyValue:StringNotLike": {
            "sagemaker:InstanceTypes": [
                "ml.t3.medium",
                "ml.t3.large"
            ]
        }
    }
}

Administrators can also use AWS Service Catalog to allow for self-service of SageMaker notebooks. This allows you to restrict the instance types that are available to users when creating a notebook. For more information, see Enable self-service, secured data science using Amazon SageMaker notebooks and AWS Service Catalog and Launch Amazon SageMaker Studio using AWS Service Catalog and AWS SSO in AWS Control Tower Environment.

Stop idle notebook instances

To keep your costs down, we recommend stopping your notebook instances when you don’t need them and starting them when you do need them. Consider auto-detecting idle notebook instances and managing their lifecycle using a lifecycle configuration script. For example, auto-stop-idle is a sample shell script that stops a SageMaker notebook when it’s idle for more than 1 hour.

AWS maintains a public repository of notebook lifecycle configuration scripts that address common use cases for customizing notebook instances, including a sample bash script for stopping idle notebooks.

Schedule automatic start and stop of notebook instances

Another approach to save on notebooks cost is to automatically start and stop your notebooks at specific times. You can accomplish this by using Amazon EventBridge rules and AWS Lambda functions. For more information about configuring your Lambda functions, see Configuring Lambda function options. After you have created the functions, you can create rules to trigger these functions on a specific schedule, for example, start the notebooks every weekday at 7:00 AM. See Creating an Amazon EventBridge rule that runs on a schedule for instructions. For the scripts to start and stop notebooks with a Lambda function, refer to Ensure efficient compute resources on Amazon SageMaker.

SageMaker Studio

Studio provides a fully managed solution for data scientists to interactively build, train, and deploy ML models. Studio notebooks are one-click collaborative Jupyter notebooks that can be spun up quickly because you don’t need to set up compute instances and file storage beforehand. You are charged for the compute instance type you choose to run your notebooks on, based on the duration of use. There is no additional charge for using Studio. The costs incurred for running Studio notebooks, interactive shells, consoles, and terminals are based on ML compute instance usage.

When launched, the resource is run on an ML compute instance of the chosen instance type. If an instance of that type was previously launched and is available, the resource is run on that instance. For CPU-based images, the default suggested instance type is ml.t3.medium. For GPU-based images, the default suggested instance type is ml.g4dn.xlarge. Billing occurs per instance and starts when the first instance of a given instance type is launched.

If you want to create or open a notebook without the risk of incurring charges, open the notebook from the File menu and choose No Kernel from the Select Kernel dialog. You can read and edit a notebook without a running kernel, but you can’t run cells. You are billed separately for each instance. Billing ends when all the KernelGateway apps on the instance are shut down, or the instance is shut down. For information about billing along with pricing examples, see Amazon SageMaker Pricing.

In Cost Explorer, you can filter Studio notebook costs by applying a filter on Usage type. The name of this usage types is structured as: REGION-studio:KernelGateway-instanceType (for example, USE1-Studio:KernelGateway-ml.m5.large)

Filtering by the usage type studio: in Cost Explorer will show you the list of Studio usage types in an account. You can select the necessary usage types, or select Select All and choose Apply to display the cost breakdown of Studio app usage. The following screenshot shows the selection all the studio usage types for cost analysis.

You can also apply additional filters such as Region, linked account, or instance type for more granular cost analysis. Changing the granularity to Daily gives you daily cost and usage charts based selected usage types and dimension, as shown in the following screenshot.

In the preceding example, the Studio KernelGateway instance of type ml.t3.medium in the USE1 Region is reporting a daily usage of 48 hours between the period of January 1 and January 24, followed by a daily usage of 24 hours until February 11. Similarly, Studio KernelGateway instance of type ml.m5.large in USE1 Region is reporting 24 hours of daily usage of between January 1 and January 23. A daily usage of 24 hours or more for multiple consecutive days indicates a possibility of Studio notebook instances running continuously for multiple days. This type of pattern could benefit from applying cost control guardrails such as manual or automatic shutdown of Studio apps when not in use.

As mentioned earlier, you can use AWS CUR to get data at the granularity of a resource and build custom queries to look up AWS CUR data using standard SQL. You can also include cost-allocation tags in your query for an additional level of granularity. The following query returns Studio KernelGateway resource usage for the last 3 months from your AWS CUR data:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS studio_notebook_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      customer_all
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Studio:KernelGateway%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Athena.

The result of the query shows that the Studio KernelGateway app named datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395 running in account 111111111111, Studio domain d-domain1234, and user profile user1 on an ml.t3.medium instance is reporting 24 hours of usage for multiple consecutive days. The instance rate is $0.05/hour and the daily cost for running for 24 hours is $1.20.

AWS CUR query results can help you identify patterns of resources running for consecutive days at a granular level of hourly or daily usage, which can be analyzed for cost optimization. As with SageMaker notebooks, you can also feed AWS CUR data into QuickSight for reporting or visualization purposes.

SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a feature of Studio that helps you simplify the process of data preparation and feature engineering from a low-code visual interface. The usage type name for a Studio Data Wrangler app is structured as REGION-Studio_DW:KernelGateway-instanceType (for example, USE1-Studio_DW:KernelGateway-ml.m5.4xlarge).

Filtering by the usage type studio_DW: in Cost Explorer will show you the list of Studio Data Wrangler usage types in an account. You can select the necessary usage types, or select Select All and choose Apply to display the cost breakdown of Studio Data Wrangler app usage. The following screenshot shows the selection all the studio_DW usage types for cost analysis.

As noted earlier, you can also apply additional filters for more granular cost analysis. For example, the following screenshot shows 24 hours of daily usage of the Studio Data Wrangler instance type ml.m5.4xlarge in the USE1 Region for multiple days and its associated cost. Insights like this can be used to apply cost control measures such as shutting down Studio apps when not in use.

You can obtain resource-level information from AWS CUR, and build custom queries to look up AWS CUR data using standard SQL. The following query returns Studio Data Wrangler app resource usage and associated cost for the last 3 months from your AWS CUR data:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS studio_notebook_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      {$table_name}
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Studio_DW:KernelGateway%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Athena.

The result of the query shows that the Studio Data Wrangler app named sagemaker-data-wrang-ml-m5-4xlarge-b741c1a025d542c78bb538373f2d running in account 111111111111, Studio domain d-domain1234, and user profile user1 on an ml.m5.4xlarge instance is reporting 24 hours of usage for multiple consecutive days. The instance rate is $0.922/hour and the daily cost for running for 24 hours is $22.128.

Optimize Studio cost

Studio notebooks are charged for the instance type you choose, based on the duration of use. You must shut down the instance to stop incurring charges. If you shut down the notebook running on the instance but don’t shut down the instance, you will still incur charges. When you shut down the Studio notebook instances, any additional resources, such as SageMaker endpoints, Amazon EMR clusters, and Amazon Simple Storage Service (Amazon S3) buckets created from Studio are not deleted. Delete those resources if they are no longer needed to stop accrual of charges. For more details about shutting down Studio resources, refer to Shut Down Resources. If you’re using Data Wrangler, it’s important to shut it down after your work is done to save cost. For details, refer to Shut Down Data Wrangler.

Consider the following best practices to help reduce the cost of your Studio notebooks.

Automatically stop idle Studio notebook instances

You can automatically stop idle Studio notebook resources with lifecycle configurations in Studio. You can also install and use a JupyterLab extension available on GitHub as a Studio lifecycle configuration. For detailed instructions on the Studio architecture and adding the extension, see Save costs by automatically shutting down idle resources within Amazon SageMaker Studio.

Resize on the fly

The benefit of Studio notebooks over notebook instances is that with Studio, the underlying compute resources are fully elastic and you can change the instance on the fly. This allows you to scale the compute up and down as your compute demand changes, for example from ml.t3.medium to ml.m5.4xlarge, without interrupting your work or managing infrastructure. Moving from one instance to another is seamless, and you can continue working while the instance launches. With on-demand notebook instances, you need to stop the instance, update the setting, and restart with the new instance type. For more information, see Learn how to select ML instances on the fly in Amazon SageMaker Studio.

Restrict user access to specific instance types

Administrators can use IAM condition keys as an effective way to restrict certain instance types, such as GPU instances, for specific users, thereby controlling costs. For example, in the following sample policy, access is denied for all instances except ml.t3.medium and ml.g4dn.xlarge. Note that you need to allow the system instance for the default Jupyter Server apps.

{
    "Action": [
        "sagemaker:CreateApp"
    ],
    "Resource": [
        "*"
    ],
    "Effect": "Deny",
    "Sid": "BlockSagemakerLargeInstances",
    "Condition": {
        "ForAnyValue:StringNotLike": {
            "sagemaker:InstanceTypes": [
                "ml.t3.medium",
                "ml.g4dn.xlarge",
                "system"
            ]
        }
    }
}

For comprehensive guidance on best practices to optimize Studio cost, refer to Ensure efficient compute resources on Amazon SageMaker.

Use tags to keep track of Studio cost

In Studio, you can assign custom tags to your Studio domain as well as users who are provisioned access to the domain. Studio will automatically copy and assign these tags to the Studio notebooks created by the users, so you can easily track and categorize the cost of Studio notebooks and create cost chargeback models for your organization.

By default, SageMaker automatically tags new SageMaker resources such as training jobs, processing jobs, experiments, pipelines, and model registry entries with their respective sagemaker:domain-arn. SageMaker also tags the resource with the sagemaker:user-profile-arn or sagemaker:space-arn to designate the resource creation at an even more granular level.

Administrators can use automated tagging to easily monitor costs associated with their line of business, teams, individual users, or individual business problems by using tools such as AWS Budgets and Cost Explorer. For example, you can attach a cost allocation tag for the sagemaker:domain-arn tag.

This allows you to utilize Cost Explorer to visualize the Studio notebook spend for a given domain.

Consider storage costs

When the first member of your team onboards to Studio, SageMaker creates an Amazon Elastic File System (Amazon EFS) volume for the team. When this member, or any member of the team, opens Studio, a home directory is created in the volume for the member. A storage charge is incurred for this directory. Subsequently, additional storage charges are incurred for the notebooks and data files stored in the member’s home directory. For more information, see Amazon EFS Pricing.

Conclusion

In this post, we provided guidance on cost analysis and best practices when building ML models using notebook instances and Studio. As machine learning establishes itself as a powerful tool across industries, training and running ML models needs to remain cost-effective. SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline and provides cost optimization opportunities without impacting performance or agility.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

About the Authors

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 1

May 30, 2023

by Deepali Rajale Amazon AWS

Cost optimization is one of the pillars of the AWS Well-Architected Framework, and it’s a continual process of refinement and improvement over the span of a workload’s lifecycle. It enables building and operating cost-aware systems that minimize costs, maximize return on investment, and achieve business outcomes.

Amazon SageMaker is a fully managed machine learning (ML) service that offers a variety of cost optimization options and capabilities like managed spot training, multi-model endpoints, AWS Inferentia, ML Savings Plans, and many others that help reduce the total cost of ownership (TCO) of ML workloads compared to other cloud-based options, such as self-managed Amazon Elastic Compute Cloud (Amazon EC2) and AWS-managed Amazon Elastic Kubernetes Service (Amazon EKS).

AWS is dedicated to helping you achieve the highest savings by offering extensive service and pricing options. We provide tools for flexible cost management and improved visibility of detailed cost and usage of your workloads.

In this post, we share lessons learned and walk you through the various ways to analyze your SageMaker usage and identify opportunities for cost optimization.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

Analyze SageMaker cost using AWS Cost Explorer

AWS Cost Explorer provides preconfigured views that display information about your cost trends and give you a head start on understanding your cost history and trends. It allows you to filter and group by values such as AWS service, usage type, cost allocation tags, EC2 instance type, and more. If you use consolidated billing, you can also filter by linked account. In addition, you can set time intervals and granularity, as well as forecast future costs based on your historical cost and usage data.

Let’s start by using Cost Explorer to identify cost optimization opportunities in SageMaker.

On the Cost Explorer console, choose SageMaker for Service and choose Apply filters.
You can set the desired time interval and granularity, as well as the Group by parameter.
You can display the chart data in bar, line, or stack plot format.
After you have achieved your desired results with filters and groupings, you can either download your results by choosing Download as CSV or save the report by choosing Save to report library.

The following screenshot shows SageMaker costs per month for the selected date range, grouped by Region.

For general guidance on using Cost Explorer, refer to AWS Cost Explorer’s New Look and Common Use Cases.

Optionally, you can enable AWS Cost and Usage Reports (AWS CUR) to gain insights into the cost and usage data for your accounts. The report contains hourly AWS consumption details. It is stored in Amazon Simple Storage Service (Amazon S3) in the payer account, which consolidates data for all the linked accounts. You can query the report to analyze trends in your usage and take appropriate action to optimize cost. Amazon Athena is a serverless query service you can use to analyze the data from your report in Amazon S3 using standard SQL. For more information and example queries, refer to the AWS CUR Query Library.

The following code is an example of an AWS CUR query to obtain SageMaker costs for the last 3 months of usage:

SELECT *
FROM {$table_name}
WHERE 
    line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
    AND line_item_product_code = 'AmazonSageMaker'
    AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')

You can also feed AWS CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions on ingesting CUR data into QuickSight, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.

Analyze cost for SageMaker usage types

Your monthly SageMaker cost comes from different SageMaker usage types such as notebook instances, hosting, training, and processing, among others. Selecting the SageMaker service filter and grouping by the Usage type dimension in Cost Explorer gives you a general idea of cost distribution based on SageMaker usage type. The usage type is displayed in the format

REGION-UsageType:instanceType (for example, USE1-Notebk:ml.g4dn.8xlarge)

The following screenshot shows cost distribution grouped by SageMaker usage types when an account has reported usage on notebooks and Amazon SageMaker Studio KernelGateway apps.

General best practices for optimizing SageMaker cost

In this section, we share general recommendations to save on costs while using SageMaker.

Tagging

A tag is a label that you assign to an AWS resource. You can use tags to organize your resources by users, departments, or cost centers, and track your costs on a detailed level. Cost allocation tags can be used for categorizing costs in Cost Explorer or Cost and Usage Reports. For tips and best practices regarding cost allocation for your SageMaker environment and workloads, refer to Set up enterprise-level cost allocation for ML environments and workloads using resource tagging in Amazon SageMaker

AWS Budgets

AWS Budgets gives you visibility into your ML cost on AWS and helps you track your SageMaker cost, including development, training, and hosting. It lets you set custom budgets to track your cost and usage from the simplest to the most complex use cases. AWS Budgets also supports email or Amazon Simple Notification Service (Amazon SNS) notification when actual or forecasted cost and usage exceeds your budget threshold, or when your actual Savings Plans’ utilization or coverage drops below your desired threshold.

AWS Budgets is also integrated with Cost Explorer, so you can easily view and analyze your cost and usage drivers, AWS Chatbot, so you can receive AWS Budget alerts in your designated Slack channel or Amazon Chime room, and AWS Service Catalog, so you can track cost on your approved AWS portfolios and products. You can also set alerts and get a notification when your cost or usage exceeds (or is forecasted to exceed) your budgeted amount. After you create your budget, you can track the progress on the AWS Budgets console. For more information, see Managing your costs with AWS Budgets.

AWS Billing console

The AWS Billing console allows you to easily understand your AWS spending, view and pay invoices, manage billing preferences and tax settings, and access additional cloud financial management services. You can quickly evaluate whether your monthly spend is in line with prior periods, forecast, or budget, and investigate and take corrective actions in a timely manner. You can use the dashboard page of the AWS Billing console to gain a general view of your AWS spending. You can also use it to identify your highest cost service or Region and view trends in your spending over the past few months as well as to see various breakdowns of your AWS usage.

The AWS summary section of the page gives an overview of your AWS costs across all accounts, Regions, service providers, and services, and other KPIs. It also provides a comparison to your total forecasted costs for the current month. The Highest cost section shows your top service, account, or Region by estimated month-to-date (MTD) spend. The Cost trend by top five services section shows the cost trend for your top five services for the most recent three to six closed billing periods.

Planning and forecasting

Forecasting is an essential part of staying on top of your cloud costs and usage, and becomes even more important as your business scales.

AWS has multiple options to help you forecast your costs. The forecasting feature of Cost Explorer gives you the ability to create custom usage forecasts to gain a line of sight into your expected future costs. The built-in ML-powered forecasting of QuickSight allows you to forecast your key business metrics with point-and-click simplicity. It offers a straightforward way to use ML to make predictions on any time series data with minimal setup time and no ML experience required.

You can also use Amazon Forecast, a fully managed service that uses ML to deliver highly accurate forecasts, to generate forecasts for specific AWS services with data collected from AWS CUR. For more information, see Forecasting AWS spend using the AWS Cost and Usage Reports, AWS Glue DataBrew, and Amazon Forecast.

For additional information about cost forecasting options, see Using the right tools for your cloud cost forecasting.

Instance right-sizing

You can optimize SageMaker cost and only pay for what you really need by selecting the right resources. You should right-size the SageMaker compute instances before purchasing a Savings Plan in order to provide a proper commitment and obtain maximum cost savings. SageMaker currently offers ML compute instances on the various instance families. Machine learning is an iterative process with varying compute requirements for different stages of the ML lifecycle, from data preprocessing to model training and model hosting. Identifying the right type of compute instance is challenging, and may lead to over-provisioning of resources and therefore increased cost. The modular architecture of SageMaker allows you to optimize the scalability, performance, and pricing of your ML workloads based on the stage of the ML lifecycle. For more details, refer to the Right-sizing compute resources for Amazon SageMaker notebooks, processing jobs, training, and deployment section of the post Ensure efficient compute resources on Amazon SageMaker.

Amazon SageMaker Savings Plans

Amazon SageMaker Savings Plans is a flexible pricing model for SageMaker. It offers discounted rates in exchange for a commitment to a consistent amount of usage (measured in $/hour) for a 1-year or 3-year term. Savings Plans provide flexibility due to their usage-based model and help reduce your costs by up to 64%. These rates automatically apply to eligible SageMaker ML instance usages including Studio notebooks, SageMaker notebook instances, SageMaker Processing, SageMaker Data Wrangler, SageMaker training, SageMaker real-time inference, and SageMaker batch transform regardless of instance family, size, or Region. This makes it easy for you to maximize savings regardless of how your use cases and consumption evolve over time, and you can save up to 64% compared to the On-Demand price.

For example, you could start with small instances to experiment with different algorithms on a fraction of your dataset. Then, you could move to larger instances to prepare data and train at scale against your full dataset. Finally, you could deploy your models in several Regions to serve low-latency predictions to your users. All the instance size modifications and deployments across new Regions would be covered by the same Savings Plan, without any management effort required on your part.

Every type of SageMaker usage that is eligible for SageMaker Savings Plans has a Savings Plans rate and an On-Demand rate. When you sign up for the SageMaker Savings Plans, you will be charged the Savings Plan rate for your usage up to your commitment. Any usage beyond the commitment will be charged at On-Demand rates. The AWS Cost Management console provides you with recommendations that make it easy to find the right commitment level for a Savings Plan. These recommendations are based on the following:

Your SageMaker usage in the last 7, 30, or 60 days. You should select the time period that best represents your future usage.
The term of your plan: 1-year or 3-year.
Your payment option: No Upfront, Partial Upfront (50% or more), or All Upfront. Some customers prefer (or must use) this last option, because it gives them a clear and predictable view of their SageMaker bill.

The recommendations are based on your historical usage over the selected lookback period and don’t forecast your usage. Be sure to select a lookback period that reflects your future usage. A 3-year term plan provides the highest discount rate; similarly, an All Upfront payment option offers the highest discount rate compared to No Upfront or Partial Upfront payment options. Workloads and usage typically change over time and a consistent, steady-state usage pattern makes a good candidate for a savings plan. If you have a lot of short-lived or one-off workloads, selecting the right commitment for compute usage (measured per hour) could be difficult. It’s recommended to continually purchase small amounts of Savings Plans commitment over time. This ensures that you maintain high levels of coverage to maximize your discounts, and your plans closely match your workload and organization requirements at all times.

To understand Savings Plan recommendations, refer to Decrease Your Machine Learning Costs with Instance Price Reductions and Savings Plans for Amazon SageMaker.

Utilization report

For active Savings Plans, utilization reports are available on the Savings Plans console to see the percentage of the commitment that you’ve actually used. You can use your Savings Plans utilization report to visually understand how much of your Savings Plans commitment you are using over the configured time period, as well as your savings as compared to On-Demand prices. For example, if you have a $10/hour commitment, and your usage billed with Savings Plans rates totals to $9.80 for the hour, your utilization for that hour is 98%. You can see your Savings Plans utilization at an hourly, daily, or monthly granularity, based on your lookback period. You can apply filters by Savings Plans type, member account, Region, and instance family in the Filters section. If you’re a user in a management account, you can see the aggregated utilization for the entire Consolidated Billing Family.

The following screenshot shows an example of a utilization report. You can see that even though Savings Plans coverage is not 100% on many consecutive days, the total net savings is still positive. Without Savings Plans, you would be charged at On-Demand rates for the usage. To realize maximum savings and avoid over-committing, it’s recommended to select the right commitment based on consistent, optimized usage of your SageMaker workloads.

Coverage report

Likewise, coverage reports show you how much of your eligible spend has been covered by the plan. To understand how the coverage is calculated, refer to Using your coverage report.

The following screenshot shows an example of a coverage report. You can see that the average coverage for the selected time period is 92%, along with the On-Demand spend that was not covered by the plan. Based on the On-Demand spend not covered by the plan, you can optionally buy an additional Savings Plan to obtain maximum savings. Also, it’s recommended to right-size the SageMaker compute instances before purchasing a Savings Plan and understand the workload size to avoid over- or under-committing the Savings Plan usage.

For more details on how Savings Plans apply to your AWS usage, refer to Understanding how Savings Plans apply to your AWS usage.

Conclusion

Machine learning has established itself as a powerful tool across industries, but training new models and running ML models for inference can be costly. One of the advantages of running ML on SageMaker is the wide and deep feature set offering cost optimization strategies without impacting performance or agility. This post highlighted the AWS tools and options to analyze your SageMaker costs, identify trends, and implement proactive alerts and optimization best practices.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:

Part 1: Overview
Part 2: SageMaker notebooks and SageMaker Studio
Part 3: Processing and Data Wrangler jobs
Part 4: Training jobs
Part 5: Hosting

About the Authors

High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

May 30, 2023

by Jesse Manders Amazon AWS

Amazon SageMaker Ground Truth Plus helps you prepare high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks labels your data. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus. Now, Ground Truth Plus is serving customers who need data labeling and human feedback for fine-tuning foundation models for generative AI applications.

In this post, you will learn about recent advancements in human feedback for generative AI available through SageMaker Ground Truth Plus. This includes new workflows and user interfaces (UIs) available for preparing demonstration datasets used in supervised fine-tuning, gathering high-quality human feedback to make preference datasets for aligning generative AI foundation models with human preferences, as well as customizing models to application builders’ requirements for style, substance, and voice.

Challenges of getting started with generative AI

Generative AI applications around the world incorporate both single-mode and multi-modal foundation models to solve for many different use cases. Common among them are chatbots, image generators, and video generators. Large language models (LLMs) are being used in chatbots for creative pursuits, academic and personal assistants, business intelligence tools, and productivity tools. You can use text-to-image models to generate abstract or realistic AI art and marketing assets. Text-to-video models are being used to generate videos for art projects, highly engaging advertisements, video game development, and even film development.

Two of the most important problems to solve for both model producers who create foundation models and application builders who use existing generative foundation models to build their own tools and applications are:

Fine-tuning these foundation models to be able to perform specific tasks
Aligning them with human preferences to ensure they output helpful, accurate, and harmless information

Foundation models are typically pre-trained on large corpora of unlabeled data, and therefore don’t perform well following natural language instructions. For an LLM, that means that they may be able to parse and generate language in general, but they may not be able to answer questions coherently or summarize text up to a user’s required quality. For example, when a user requests a summary of a text in a prompt, a model that hasn’t been fine-tuned how to summarize text may just recite the prompt text back to the user or respond with something irrelevant. If a user asks a question about a topic, the response from a model could just be a recitation of the question. For multi-modal models, such as text-to-image or text-to-video models, the models may output content unrelated to the prompt. For example, if a corporate graphic designer prompts a text-to-image model to create a new logo or an image for an advertisement, the model may not generate a relevant graphic related to the prompt if it has only a general concept of an image and elements of an image. In some cases, a model may output a harmful image or video, risking user confidence or company reputation.

Even if models are fine-tuned to perform specific tasks, they may not be aligned with human preferences with respect to the meaning, style, or substance of their output content. In an LLM, this could manifest itself as inaccurate or even harmful content being generated by the model. For example, a model that isn’t aligned with human preferences through fine-tuning may output dangerous, unethical, or even illegal instructions when prompted by a user. No care will have been taken to limit the content being generated by the model to ensure it is aligned with human preferences to be accurate, relevant, and useful. This misalignment can be a problem for companies that rely on generative AI models for their applications, such as chatbots and multimedia creation. For multi-modal models, this may take the form of toxic, dangerous, or abusive images or video being generated. This is a risk when prompts are input to the model without the intention of generating sensitive content, and also if the model producer or application builder had not intended to allow the model to generate that kind of content, but it was generated anyway.

To solve the issues of task-specific capability and aligning generative foundation models with human preferences, model producers and application builders must fine-tune the models with data using human-directed demonstrations and human feedback of model outputs.

Data and training types

There are several types of fine-tuning methods with different types of labeled data that are categorized as instruction tuning – or teaching a model how to follow instructions. Among them are supervised fine-tuning (SFT) using demonstration data, and reinforcement learning from human feedback (RLHF) using preference data.

Demonstration data for supervised fine-tuning

To fine-tune foundation models to perform specific tasks such as answering questions or summarizing text with high quality, the models undergo SFT with demonstration data. The purpose of demonstration data is to guide the model by providing it with labeled examples (demonstrations) of completed tasks being done by humans. For example, to teach an LLM how to answer questions, a human annotator will create a labeled dataset of human-generated question and answer pairs to demonstrate how a question and answer interaction works linguistically and what the content means semantically. This kind of SFT trains the model to recognize patterns of behavior demonstrated by the humans in the demonstration training data. Model producers need to do this type of fine-tuning to show that their models are capable of performing such tasks for downstream adopters. Application builders who use existing foundation models for their generative AI applications may need to fine-tune their models with demonstration data on these tasks with industry-specific or company-specific data to improve the relevancy and accuracy of their applications.

Preference data for instruction tuning such as RLHF

To further align foundation models with human preferences, model producers—and especially application builders—need to generate preference datasets to perform instruction tuning. Preference data in the context of instruction tuning is labeled data that captures human feedback with respect to a set of options output by a generative foundation model. It typically includes rating or ranking several inferences or pairwise comparing two inferences from a foundation model according to some specific attribute. For LLMs, these attributes may be helpfulness, accuracy, and harmlessness. For text-to-image models, it may be an aesthetic quality or text-image alignment. This preference data based on human feedback can then be used in various instruction tuning methods—including RLHF—in order to further fine-tune a model to align with human preferences.

Instruction tuning using preference data plays a crucial role in enhancing the personalization and effectiveness of foundation models. This is a key step in building custom applications on top of pre-trained foundation models and is a powerful method to ensure models are generating helpful, accurate, and harmless content. A common example of instruction tuning is to instruct a chatbot to generate three responses to a query, and have a human read and rank all three according to some specified dimension, such as toxicity, factual accuracy, or readability. For example, a company may use a chatbot for its marketing department and wants to make sure that content is aligned to its brand message, doesn’t exhibit biases, and is clearly readable. The company would prompt the chatbot during instruction tuning to produce three examples, and have their internal experts select the ones that most align to their goal. Over time, they build a dataset used to teach the model what style of content humans prefer through reinforcement learning. This enables the chatbot application to output more relevant, readable, and safe content.

SageMaker Ground Truth Plus

Ground Truth Plus helps you address both challenges—generating demonstration datasets with task-specific capabilities, as well as gathering preference datasets from human feedback to align models with human preferences. You can request projects for LLMs and multi-modal models such as text-to-image and text-to-video. For LLMs, key demonstration datasets include generating questions and answers (Q&A), text summarization, text generation, and text reworking for the purposes of content moderation, style change, or length change. Key LLM preference datasets include ranking and classifying text outputs. For multi-modal models, key task types include captioning images or videos as well as logging timestamps of events in videos. Therefore, Ground Truth Plus can help both model producers and application builders on their generative AI journey.

In this post, we dive deeper into the human annotator and feedback journey on four cases covering both demonstration data and preference data for both LLMs and multi-modal models: question and answer pair generation and text ranking for LLMs, as well as image captioning and video captioning for multi-modal models.

Large language models

In this section, we discuss question and answer pairs and text ranking for LLMs, along with customizations you may want for your use case.

Question and answer pairs

The following screenshot shows a labeling UI in which a human annotator will read a text passage and generate both questions and answers in the process of building a Q&A demonstration dataset.

Let’s walk through a tour of the UI in the annotator’s shoes. On the left side of the UI, the job requester’s specific instructions are presented to the annotator. In this case, the annotator is supposed to read the passage of text presented in the center of the UI and create questions and answers based on the text. On the right side, the questions and answers that the annotator has written are shown. The text passage as well as type, length, and number of questions and answers can all be customized by the job requester during the project setup with the Ground Truth Plus team. In this case, the annotator has created a question that requires understanding the whole text passage to answer and is marked with a References entire passage check box. The other two questions and answers are based on specific parts of the text passage, as shown by the annotator highlights with color-coded matching. Optionally, you may want to request that questions and answers are generated without a provided text passage, and provide other guidelines for human annotators—this is also supported by Ground Truth Plus.

After the questions and answers are submitted, they can flow to an optional quality control loop workflow where other human reviewers will confirm that customer-defined distribution and types of questions and answers have been created. If there is a mismatch between the customer requirements and what the human annotator has produced, the work will get funneled back to a human for rework before being exported as part of the dataset to deliver to the customer. When the dataset is delivered back to you, it’s ready to incorporate into the supervised fine-tuning workflow at your discretion.

Text ranking

The following screenshot shows a UI for ranking the outputs from an LLM based on a prompt.

You can simply write the instructions for the human reviewer, and bring prompts and pre-generated responses to the Ground Truth Plus project team to start the job. In this case, we have requested for a human reviewer to rank three responses per prompt from an LLM on the dimension of writing clarity (readability). Again, the left pane shows the instructions given to the reviewer by the job requester. In the center, the prompt is at the top of the page, and the three pre-generated responses are the main body for ease of use. On the right side of the UI, the human reviewer will rank them in order of most to least clear writing.

Customers wanting to generate this type of preference dataset include application builders interested in building human-like chatbots, and therefore want to customize the instructions for their own use. The length of the prompt, number of responses, and ranking dimension can all be customized. For example, you may want to rank five responses in order of most to least factually accurate, biased, or toxic, or even rank and classify multiple dimensions simultaneously. These customizations are supported in Ground Truth Plus.

Multi-modal models

In this section, we discuss image and video captioning for training multi-modal models such as text-to-image and text-to-video models, as well as customizations you may want to make for your particular use case.

Image captioning

The following screenshot shows a labeling UI for image captioning. You can request a project with image captioning to gather data to train a text-to-image model or an image-to-text model.

In this case, we have requested to train a text-to-image model and have set specific requirements on the caption in terms of length and detail. The UI is designed to walk the human annotators through the cognitive process of generating rich captions by providing a mental framework through assistive and descriptive tools. We have found that providing this mental framework for annotators results in more descriptive and accurate captions than simply providing an editable text box alone.

The first step in the framework is for the human annotator to identify key objects in the image. When the annotator chooses an object in the image, a color-coded dot appears on the object. In this case, the annotator has chosen both the dog and the cat, creating two editable fields on the right side of the UI wherein the annotator will enter the names of the objects—cat and dog—along with a detailed description of each object. Next, the annotator is guided to identify all the relationships between all the objects in the image. In this case, the cat is relaxing next to the dog. Next, the annotator is asked to identify specific attributes about the image, such as the setting, background, or environment. Finally, in the caption input text box, the annotator is instructed to combine all of what they wrote in the objects, relationships, and image setting fields into a complete single descriptive caption of the image.

Optionally, you can configure this image caption to be passed through a human-based quality check loop with specific instructions to ensure that the caption meets the requirements. If there is an issue identified, such as a missing key object, that caption can be sent back for a human to correct the issue before exporting as part of the training dataset.

Video captioning

The following screenshot shows a video captioning UI to generate rich video captions with timestamp tags. You can request a video caption project to gather data to build text-to-video or video-to-text models.

In this labeling UI, we have built a similar mental framework to ensure high-quality captions are written. The human annotator can control the video on the left side and create descriptions and timestamps for each activity shown in the video on the right side with the UI elements. Similar to the image captioning UI, there is also a place for the annotator to write a detailed description of the video setting, background, and environment. Finally, the annotator is instructed to combine all the elements into a coherent video caption.

Similar to the image caption case, the video captions may optionally flow through a human-based quality control workflow to determine if your requirements are met. If there is an issue with the video captions, it will be sent for rework by the human annotator workforce.

Conclusion

Ground Truth Plus can help you prepare high-quality datasets to fine-tune foundation models for generative AI tasks, from answering questions to generating images and videos. It also allows skilled human workforces to review model outputs to ensure that they are aligned with human preferences. Additionally, it enables application builders to customize models using their industry or company data to ensure their application represents their preferred voice and style. These are the first of many innovations in Ground Truth Plus, and more are in development. Stay tuned for future posts.

Interested in starting a project to build or improve your generative AI models and applications? Get started with Ground Truth Plus by connecting with our team today.

About the authors

Jesse Manders is a Senior Product Manager in the AWS AI/ML human in the loop services team. He works at the intersection of AI and human interaction with the goal of creating and improving AI/ML products and services to meet our needs. Previously, Jesse held leadership roles in engineering at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Romi Datta is a Senior Manager of Product Management in the Amazon SageMaker team responsible for Human in the Loop services. He has been in AWS for over 4 years, holding several product management leadership roles in SageMaker, S3 and IoT. Prior to AWS he worked in various product management, engineering and operational leadership roles at IBM, Texas Instruments and Nvidia. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.

Jonathan Buck is a Software Engineer at Amazon Web Services working at the intersection of machine learning and distributed systems. His work involves productionizing machine learning models and developing novel software applications powered by machine learning to put the latest capabilities in the hands of customers.

Alex Williams is an applied scientist in the human-in-the-loop science team at AWS AI where he conducts interactive systems research at the intersection of human-computer interaction (HCI) and machine learning. Before joining Amazon, he was a professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee where he co-directed the People, Agents, Interactions, and Systems (PAIRS) research laboratory. He has also held research positions at Microsoft Research, Mozilla Research, and the University of Oxford. He regularly publishes his work at premier publication venues for HCI, such as CHI, CSCW, and UIST. He holds a PhD from the University of Waterloo.

Sarah Gao is a Software Development Manager in Amazon SageMaker Human In the Loop (HIL) responsible for building the ML based labeling platform. Sarah has been in AWS for over 4 years, holding several software management leadership roles in EC2 security and SageMaker. Prior to AWS she worked in various engineering management roles at Oracle and Sun Microsystem.

Erran Li is the applied science manager at human-in-the-loop services, AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously he was a senior scientist at Alexa AI, the head of machine learning at Scale AI and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber working on machine learning for autonomous driving, machine learning systems and strategic initiatives of AI. He started his career at Bell Labs and was adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems and adversarial machine learning. He has a PhD in computer science at Cornell University. He is an ACM Fellow and IEEE Fellow.

Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker

May 26, 2023

by Simon Zamarin Amazon AWS

Text-to-image generation is a task in which a machine learning (ML) model generates an image from a textual description. The goal is to generate an image that closely matches the description, capturing the details and nuances of the text. This task is challenging because it requires the model to understand the semantics and syntax of the text and to generate photorealistic images. There are many practical applications of text-to-image generation in AI photography, concept art, building architecture, fashion, video games, graphic design, and much more.

Stable Diffusion is a text-to-image model that empowers you to create high-quality images within seconds. When real-time interaction with this type of model is the goal, ensuring a smooth user experience depends on the use of accelerated hardware for inference, such as GPUs or AWS Inferentia2, Amazon’s own ML inference accelerator. The steep costs involved in using GPUs typically requires optimizing the utilization of the underlying compute, even more so when you need to deploy different architectures or personalized (fine-tuned) models. Amazon SageMaker multi-model endpoints (MMEs) help you address this problem by helping you scale thousands of models into one endpoint. By using a shared serving container, you can host multiple models in a cost-effective, scalable manner within the same endpoint, and even the same GPU.

In this post, you will learn about Stable Diffusion model architectures, different types of Stable Diffusion models, and techniques to enhance image quality. We also show you how to deploy Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server.


Prompt: portrait of a cute bernese dog, art by elke Vogelsang, 8k ultra realistic, trending on artstation, 4 k	Prompt: architecture design of living room, 8 k ultra-realistic, 4 k, hyperrealistic, focused, extreme details	Prompt: New York skyline at night, 8k, long shot photography, unreal engine 5, cinematic, masterpiece

Stable Diffusion architecture

Stable Diffusion is a text-to-image open-source model that you can use to create images of different styles and content simply by providing a text prompt. In the context of text-to-image generation, a diffusion model is a generative model that you can use to generate high-quality images from textual descriptions. Diffusion models are a type of generative model that can capture the complex dependencies between the input and output modalities text and images.

The following diagram shows a high-level architecture of a Stable Diffusion model.

It consists of the following key elements:

Text encoder – CLIP is a transformers-based text encoder model that takes input prompt text and converts it into token embeddings that represent each word in the text. CLIP is trained on a dataset of images and their captions, a combination of image encoder and text encoder.
U-Net – A U-Net model takes token embeddings from CLIP along with an array of noisy inputs and produces a denoised output. This happens though a series of iterative steps, where each step processes an input latent tensor and produces a new latent space tensor that better represents the input text.
Auto encoder-decoder – This model creates the final images. It takes the final denoised latent output from the U-Net model and converts it into images that represents the text input.

Types of Stable Diffusion models

In this post, we explore the following pre-trained Stable Diffusion models by Stability AI from the Hugging Face model hub.

stable-diffusion-2-1-base

Use this model to generate images based on a text prompt. This is a base version of the model that was trained on LAION-5B. The model was trained on a subset of the large-scale dataset LAION-5B, and mainly with English captions. We use StableDiffusionPipeline from the diffusers library to generate images from text prompts. This model can create images of dimension 512 x 512. It uses the following parameters:

prompt – A prompt can be a text word, phrase, sentences, or paragraphs.
negative_prompt – You can also pass a negative prompt to exclude specified elements from the image generation process and to enhance the quality of the generated images.
guidance_scale – A higher guidance scale results in an image more closely related to the prompt, at the expense of image quality. If specified, it must be a float.

stable-diffusion-2-depth

This model is used to generate new images from existing ones while preserving the shape and depth of the objects in the original image. This stable-diffusion-2-depth model is fine-tuned from stable-diffusion-2-base, an extra input channel to process the (relative) depth prediction. We use StableDiffusionDepth2ImgPipeline from the diffusers library to load the pipeline and generate depth images. The following are the additional parameters specific to the depth model:

image – The initial image to condition the generation of new images.
num_inference_steps (optional) – The number of denoising steps. More denoising steps usually leads to a higher-quality image at the expense of slower inference. This parameter is modulated by strength.
strength (optional) – Conceptually, this indicates how much to transform the reference image. The value must be between 0–1. image is used as a starting point, adding more noise to it the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise will be maximum and the denoising process will run for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores image. For more details, refer to the following code.

stable-diffusion-2-inpainting

You can use this model for AI image restoration use cases. You can also use it to create novel designs and images from the prompts and additional arguments. This model is also derived from the base model and has a mask generation strategy. It specifies the mask of the original image to represent segments to be changed and segments to leave unchanged. We use StableDiffusionUpscalePipeline from the diffusers library to apply inpaint changes on original image. The following additional parameter is specific to the depth model:

mask_input – An image where the blacked-out portion remains unchanged during image generation and the white portion is replaced

stable-diffusion-x4-upscaler

This model is also derived from the base model, additionally trained on the 10M subset of LAION containing 2048 x 2048 images. As the name implies, it can be used to upscale lower-resolution images to higher resolutions

Use case overview

For this post, we deploy an AI image service with multiple capabilities, including generating novel images from text, changing the styles of existing images, removing unwanted objects from images, and upscaling low-resolution images to higher resolutions. Using several variations of Stable Diffusion models, you can address all of these use cases within a single SageMaker endpoint. This means that you’ll need to host large number of models in a performant, scalable, and cost-efficient way. In this post, we show how to deploy multiple Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server. You will learn about the implementation details, optimization techniques, and best practices to work with text-to-image models.

The following table summarizes the Stable Diffusion models that we deploy to a SageMaker MME.

Model Name	Model Size in GB
`stabilityai/stable-diffusion-2-1-base`	2.5
`stabilityai/stable-diffusion-2-depth`	2.7
`stabilityai/stable-diffusion-2-inpainting`	2.5
`stabilityai/stable-diffusion-x4-upscaler`	7

Solution overview

The following steps are involved in deploying Stable Diffusion models to SageMaker MMEs:

Use the Hugging Face hub to download the Stable Diffusion models to a local directory. This will download scheduler, text_encoder, tokenizer, unet, and vae for each Stable Diffusion model into its corresponding local directory. We use the revision="fp16" version of the model.
Set up the NVIDIA Triton model repository, model configurations, and model serving logic model.py. Triton uses these artifacts to serve predictions.
Package the conda environment with additional dependencies and the package model repository to be deployed to the SageMaker MME.
Package the model artifacts in an NVIDIA Triton-specific format and upload model.tar.gz to Amazon Simple Storage Service (Amazon S3). The model will be used for generating images.
Configure a SageMaker model, endpoint configuration, and deploy the SageMaker MME.
Run inference and send prompts to the SageMaker endpoint to generate images using the Stable Diffusion model. We specify the TargetModel variable and invoke different Stable Diffusion models to compare the results visually.

We have published the code to implement this solution architecture in the GitHub repo. Follow the README instructions to get started.

Serve models with an NVIDIA Triton Inference Server Python backend

We use a Triton Python backend to deploy the Stable Diffusion pipeline model to a SageMaker MME. The Python backend lets you serve models written in Python by Triton Inference Server. To use the Python backend, you need to create a Python file model.py that has the following structure: Every Python backend can implement four main functions in the TritonPythonModel class:

import triton_python_backend_utils as pb_utils
class TritonPythonModel:
"""Your Python model must use the same class name. Every Python model
that is created must have "TritonPythonModel" as the class name.
"""
def auto_complete_config(auto_complete_model_config):
def initialize(self, args):
def execute(self, requests):
def finalize(self):

Every Python backend can implement four main functions in the TritonPythonModel class: auto_complete_config, initialize, execute, and finalize.

initialize is called when the model is being loaded. Implementing initialize is optional. initialize allows you to do any necessary initializations before running inference. In the initialize function, we create a pipeline and load the pipelines using from_pretrained checkpoints. We configure schedulers from the pipeline scheduler config pipe.scheduler.config. Finally, we specify xformers optimizations to enable the xformer memory efficient parameter enable_xformers_memory_efficient_attention. We provide more details on xformers later in this post. You can refer to model.py of each model to understand the different pipeline details. This file can be found in the model repository.

The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you are given a list of InferenceRequest objects. We pass the input text prompt to the pipeline to get an image from the model. Images are decoded and the generated image is returned from this function call.

We get the input tensor from the name defined in the model configuration config.pbtxt file. From the inference request, we get prompt, negative_prompt, and gen_args, and decode them. We pass all the arguments to the model pipeline object. Encode the image to return the generated image predictions. You can refer to the config.pbtxt file of each model to understand the different pipeline details. This file can be found in the model repository. Finally, we wrap the generated image in InferenceResponse and return the response.

Implementing finalize is optional. This function allows you to do any cleanups necessary before the model is unloaded from Triton Inference Server.

When working with the Python backend, it’s the user’s responsibility to ensure that the inputs are processed in a batched manner and that responses are sent back accordingly. To achieve this, we recommend following these steps:

Loop through all requests in the requests object to form a batched_input.
Run inference on the batched_input.
Split the results into multiple InferenceResponse objects and concatenate them as the responses.

Refer to the Triton Python backend documentation or Host ML models on Amazon SageMaker using Triton: Python backend for more details.

NVIDIA Triton model repository and configuration

The model repository contains the model serving script, model artifacts and tokenizer artifacts, a packaged conda environment (with dependencies needed for inference), the Triton config file, and the Python script used for inference. The latter is mandatory when you use the Python backend, and you should use the Python file model.py. Let’s explore the configuration file of the inpaint Stable Diffusion model and understand the different options specified:

name: "sd_inpaint"
backend: "python"
max_batch_size: 8
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "negative_prompt"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
    optional: true
  },
  {
    name: "image"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "mask_image"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "gen_args"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
    optional: true
  }
]
output [
  {
    name: "generated_image"
    data_type: TYPE_STRING    
    dims: [
      -1
    ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/tmp/conda/sd_env.tar.gz"
  }
}

The following table explains the various parameters and values:

Key	Details
`name`	It’s not required to include the model configuration name property. In the event that the configuration doesn’t specify the model’s name, it’s presumed to be identical to the name of the model repository directory where the model is stored. However, if a name is provided, it must match the name of the model repository directory where the model is stored. `sd_inpaint` is the config property name.
`backend`	This specifies the Triton framework to serve model predictions. This is a mandatory parameter. We specify `python`, because we’ll be using the Triton Python backend to host the Stable Diffusion models.
`max_batch_size`	This indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton.
`input→ prompt`	Text prompt of type string. Specify -1 to accept dynamic tensor shape.
`input→ negative_prompt`	Negative text prompt of type string. Specify -1 to accept dynamic tensor shape.
`input→ mask_image`	Base64 encoded mask image of type string. Specify -1 to accept dynamic tensor shape.
`input→ image`	Base64 encoded image of type string. Specify -1 to accept dynamic tensor shape.
`input→ gen_args`	JSON encoded additional arguments of type string. Specify -1 to accept dynamic tensor shape.
`output→ generated_image`	Generated image of type string. Specify -1 to accept dynamic tensor shape.
`instance_group`	You can use this this setting to place multiple run instances of a model on every GPU or on only certain GPUs. We specify `KIND_GPU` to make copies of the model on available GPUs.
`parameters`	We set the conda environment path to `EXECUTION_ENV_PATH`.

For details about the model repository and configurations of other Stable Diffusion models, refer to the code in the GitHub repo. Each directory contains artifacts for the specific Stable Diffusion models.

Package a conda environment and extend the SageMaker Triton container

SageMaker NVIDIA Triton container images don’t contain libraries like transformer, accelerate, and diffusers to deploy and serve Stable Diffusion models. However, Triton allows you to bring additional dependencies using conda-pack. Let’s start by creating the conda environment with the necessary dependencies outlined in the environment.yml file and create a tar model artifact sd_env.tar.gz file containing the conda environment with dependencies installed in it. Run the following YML file to create a conda-pack artifact and copy the artifact to the local directory from where it will be uploaded to Amazon S3. Note that we will be uploading the conda artifacts as one of the models in the MME and invoking this model to set up the conda environment in the SageMaker hosting ML instance.

%%writefile environment.yml
name: mme_env
dependencies:
  - python=3.8
  - pip
  - pip:
      - numpy
      - torch --extra-index-url https://download.pytorch.org/whl/cu118
      - accelerate
      - transformers
      - diffusers
      - xformers
      - conda-pack

!conda env create -f environment.yml –force

Upload model artifacts to Amazon S3

SageMaker expects the .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. Therefore, we create a tar artifact with content from the Triton model repository. We can use this S3 bucket to host thousands of model artifacts, and the SageMaker MME will use models from this location to dynamically load and serve a large number of models. We store all the Stable Diffusion models in this Amazon S3 location.

Deploy the SageMaker MME

In this section, we walk through the steps to deploy the SageMaker MME by defining container specification, SageMaker model and endpoint configurations.

Define the serving container

In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker will create the endpoint with the MME container specifications. We set the container with an image that supports deploying MMEs with GPU. See Supported algorithms, frameworks, and instances for more details.

We see all three model artifacts in the following Amazon S3 ModelDataUrl location:

container = {"Image": mme_triton_image_uri, 
             "ModelDataUrl": model_data_url, 
             "Mode": "MultiModel"}

Create an MME object

We use the SageMaker Boto3 client to create the model using the create_model API. We pass the container definition to the create model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, 
    ExecutionRoleArn=role, 
    PrimaryContainer=container
)

Define configurations for the MME

Create an MME configuration using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (we use the same instance type that we are using to host our SageMaker notebook). We recommend configuring your endpoints with at least two instances with real-life use cases. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

Create an MME

Use the preceding endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish:

create_endpoint_response = sm_client.create_endpoint(
                EndpointName=endpoint_name, 
                EndpointConfigName=endpoint_config_name
)

The status will change to InService when the deployment is successful.

Generate images using different versions of Stable Diffusion models

Let’s start by invoking the base model with a prompt and getting the generated image. We pass the inputs to the base model with prompt, negative_prompt, and gen_args as a dictionary. We set the data type and shape of each input item in the dictionary and pass it as input to the model.

inputs = dict(prompt = "Infinity pool on top of a high rise overlooking Central Park",
             negative_prompt = "blur,low detail, low quality",
             gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=8))
)
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_base.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])

Prompt: Infinity pool on top of a high rise overlooking Central Park

Working with this image, we can modify it with the versatile Stable Diffusion depth model. For example, we can change the style of the image to an oil painting, or change the setting from Central Park to Yellowstone National Park simply by passing the original image along with a prompt describing the changes we would like to see.

We invoke the depth model by specifying sd_depth.tar.gz in the TargetModel of the invoke_endpoint function call. In the outputs, notice how the orientation of the original image is preserved, but for one example, the NYC buildings have been transformed into rock formations of the same shape.

inputs = dict(prompt = "highly detailed oil painting of an inifinity pool overlooking central park",
              image=image,
              gen_args = json.dumps(dict(num_inference_steps=50, strength=0.9))
              )
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_depth.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
print("original image")
display(original_image)
print("generated image")
display(decode_image(output[0]["data"][0]))


Original image	Oil painting	Yellowstone Park

Another useful model is Stable Diffusion inpainting, which we can use to remove certain parts of the image. Let’s say you want to remove the tree in the following example image. We can do so by invoking the inpaint model sd_inpaint.tar.gz. To remove the tree, we need to pass a mask_image, which indicates which regions of the image should be retained and which should be filled in. The black pixel portion of the mask image indicates the regions that should remain unchanged, and the white pixels indicate what should be replaced.

image = encode_image(original_image).decode("utf8")
mask_image = encode_image(Image.open("sample_images/bertrand-gabioud-mask.png")).decode("utf8")
inputs = dict(prompt = "building, facade, paint, windows",
              image=image,
              mask_image=mask_image,
              negative_prompt = "tree, obstruction, sky, clouds",
              gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=10))
              )
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_inpaint.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])


Original image	Mask image	Inpaint image

In our final example, we downsize the original image that was generated earlier from its 512 x 512 resolution to 128 x 128. We then invoke the Stable Diffusion upscaler model to upscale the image back to 512 x 512. We use the same prompt to upscale the image as what we used to generate the initial image. While not necessary, providing a prompt that describes the image helps guide the upscaling process and should lead to better results.

low_res_image = output_image.resize((128, 128))
inputs = dict(prompt = "Infinity pool on top of a high rise overlooking Central Park",
             image=encode_image(low_res_image).decode("utf8")
)

payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}

response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_upscale.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
upscaled_image = decode_image(output[0]["data"][0])


Low-resolution image	Upscaled image

Although the upscaled image is not as detailed as the original, it’s a marked improvement over the low-resolution one.

Optimize for memory and speed

The xformers library is a way to speed up image generation. This optimization is only available for NVIDIA GPUs. It speeds up image generation and lowers VRAM usage. We have used the xformers library for memory-efficient attention and speed. When the enable_xformers_memory_efficient_attention option is enabled, you should observe lower GPU memory usage and a potential speedup at inference time.

Clean Up

Follow the instruction in the clean up section of the notebook to delete the resource provisioned part of this blog to avoid unnecessary charges. Refer Amazon SageMaker Pricing for details the cost of the inference instances.

Conclusion

In this post, we discussed Stable Diffusion models and how you can deploy different versions of Stable Diffusion models cost-effectively using SageMaker multi-model endpoints. You can use this approach to build a creator image generation and editing tool. Check out the code samples in the GitHub repo to get started and let us know about the cool generative AI tool that you build.

About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and architecture to build and deploy ML applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Build a powerful question answering bot with Amazon SageMaker, Amazon OpenSearch Service, Streamlit, and LangChain

May 25, 2023

by Amit Arora Amazon AWS

One of the most common applications of generative AI and large language models (LLMs) in an enterprise environment is answering questions based on the enterprise’s knowledge corpus. Amazon Lex provides the framework for building AI based chatbots. Pre-trained foundation models (FMs) perform well at natural language understanding (NLU) tasks such summarization, text generation and question answering on a broad variety of topics but either struggle to provide accurate (without hallucinations) answers or completely fail at answering questions about content that they haven’t seen as part of their training data. Furthermore, FMs are trained with a point in time snapshot of data and have no inherent ability to access fresh data at inference time; without this ability they might provide responses that are potentially incorrect or inadequate.

A commonly used approach to address this problem is to use a technique called Retrieval Augmented Generation (RAG). In the RAG-based approach we convert the user question into vector embeddings using an LLM and then do a similarity search for these embeddings in a pre-populated vector database holding the embeddings for the enterprise knowledge corpus. A small number of similar documents (typically three) is added as context along with the user question to the “prompt” provided to another LLM and then that LLM generates an answer to the user question using information provided as context in the prompt. RAG models were introduced by Lewis et al. in 2020 as a model where parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. To understand the overall structure of a RAG-based approach, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.

In this post we provide a step-by-step guide with all the building blocks for creating an enterprise ready RAG application such as a question answering bot. We use a combination of different AWS services, open-source foundation models (FLAN-T5 XXL for text generation and GPT-j-6B for embeddings) and packages such as LangChain for interfacing with all the components and Streamlit for building the bot frontend.

We provide an AWS Cloud Formation template to stand up all the resources required for building this solution. We then demonstrate how to use LangChain for tying everything together:

Interfacing with LLMs hosted on Amazon SageMaker.
Chunking of knowledge base documents.
Ingesting document embeddings into Amazon OpenSearch Service.
Implementing the question answering task.

We can use the same architecture to swap the open-source models with the Amazon Titan models. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar generative AI applications using Amazon Bedrock, so stay tuned.

Solution overview

We use the SageMaker docs as the knowledge corpus for this post. We convert the HTML pages on this site into smaller overlapping chunks (to retain some context continuity between chunks) of information and then convert these chunks into embeddings using the gpt-j-6b model and store the embeddings in OpenSearch Service. We implement the RAG functionality inside an AWS Lambda function with Amazon API Gateway to handle routing all requests to the Lambda. We implement a chatbot application in Streamlit which invokes the function via the API Gateway and the function does a similarity search in the OpenSearch Service index for the embeddings of user question. The matching documents (chunks) are added to the prompt as context by the Lambda function and then the function uses the flan-t5-xxl model deployed as a SageMaker endpoint to generate an answer to the user question. All the code for this post is available in the GitHub repo.

The following figure represents the high-level architecture of the proposed solution.

Figure 1: Architecture

Step-by-step explanation:

The User provides a question via the Streamlit web application.
The Streamlit application invokes the API Gateway endpoint REST API.
The API Gateway invokes the Lambda function.
The function invokes the SageMaker endpoint to convert user question into embeddings.
The function invokes invokes an OpenSearch Service API to find similar documents to the user question.
The function creates a “prompt” with the user query and the “similar documents” as context and asks the SageMaker endpoint to generate a response.
The response is provided from the function to the API Gateway.
The API Gateway provides the response to the Streamlit application.
The User is able to view the response on the Streamlit application,

As illustrated in the architecture diagram, we use the following AWS services:

SageMaker and Amazon SageMaker JumpStart for hosting the two LLMs.
OpenSearch Service for storing the embeddings of the enterprise knowledge corpus and doing similarity search with user questions.
Lambda for implementing the RAG functionality and exposing it as a REST endpoint via the API Gateway.
Amazon SageMaker Processing jobs for large scale data ingestion into OpenSearch.
Amazon SageMaker Studio for hosting the Streamlit application.
AWS Identity and Access Management roles and policies for access management.
AWS CloudFormation for creating the entire solution stack through infrastructure as code.

In terms of open-source packages used in this solution, we use LangChain for interfacing with OpenSearch Service and SageMaker, and FastAPI for implementing the REST API interface in the Lambda.

The workflow for instantiating the solution presented in this post in your own AWS account is as follows:

Run the CloudFormation template provided with this post in your account. This will create all the necessary infrastructure resources needed for this solution:
- SageMaker endpoints for the LLMs
- OpenSearch Service cluster
- API Gateway
- Lambda function
- SageMaker Notebook
- IAM roles
Run the data_ingestion_to_vectordb.ipynb notebook in the SageMaker notebook to ingest data from SageMaker docs into an OpenSearch Service index.
Run the Streamlit application on a terminal in Studio and open the URL for the application in a new browser tab.
Ask your questions about SageMaker via the chat interface provided by the Streamlit app and view the responses generated by the LLM.

These steps are discussed in detail in the following sections.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with LLMs, OpenSearch Service and SageMaker.

We need access to accelerated instances (GPUs) for hosting the LLMs. This solution uses one instance each of ml.g5.12xlarge and ml.g5.24xlarge; you can check the availability of these instances in your AWS account and request these instances as needed via a Sevice Quota increase request as shown in the following screenshot.

Figure 2: Service Quota Increase Request

Use AWS Cloud Formation to create the solution stack

We use AWS CloudFormation to create a SageMaker notebook called aws-llm-apps-blog and an IAM role called LLMAppsBlogIAMRole. Choose Launch Stack for the Region you want to deploy resources to. All parameters needed by the CloudFormation template have default values already filled in, except for the OpenSearch Service password which you’d have to provide. Make a note of the OpenSearch Service username and password, we use those in subsequent steps. This template takes about 15 minutes to complete.

AWS Region	Link
`us-east-1`
`us-west-2`
`eu-west-1`
`ap-northeast-1`

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the values for OpenSearchDomainEndpoint and LLMAppAPIEndpoint. We use those in the subsequent steps.

Figure 3: Cloud Formation Stack Outputs

Ingest the data into OpenSearch Service

To ingest the data, complete the following steps:

On the SageMaker console, choose Notebooks in the navigation pane.
Select the notebook aws-llm-apps-blog and choose Open JupyterLab.

Figure 4: Open JupyterLab
Choose data_ingestion_to_vectordb.ipynb to open it in JupyterLab. This notebook will ingest the SageMaker docs to an OpenSearch Service index called llm_apps_workshop_embeddings.

Figure 5: Open Data Ingestion Notebook
When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook. This will download the dataset locally into the notebook and then ingest it into the OpenSearch Service index. This notebook takes about 20 minutes to run. The notebook also ingests the data into another vector database called FAISS. The FAISS index files are saved locally and the uploaded to Amazon Simple Storage Service (S3) so that they can optionally be used by the Lambda function as an illustration of using an alternate vector database.

Figure 6: Notebook Run All Cells

Now we’re ready to split the documents into chunks, which can then be converted into embeddings to be ingested into OpenSearch. We use the LangChain RecursiveCharacterTextSplitter class to chunk the documents and then use the LangChain SagemakerEndpointEmbeddingsJumpStart class to convert these chunks into embeddings using the gpt-j-6b LLM. We store the embeddings in OpenSearch Service via the LangChain OpenSearchVectorSearch class. We package this code into Python scripts that are provided to the SageMaker Processing Job via a custom container. See the data_ingestion_to_vectordb.ipynb notebook for the full code.

Create a custom container, then install in it the LangChain and opensearch-py Python packages.
Upload this container image to Amazon Elastic Container Registry (ECR).

We use the SageMaker ScriptProcessor class to create a SageMaker Processing job that will run on multiple nodes.

The data files available in Amazon S3 are automatically distributed across in the SageMaker Processing job instances by setting s3_data_distribution_type='ShardedByS3Key' as part of the ProcessingInput provided to the processing job.
Each node processes a subset of the files and this brings down the overall time required to ingest the data into OpenSearch Service.

Each node also uses Python multiprocessing to internally also parallelize the file processing. Therefore, there are two levels of parallelization happening, one at the cluster level where individual nodes are distributing the work (files) amongst themselves and another at the node level where the files in a node are also split between multiple processes running on the node.

 # setup the ScriptProcessor with the above parameters
processor = ScriptProcessor(base_job_name=base_job_name,
                            image_uri=image_uri,
                            role=aws_role,
                            instance_type=instance_type,
                            instance_count=instance_count,
                            command=["python3"],
                            tags=tags)

# setup input from S3, note the ShardedByS3Key, this ensures that 
# each instance gets a random and equal subset of the files in S3.
inputs = [ProcessingInput(source=f"s3://{bucket}/{app_name}/{DOMAIN}",
                          destination='/opt/ml/processing/input_data',
                          s3_data_distribution_type='ShardedByS3Key',
                          s3_data_type='S3Prefix')]


logger.info(f"creating an opensearch index with name={opensearch_index}")
# ready to run the processing job
st = time.time()
processor.run(code="container/load_data_into_opensearch.py",
              inputs=inputs,
              outputs=[],
              arguments=["--opensearch-cluster-domain", opensearch_domain_endpoint,
                        "--opensearch-secretid", os_creds_secretid_in_secrets_manager,
                        "--opensearch-index-name", opensearch_index,
                        "--aws-region", aws_region,
                        "--embeddings-model-endpoint-name", embeddings_model_endpoint_name,
                        "--chunk-size-for-doc-split", str(CHUNK_SIZE_FOR_DOC_SPLIT),
                        "--chunk-overlap-for-doc-split", str(CHUNK_OVERLAP_FOR_DOC_SPLIT),
                        "--input-data-dir", "/opt/ml/processing/input_data",
                        "--create-index-hint-file", CREATE_OS_INDEX_HINT_FILE,
                        "--process-count", "2"])

Close the notebook after all cells run without any error. Your data is now available in OpenSearch Service. Enter the following URL in your browser’s address bar to get a count of documents in the llm_apps_workshop_embeddings index. Use the OpenSearch Service domain endpoint from the CloudFormation stack outputs in the URL below. You’d be prompted for the OpenSearch Service username and password, these are available from the CloudFormations stack.
```
https://your-opensearch-domain-endpoint/llm_apps_workshop_embeddings/_count
```

The browser window should show an output similar to the following. This output shows that 5,667 documents were ingested into the llm_apps_workshop_embeddings index. {"count":5667,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}

Run the Streamlit application in Studio

Now we’re ready to run the Streamlit web application for our question answering bot. This application allows the user to ask a question and then fetches the answer via the /llm/rag REST API endpoint provided by the Lambda function.

Studio provides a convenient platform to host the Streamlit web application. The following steps describes how to run the Streamlit app on Studio. Alternatively, you could also follow the same procedure to run the app on your laptop.

Open Studio and then open a new terminal.
Run the following commands on the terminal to clone the code repository for this post and install the Python packages needed by the application:
```
git clone https://github.com/aws-samples/llm-apps-workshop
cd llm-apps-workshop/blogs/rag/app
pip install -r requirements.txt
```
The API Gateway endpoint URL that is available from the CloudFormation stack output needs to be set in the webapp.py file. This is done by running the following sed command. Replace the replace-with-LLMAppAPIEndpoint-value-from-cloudformation-stack-outputs in the shell commands with the value of the LLMAppAPIEndpoint field from the CloudFormation stack output and then run the following commands to start a Streamlit app on Studio.
```
EP=replace-with-LLMAppAPIEndpoint-value-from-cloudformation-stack-outputs
# replace __API_GW_ENDPOINT__ with  output from the cloud formation stack
sed -i "s|__API_GW_ENDPOINT__|$EP|g" webapp.py
streamlit run webapp.py
```
When the application runs successfully, you’ll see an output similar to the following (the IP addresses you will see will be different from the ones shown in this example). Note the port number (typically 8501) from the output to use as part of the URL for app in the next step.
```
sagemaker-user@studio$ streamlit run webapp.py 

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.

You can now view your Streamlit app in your browser.

Network URL: http://169.255.255.2:8501
External URL: http://52.4.240.77:8501
```
You can access the app in a new browser tab using a URL that is similar to your Studio domain URL. For example, if your Studio URL is https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/lab? then the URL for your Streamlit app will be https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/proxy/8501/webapp (notice that lab is replaced with proxy/8501/webapp). If the port number noted in the previous step is different from 8501 then use that instead of 8501 in the URL for the Streamlit app.

The following screenshot shows the app with a couple of user questions.

A closer look at the RAG implementation in the Lambda function

Now that we have the application working end to end, lets take a closer look at the Lambda function. The Lambda function uses FastAPI to implement the REST API for RAG and the Mangum package to wrap the API with a handler that we package and deploy in the function. We use the API Gateway to route all incoming requests to invoke the function and handle the routing internally within our application.

The following code snippet shows how we find documents in the OpenSearch index that are similar to the user question and then create a prompt by combining the question and the similar documents. This prompt is then provided to the LLM for generating an answer to the user question.

@router.post("/rag")
async def rag_handler(req: Request) -> Dict[str, Any]:
    # dump the received request for debugging purposes
    logger.info(f"req={req}")

    # initialize vector db and SageMaker Endpoint
    _init(req)

    # Use the vector db to find similar documents to the query
    # the vector db call would automatically convert the query text
    # into embeddings
    docs = _vector_db.similarity_search(req.q, k=req.max_matching_docs)
    logger.info(f"here are the {req.max_matching_docs} closest matching docs to the query="{req.q}"")
    for d in docs:
        logger.info(f"---------")
        logger.info(d)
        logger.info(f"---------")

    # now that we have the matching docs, lets pack them as a context
    # into the prompt and ask the LLM to generate a response
    prompt_template = """Answer based on context:nn{context}nn{question}"""

    prompt = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    logger.info(f"prompt sent to llm = "{prompt}"")
    chain = load_qa_chain(llm=_sm_llm, prompt=prompt)
    answer = chain({"input_documents": docs, "question": req.q}, return_only_outputs=True)['output_text']
    logger.info(f"answer received from llm,nquestion: "{req.q}"nanswer: "{answer}"")
    resp = {'question': req.q, 'answer': answer}
    if req.verbose is True:
        resp['docs'] = docs

    return resp

Clean up

To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack as shown in the following screenshot.

Figure 7: Cleaning Up

Conclusion

In this post, we showed how to create an enterprise ready RAG solution using a combination of AWS service, open-source LLMs and open-source Python packages.

We encourage you to learn more by exploring JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service and building a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.

About the Authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Get insights on your user’s search behavior from Amazon Kendra using an ML-powered serverless stack

May 25, 2023

by Genta Watanabe Amazon AWS

Amazon Kendra is a highly accurate and intelligent search service that enables users to search unstructured and structured data using natural language processing (NLP) and advanced search algorithms. With Amazon Kendra, you can find relevant answers to your questions quickly, without sifting through documents. However, just enabling end-users to get the answers to their queries is not enough in today’s world. We need to constantly understand the end-user’s search behavior, such as what are the top queries for the month, have any new query that queries appeared recently, what percentage of queries received instant answer, and more.

Although the Amazon Kendra console comes equipped with an analytics dashboard, many of our customers prefer to build a custom dashboard. This allows you to create unique views and filters, and grants management teams access to a streamlined, one-click dashboard without needing to log in to the AWS Management Console and search for the appropriate dashboard. In addition, you can enhance your dashboard’s functionality by adding preprocessing logic, such as grouping similar top queries. For example, you may want to group similar queries such as “What is Amazon Kendra” and “What is the purpose of Amazon Kendra” together so that you can effectively analyze the metrics and gain a deeper understanding of the data. Such grouping of similar queries can be done using the concept of semantic similarity.

This post discusses an end-to-end solution to implement this use case, which includes using AWS Lambda to extract the summarized metrics from Amazon Kendra, calculating the semantic similarity score using a Hugging Face model hosted on an Amazon SageMaker Serverless Inference endpoint to group similar queries, and creating an Amazon QuickSight dashboard to display the user insights effectively.

Solution overview

The following diagram illustrates our solution architecture.

The high-level workflow is as follows:

An Amazon EventBridge scheduler triggers Lambda functions once a month to extract last month’s search metrics from Amazon Kendra.
The Lambda functions upload the search metrics to an Amazon Simple Storage Service (Amazon S3) bucket.
The Lambda functions group similar queries in the uploaded file based on the semantic similarity score by Hugging Face model hosted on a SageMaker inference endpoint.
An AWS Glue crawler creates or updates the AWS Glue Data Catalog from the uploaded file in the S3 bucket for an Amazon Athena table.
QuickSight uses the Athena table dataset to create analyses and dashboards.

For this solution, we deploy the infrastructure resources to create the QuickSight analysis and dashboard using an AWS CloudFormation template.

Prerequisites

Complete the following prerequisite steps:

If you’re a first-time user of QuickSight in your AWS account, sign up for QuickSight.
Get the Amazon Kendra index ID that you want visualize your search metrics from Amazon Kendra. You will have to use the search engine for a while (for example, a few weeks) to be able to extract a sufficient amount of data to use to extract some insights.
Clone the GitHub repo to create the container image:
Create an Amazon Elastic Container Registry (Amazon ECR) repository in us-east-1 and push the container image created by the downloaded Dockerfile. For instructions, refer to Creating a private repository.
Run the following commands in the directory of your local environment to create and push the container image to the ECR repository you created:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
docker build -t <YOUR_ECR_REPOSITORY_NAME> .
docker tag <YOUR_ECR_REPOSITORY_NAME>:latest <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/<YOUR_ECR_REPOSITORY_NAME>:latest 
docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/<YOUR_ECR_REPOSITORY_NAME>:latest

Deploy the CloudFormation template

Complete the following steps to deploy the CloudFormation template:

Download the CloudFormation template kendrablog-sam-template.yml.
On the AWS CloudFormation console, create a new stack.

Use the us-east-1 Region for this deployment.

Upload the template directly or through your preferred S3 bucket.
For KendraIndex, enter the Amazon Kendra index ID from the prerequisites.
For LambdaECRRepository, enter the ECR repository from the prerequisites.
For QSIdentityRegion, enter the identity Region of QuickSight. The identity Region aligns with your Region selection when you signed up your QuickSight subscription.
For QSUserDefaultPassward, enter the default password to use for your QuickSight user.

You’ll be prompted to change this password when you first sign in to the QuickSight console.

For QSUserEmail, enter the email address to use for the QuickSight user.
Choose Next.
Leave other settings as default and choose Next.
Select the acknowledgement check boxes and choose Create stack.

When the deployment is complete, you can confirm all the generated resources on the stack’s Resources tab on the AWS CloudFormation console.

We walk through some of the key components of this solution in the following sections.

Get insights from Amazon Kendra search metrics

We can get the metrics data from Amazon Kendra using the GetSnapshots API. There are 10 metrics for analyzing what information the users are searching for: 5 metrics include trends data for us to look for patterns over time, and 5 metrics use just a snapshot or aggregated data. The metrics with the daily trend data are clickthrough rate, zero click rate, zero search results rate, instant answer rate, and total queries. The metrics with aggregated data are top queries, top queries with zero clicks, top queries with zero search results, top clicked on documents, and total documents.

We use Lambda functions to get the search metrics data from Amazon Kendra. The functions extract the metrics from Amazon Kendra and store them in Amazon S3. You can find the functions in the GitHub repo.

Create a SageMaker serverless endpoint and host a Hugging Face model to calculate semantic similarity

After the metrics are extracted, the next step is to complete the preprocessing for the aggregated metrics. The preprocessing step checks the semantic similarity between the query texts and groups them together to show the total counts for the similar queries. For example, if there are three queries of “What is S3” and two queries of “What is the purpose of S3,” it will group them together and show that there are five queries of “What is S3” or “What is the purpose of S3.”

To calculate semantic similarity, we use a model from the Hugging Face model library. Hugging Face is a popular open-source platform that provides a wide range of NLP models, including transformers, which have been trained on a variety of NLP tasks. These models can be easily integrated with SageMaker and take advantage of its rich training and deployment options. The Hugging Face Deep Learning Containers (DLCs), which comes pre-packaged with the necessary libraries, make it easy to deploy the model in SageMaker with just few lines of code. In our use case, we first get the vector embedding using the Hugging Face pre-trained model flax-sentence-embeddings/all_datasets_v4_MiniLM-L6, and then use cosine similarity to calculate the similarity score between the vector embeddings.

To get the vector embedding from the Hugging Face model, we create a serverless endpoint in SageMaker. Serverless endpoints help save cost because you only pay for the amount of time the inference runs. To create a serverless endpoint, you first define the max concurrent invocations for a single endpoint, known as MaxConcurrency, and the memory size. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. SageMaker Serverless Inference auto-assigns compute resources proportional to the memory you select.

We also need to pad one of the vectors with zeros so that the size of the two vectors matches with each other and we can calculate the cosine similarity as a dot product of the two vectors. We can set a threshold for cosine similarity (for example, 0.6) and if the similarity score is more than the threshold, we can group the queries together. After the queries are grouped, we can understand the top queries better. We put all this logic in a Lambda function and deploy the function using a container image. The container image contains codes to invoke the SageMaker Serverless Inference endpoints, and necessary Python libraries to run the Lambda function such as NumPy, pandas, and scikit-learn. The following file is an example of the output from the Lambda function: HF_QUERIES_BY_COUNT.csv.

Create a dashboard using QuickSight

After you have collected the metrics and preprocessed the aggregated metrics, you can visualize the data to get the business insights. For this solution, we use QuickSight for the business intelligence (BI) dashboard and Athena as the data source for QuickSight.

QuickSight is a fully managed enterprise-grade BI service that you can use to create analyses and dashboards to deliver easy-to-understand insights. You can choose various types of charts and graphs to deliver the business insights effectively through a QuickSight dashboard. QuickSight connects to your data and combines data from many different sources, such as Amazon S3 and Athena. For our solution, we use Athena as the data source.

Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. You can use Athena queries to create your custom views from data stored in an S3 bucket before visualizing it with QuickSight. This solution uses an AWS Glue crawler to create the AWS Glue Data Catalog for the Athena table from the files in the S3 bucket.

The CloudFormation template runs the first crawler during resource creation. The following screenshot shows the Data Catalog schema.

The following screenshot shows the Athena table sample you will see after the deployment.

Access permission to the AWS Glue databases and tables are managed by AWS Lake Formation. The CloudFormation template already attached the necessary Lake Formation permissions to the generated AWS Identity and Access Management (IAM) user for QuickSight. If you see permission issues with your IAM principal, grant at least the SELECT permission to the AWS Glue tables to your IAM principal in Lake Formation. You can find the AWS Glue database name on the Outputs tab of the CloudFormation stack. For more information, refer to Granting Data Catalog permissions using the named resource method.

We have completed the data preparation step. The last step is to create an analysis and dashboard using QuickSight.

Sign in to the QuickSight console with the QuickSight user that the CloudFormation template generated.
In the navigation pane, choose Datasets.
Choose Dataset.
Choose Athena as the data source.
Enter a name for Data Source name and choose kendrablog for Athena workgroup.
Choose Create data source.
Choose AWSDataCatalog for Catalog and kendra-search-analytics-database for Database, and select one of the tables you want to use for analysis.
Choose Select.
Select Import to SPICE for quicker analytics and choose Edit/Preview data.
Optionally, choose Add data to join additional data.
You can also modify the data schema, such as column name or data type, and join multiple datasets, if needed.
Choose Publish & Visualize to move on to creating visuals.
Choose your visual type and set dimensions to create your visual.
You can optionally configure additional features for the chart using the navigation pane, such as filters, actions, and themes.

The following screenshots show a sample QuickSight dashboard for your reference. “Search Queries group by similar queries” in the screenshot shows how the search queries been consolidated using semantic similarity.

Clean up

Delete the QuickSight resources (dashboard, analysis, and dataset) that you created and infrastructure resources that AWS CloudFormation generated to avoid unwanted charges. You can delete the infrastructure resource and QuickSight user that was created by the stack via the AWS CloudFormation console.

Conclusion

This post showed an end-to-end solution to get business insights from Amazon Kendra. The solution provided the serverless stack to deploy a custom dashboard for Amazon Kendra search analytics metrics using Lambda and QuickSight. We also solved common challenges relating to analyzing similar queries using a SageMaker Hugging Face model. You could further enhance the dashboard by adding more insights such as the key phrases or the named entities in the queries using Amazon Comprehend and displaying those in the dashboard. Please try out the solution and let us know your feedback.

About the Authors

Genta Watanabe is a Senior Technical Account Manager at Amazon Web Services. He spends his time working with strategic automotive customers to help them achieve operational excellence. His areas of interest are machine learning and artificial intelligence. In his spare time, Genta enjoys spending time with his family and traveling.

Abhijit Kalita is a Senior AI/ML Evangelist at Amazon Web Services. He spends his time working with public sector partners in Asia Pacific, enabling them on their AI/ML workloads. He has many years of experience in data analytics, AI, and machine learning across different verticals such as automotive, semiconductor manufacturing, and financial services. His areas of interest are machine learning and artificial intelligence, especially NLP and computer vision. In his spare time, Abhijit enjoys spending time with his family, biking, and playing with his little hamster.

How OCX Cognition reduced ML model development time from weeks to days and model update time from days to real time using AWS Step Functions and Amazon SageMaker

May 25, 2023

by Brian Curry Amazon AWS

This post was co-authored by Brian Curry (Founder and Head of Products at OCX Cognition) and Sandhya MN (Data Science Lead at InfoGain)

OCX Cognition is a San Francisco Bay Area-based startup, offering a commercial B2B software as a service (SaaS) product called Spectrum AI. Spectrum AI is a predictive (generative) CX analytics platform for enterprises. OCX’s solutions are developed in collaboration with Infogain, an AWS Advanced Tier Partner. Infogain works with OCX Cognition as an integrated product team, providing human-centered software engineering services and expertise in software development, microservices, automation, Internet of Things (IoT), and artificial intelligence.

The Spectrum AI platform combines customer attitudes with customers’ operational data and uses machine learning (ML) to generate continuous insight on CX. OCX built Spectrum AI on AWS because AWS offered a wide range of tools, elastic computing, and an ML environment that would keep pace with evolving needs.

In this post, we discuss how OCX Cognition with the support of Infogain and OCX’s AWS account team improved their end customer experience and reduced time to value by automating and orchestrating ML functions that supported Spectrum AI’s CX analytics. Using AWS Step Functions, the AWS Step Functions Data Science SDK for Python, and Amazon SageMaker Experiments, OCX Cognition reduced ML model development time from 6 weeks to 2 weeks and reduced ML model update time from 4 days to near-real time.

Background

The Spectrum AI platform has to produce models tuned for hundreds of different generative CX scores for each customer, and these scores need to be uniquely computed for tens of thousands of active accounts. As time passes and new experiences accumulate, the platform has to update these scores based on new data inputs. After new scores are produced, OCX and Infogain compute the relative impact of each underlying operational metric in the prediction. Amazon SageMaker is a web-based integrated development environment (IDE) that allows you to build, train, and deploy ML models for any use case with fully managed infrastructure, tools, and workflows. With SageMaker, the OCX-Infogain team developed their solution using shared code libraries across individually maintained Jupyter notebooks in Amazon SageMaker Studio.

The problem: Scaling the solution for multiple customers

While the initial R&D proved successful, scaling posed a challenge. OCX and Infogain’s ML development involved multiple steps: feature engineering, model training, prediction, and the generation of analytics. The code for modules resided in multiple notebooks, and running these notebooks was manual, with no orchestration tool in place. For every new customer, the OCX-Infogain team spent 6 weeks per customer on model development time because libraries couldn’t be reused. Due to the amount of time spent on model development, the OCX-Infogain team needed an automated and scalable solution that operated as a singular platform using unique configurations for each of their customers.

The following architecture diagram depicts OCX’s initial ML model development and update processes.

Solution overview

To simplify the ML process, the OCX-Infogain team worked with the AWS account team to develop a custom declarative ML framework to replace all repetitive code. This reduced the need to develop new low-level ML code. New libraries could be reused for multiple customers by configuring the data appropriately for each customer through YAML files.

While this high-level code continues to be developed initially in Studio using Jupyter notebooks, it’s then converted to Python (.py files), and the SageMaker platform is used to build a Docker image with BYO (bring your own) containers. The Docker images are then pushed to Amazon Elastic Container Registry (Amazon ECR) as a preparatory step. Finally, the code is run using Step Functions.

The AWS account team recommended the Step Functions Data Science SDK and SageMaker Experiments to automate feature engineering, model training, and model deployment. The Step Functions Data Science SDK was used to generate the step functions programmatically. The OCX-Infogain team learned how to use features like Parallel and MAP within Step Functions to orchestrate a large number of training and processing jobs in parallel, which reduces the runtime. This was combined with Experiments, which functions as an analytics tool, tracking multiple ML candidates and hyperparameter tuning variations. These built-in analytics allowed the OCX-Infogain team to compare multiple metrics at runtime and identify best-performing models on the fly.

The following architecture diagram shows the MLOps pipeline developed for the model creation cycle.

The Step Functions Data Science SDK is used to analyze and compare multiple model training algorithms. The state machine runs multiple models in parallel, and each model output is logged into Experiments. When model training is complete, the results of multiple experiments are retrieved and compared using the SDK. The following screenshots show how the best performing model is selected for each stage.

The following are the high-level steps of the ML lifecycle:

ML developers push their code into libraries on the Gitlab repository when development in Studio is complete.
AWS CodePipeline is used to check out the appropriate code from the Gitlab repository.
A Docker image is prepared using this code and pushed to Amazon ECR for serverless computing.
Step Functions is used to run steps using Amazon SageMaker Processing jobs. Here, multiple independent tasks are run in parallel:
- Feature engineering is performed, and the features are stored in the feature store.
- Model training is run, with multiple algorithms and several combinations of hyperparameters utilizing the YAML configuration file.
- The training step function is designed to have heavy parallelism. The models for each journey stage are run in parallel. This is depicted in the following diagram.

Model results are then logged in Experiments. The best-performing model is selected and pushed to the model registry.
Predictions are made using the best-performing models for each CX analytic we generate.
Hundreds of analytics are generated and then handed off for publication in a data warehouse hosted on AWS.

Results

With this approach, OCX Cognition has automated and accelerated their ML processing. By replacing labor-intensive manual processes and highly repetitive development burdens, the cost per customer is reduced by over 60%. This also allows OCX to scale their software business by tripling overall capacity and doubling capacity for simultaneous onboarding of customers. OCX’s automating of their ML processing unlocks new potential to grow through customer acquisition. Using SageMaker Experiments to track model training is critical to identifying the best set of models to use and take to production. For their customers, this new solution provides not only an 8% improvement in ML performance, but a 63% improvement in time to value. New customer onboarding and the initial model generation has improved from 6 weeks to 2 weeks. Once built and in place, OCX begins to continuously regenerate the CX analytics as new input data arrives from the customer. These update cycles have improved from 4 days to near-real time

Conclusion

In this post, we showed how OCX Cognition and Infogain utilized Step Functions, the Step Functions Data Science SDK for Python, and Sagemaker Experiments in conjunction with Sagemaker Studio to reduce time to value for the OCX-InfoGain team in developing and updating CX analytics models for their customers.

To get started with these services, refer to Amazon SageMaker, AWS Step Functions Data Science Python SDK, AWS Step Functions, and Manage Machine Learning with Amazon SageMaker Experiments.

About the Authors

Brian Curry is currently a founder and Head of Products at OCX Cognition, where we are building a machine learning platform for customer analytics. Brian has more than a decade of experience leading cloud solutions and design-centered product organizations.

Sandhya M N is part of Infogain and leads the Data Science team for OCX. She is a seasoned software development leader with extensive experience across multiple technologies and industry domains. She is passionate about staying up to date with technology and using it to deliver business value to customers.

Prashanth Ganapathy is a Senior Solutions Architect in the Small Medium Business (SMB) segment at AWS. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them. Outside of work, Prashanth enjoys photography, travel, and trying out different cuisines.

Sabha Parameswaran is a Senior Solutions Architect at AWS with over 20 years of deep experience in enterprise application integration, microservices, containers and distributed systems performance tuning, prototyping, and more. He is based out of the San Francisco Bay Area. At AWS, he is focused on helping customers in their cloud journey and is also actively involved in microservices and serverless-based architecture and frameworks.

Vaishnavi Ganesan is a Solutions Architect at AWS based in the San Francisco Bay Area. She is focused on helping Commercial Segment customers on their cloud journey and is passionate about security in the cloud. Outside of work, Vaishnavi enjoys traveling, hiking, and trying out various coffee roasters.

Ajay Swaminathan is an Account Manager II at AWS. He is an advocate for Commercial Segment customers, providing the right financial, business innovation, and technical resources in accordance with his customers’ goals. Outside of work, Ajay is passionate about skiing, dubstep and drum and bass music, and basketball.

“Building a model that can save as many lives as possible”

May 24, 2023

by Amazon AWS

How ARA recipient Supreeth Shashikumar is using machine learning to help hospitals detect sepsis — before it’s too late.Read More

Dialogue-guided intelligent document processing with foundation models on Amazon SageMaker JumpStart

May 24, 2023

by Alfred Shen Amazon AWS

Intelligent document processing (IDP) is a technology that automates the processing of high volumes of unstructured data, including text, images, and videos. IDP offers a significant improvement over manual methods and legacy optical character recognition (OCR) systems by addressing challenges such as cost, errors, low accuracy, and limited scalability, ultimately leading to better outcomes for organizations and stakeholders.

Natural language processing (NLP) is one of the recent developments in IDP that has improved accuracy and user experience. However, despite these advances, there are still challenges to overcome. For instance, many IDP systems are not user-friendly or intuitive enough for easy adoption by users. Additionally, several existing solutions lack the capability to adapt to changes in data sources, regulations, and user requirements through continuous improvement and updates.

Enhancing IDP through dialogue involves incorporating dialogue capabilities into IDP systems. By enabling users to interact with IDP systems in a more natural and intuitive way, through multi-round dialogue by adjusting inaccurate information or adding missing information aided with task automation, these systems can become more efficient, accurate, and user-friendly.

In this post, we explore an innovative approach to IDP that utilizes a dialogue-guided query solution using Amazon Foundation Models and SageMaker JumpStart.

Solution overview

This innovative solution combines OCR for information extraction, a local deployed large language model (LLM) for dialogue and autonomous tasking, VectorDB for embedding subtasks, and LangChain-based task automation for integration with external data sources to transform the way businesses process and analyze document contexts. By harnessing generative AI technologies, organizations can streamline IDP workflows, enhance user experience, and boost overall efficiency.

The following video highlights the dialogue-guided IDP system by processing an article authored by the Federal Reserve Board of Governors, discussing the collapse of Silicon Valley Bank in March 2023.

The system is capable of processing images, large PDF, and documents in other format and answering questions derived from the content via interactive text or voice inputs. If a user needs to inquire beyond the document’s context, the dialogue-guided IDP can create a chain of tasks from the text prompt and then reference external and up-to-date data sources for relevant answers. Additionally, it supports multi-round conversations and accommodates multilingual exchanges, all managed through dialogue.

Deploy your own LLM using Amazon foundation models

One of the most promising developments in generative AI is the integration of LLMs into dialogue systems, opening up new avenues for more intuitive and meaningful exchanges. An LLM is a type of AI model designed to understand and generate human-like text. These models are trained on massive amounts of data and consist of billions of parameters, allowing them to perform various language-related tasks with high accuracy. This transformative approach facilitates a more natural and productive interaction, bridging the gap between human intuition and machine intelligence. A key advantage of local LLM deployment lies in its ability to enhance data security without submitting data outside to third-party APIs. Moreover, you can fine-tune your chosen LLM with domain-specific data, resulting in a more accurate, context-aware, and natural language understanding experience.

The Jurassic-2 series from AI21 Labs, which are based on the instruct-tuned 178-billion-parameter Jurassic-1 LLM, are integral parts of the Amazon foundation models available through Amazon Bedrock. The Jurassic-2 instruct was specifically trained to manage prompts that are instructions only, known as zero-shot, without the need for examples, or few-shot. This method provides the most intuitive interaction with LLMs, and it’s the best approach to understand the ideal output for your task without requiring any examples. You can efficiently deploy the pre-trained J2-jumbo-instruct, or other Jurassic-2 models available on AWS Marketplace, into your own own virtual private cloud (VPC) using Amazon SageMaker. See the following code:

import ai21, sagemaker

# Define endpoint name
endpoint_name = "sagemaker-soln-j2-jumbo-instruct"
# Define real-time inference instance type. You can also choose g5.48xlarge or p4de.24xlarge instance types
# Please request P instance quota increase via <a href="https://console.aws.amazon.com/servicequotas/home" target="_blank" rel="noopener">Service Quotas console</a> or your account manager
real_time_inference_instance_type = ("ml.p4d.24xlarge")

# Create a Sgaemkaer endpoint then deploy a pre-trained J2-jumbo-instruct-v1 model from AWS Market Place.
model_package_arn = "arn:aws:sagemaker:us-east-1:865070037744:model-package/j2-jumbo-instruct-v1-0-20-8b2be365d1883a15b7d78da7217cdeab"
model = ModelPackage(
role=sagemaker.get_execution_role(),
model_package_arn=model_package_arn,
sagemaker_session=sagemaker.Session()
)

# Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type,
endpoint_name=endpoint_name,
model_data_download_timeout=3600,
container_startup_health_check_timeout=600,
)

After the endpoint has been successfully deployed within your own VPC, you can initiate an inference task to verify that the deployed LLM is functioning as anticipated:

response_jumbo_instruct = ai21.Completion.execute(
sm_endpoint=endpoint_name,
prompt="Explain deep learning algorithms to 8th graders",
numResults=1,
maxTokens=100,
temperature=0.01 #subject to reduce “hallucination” by using common words.
)

Document processing, embedding, and indexing

We delve into the process of building an efficient and effective search index, which forms the foundation for intelligent and responsive dialogues to guide document processing. To begin, we convert documents from various formats into text content using OCR and Amazon Textract. We then read this content and fragment it into smaller pieces, ideally around the size of a sentence each. This granular approach allows for more precise and relevant search results, because it enables better matching of queries against individual segments of a page rather than the entire document. To further enhance the process, we use embeddings such as the sentence transformers library from Hugging Face, which generates vector representations (encoding) of each sentence. These vectors serve as a compact and meaningful representation of the original text, enabling efficient and accurate semantic matching functionality. Finally, we store these vectors in a vector database for similarity search. This combination of techniques lays the groundwork for a novel document processing framework that delivers accurate and intuitive results for users. The following diagram illustrates this workflow.

OCR serves as a crucial element in the solution, allowing for the retrieval of text from scanned documents or pictures. We can use Amazon Textract for extracting text from PDF or image files. This managed OCR service is capable of identifying and examining text in multi-page documents, including those in PDF, JPEG or TIFF formats, such as invoices and receipts. The processing of multi-page documents occurs asynchronously, making it advantageous for handling extensive, multi-page documents. See the following code:

def pdf_2_text(input_pdf_file, history):
history = history or []
key = 'input-pdf-files/{}'.format(os.path.basename(input_pdf_file.name))
try:
response = s3_client.upload_file(input_pdf_file.name, default_bucket_name, key)
except ClientError as e:
print("Error uploading file to S3:", e)
s3_object = {'Bucket': default_bucket_name, 'Name': key}
response = textract_client.start_document_analysis(
DocumentLocation={'S3Object': s3_object},
FeatureTypes=['TABLES', 'FORMS']
)
job_id = response['JobId']
while True:
response = textract_client.get_document_analysis(JobId=job_id)
status = response['JobStatus']
if status in ['SUCCEEDED', 'FAILED']:
break
time.sleep(5)

if status == 'SUCCEEDED':
with open(output_file, 'w') as output_file_io:
for block in response['Blocks']:
if block['BlockType'] in ['LINE', 'WORD']:
output_file_io.write(block['Text'] + 'n')
with open(output_file, "r") as file:
first_512_chars = file.read(512).replace("n", "").replace("r", "").replace("[", "").replace("]", "") + " [...]"
history.append(("Document conversion", first_512_chars))
return history, history

When dealing with large documents, it’s crucial to break them down into more manageable pieces for easier processing. In the case of LangChain, this means dividing each document into smaller segments, such as 1,000 tokens per chunk with an overlap of 100 tokens. To achieve this smoothly, LangChain utilizes specialized splitters designed specifically for this purpose:

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
separator = 'n'
overlap_count = 100. # overlap count between the splits
chunk_size = 1000 # Use a fixed split unit size
loader = TextLoader(output_file)
documents = loader.load()
text_splitter = CharacterTextSplitter(separator=separator, chunk_overlap=overlap_count, chunk_size=chunk_size, length_function=len)
texts = text_splitter.split_documents(documents)

The duration needed for embedding can fluctuate based on the size of the document; for example, it could take roughly 10 minutes to finish. Although this time frame may not be substantial when dealing with a single document, the ramifications become more notable when indexing hundreds of gigabytes as opposed to just hundreds of megabytes. To expedite the embedding process, you can implement sharding, which enables parallelization and consequently enhances efficiency:

from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import numpy as np
import ray
from embeddings import LocalHuggingFaceEmbeddings

# Define number of splits
db_shards = 10

loader = TextLoader(output_file)
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 1000,
chunk_overlap  = 100,
length_function = len,
)

@ray.remote()
def process_shard(shard):
embeddings = LocalHuggingFaceEmbeddings('multi-qa-mpnet-base-dot-v1')
result = Chroma.from_documents(shard, embeddings)
return result

# Read the doc content and split them into chunks.
chunks = text_splitter.create_documents([doc.page_content for doc in documents], metadatas=[doc.metadata for doc in documents])
# Embed the doc chunks into vectors.
shards = np.array_split(chunks, db_shards)
futures = [process_shard.remote(shards[i]) for i in range(db_shards)]
texts = ray.get(futures)

Now that we have obtained the smaller segments, we can continue to represent them as vectors through embeddings. Embeddings, a technique in NLP, generate vector representations of text prompts. The Embedding class serves as a unified interface for interacting with various embedding providers, such as SageMaker, Cohere, Hugging Face, and OpenAI, which streamlines the process across different platforms. These embeddings are numeric portrayals of ideas transformed into number sequences, allowing computers to effortlessly comprehend the connections between these ideas. See the following code:

# Choose a SageMaker deployed local LLM endpoint for embedding
llm_embeddings = SagemakerEndpointEmbeddings(
endpoint_name=<endpoint_name>,
region_name=<region>,
content_handler=content_handler
)

After creating the embeddings, we need to utilize a vectorstore to store the vectors. Vectorstores like Chroma are specially engineered to construct indexes for quick searches in high-dimensional spaces later on, making them perfectly suited for our objectives. As an alternative, you can use FAISS, an open-source vector clustering solution for storing vectors. See the following code:

from langchain.vectorstores import Chroma
# Store vectors in Chroma vectorDB
docsearch_chroma = Chroma.from_documents(texts, llm_embeddings)
# Alternatively you can choose FAISS vectorstore
from langchain.vectorstores import FAISS
docsearch_faiss = FAISS.from_documents(texts, llm_embeddings)

You can also use Amazon Kendra to index enterprise content and produce precise answers. As a fully managed service, Amazon Kendra offers ready-to-use semantic search features for advanced document and passage ranking. With the high-accuracy search in Amazon Kendra, you can obtain the most pertinent content and documents to optimize the quality of your payload. This results in superior LLM responses compared to traditional or keyword-focused search methods. For more information, refer to Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models.

Interactive multilingual voice input

Incorporating interactive voice input into document search offers a myriad of advantages that enhance the user experience. By enabling users to verbally articulate search terms, document search becomes more natural and intuitive, making it simpler and quicker for users to find the information they need. Voice input can bolster the precision of search results, because spoken search terms are less susceptible to spelling or grammatical errors. Interactive voice input renders document search more inclusive, catering to a broader spectrum of users with different language speakers and culture background.

The Amazon Transcribe Streaming SDK enables you to perform audio-to-speech recognition by integrating directly with Amazon Transcribe simply with a stream of audio bytes and a basic handler. As an alternative, you can deploy the whisper-large model locally from Hugging Face using SageMaker, which offers improved data security and better performance. For details, refer to the sample notebook published on the GitHub repo.

# Choose ASR using a locally deployed Whisper-large model from Hugging Face
image = sagemaker.image_uris.retrieve(
framework='pytorch',
region=region,
image_scope='inference',
version='1.12',
instance_type='ml.g4dn.xlarge',
)

model_name = f'sagemaker-soln-whisper-model-{int(time.time())}'
whisper_model_sm = sagemaker.model.Model(
model_data=model_uri,
image_uri=image,
role=sagemaker.get_execution_role(),
entry_point="inference.py",
source_dir='src',
name=model_name,
)

# Audio transcribe
transcribe = whisper_endpoint.predict(audio.numpy())

The above demonstration video shows how voice commands, in conjunction with text input, can facilitate the task of document summarization through interactive conversation.

Guiding NLP tasks through multi-round conversations

Memory in language models maintains a concept of state throughout a user’s interactions. This involves processing a sequence of chat messages to extract and transform knowledge. Memory types vary, but each can be understood using standalone functions and within a chain. Memory can return multiple data points, such as recent messages or message summaries, in the form of strings or lists. This post focuses on the simplest memory form, buffer memory, which stores all prior messages, and demonstrates its usage with modular utility functions and chains.

The LangChain’s ChatMessageHistory class is a crucial utility for memory modules, providing convenient methods to save and retrieve human and AI messages by remembering all previous chat interactions. It’s ideal for managing memory externally from a chain. The following code is an example of applying a simple concept in a chain by introducing ConversationBufferMemory, a wrapper for ChatMessageHistory. This wrapper extracts messages into a variable, allowing them to be represented as a string:

from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(return_messages=True)

LangChain works with many popular LLM providers such as AI21 Labs, OpenAI, Cohere, Hugging Face, and more. For this example, we use a locally deployed AI21 Labs’ Jurassic-2 LLM wrapper using SageMaker. AI21 Studio also provides API access to Jurassic-2 LLMs.

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from langchain.chains.question_answering import load_qa_chain

prompt= PromptTemplate(
template=prompt_template, input_variables=["context", "question"]
)

class ContentHandler(ContentHandlerBase):
content_type = "application/json"
accepts = "application/json"
def transform_input(self, prompt: str, model_kwargs: Dict) -- bytes:
input_str = json.dumps({prompt: prompt, **model_kwargs})
return input_str.encode('utf-8')

def transform_output(self, output: bytes) -- str:
response_json = json.loads(output.read().decode("utf-8"))
return response_json[0]["generated_text"]
content_handler = ContentHandler()
llm_ai21=SagemakerEndpoint(
endpoint_name=endpoint_name,
credentials_profile_name=f'aws-credentials-profile-name',
region_name="us-east-1",
model_kwargs={"temperature":0},
content_handler=content_handler)

qa_chain = VectorDBQA.from_chain_type(
llm=llm_ai21,
chain_type='stuff',
vectorstore=docsearch,
verbose=True,
memory=ConversationBufferMemory(return_messages=True)
)

response = qa_chain(
{'query': query_input},
return_only_outputs=True
)

In the event that the process is unable to locate an appropriate response from the original documents in response to a user’s inquiry, the integration of a third-party URL or ideally a task-driven autonomous agent with external data sources significantly enhances the system’s ability to access a vast array of information, ultimately improving context and providing more accurate and current results.

With AI21’s preconfigured Summarize run method, a query can access a predetermined URL, condense its content, and then carry out question and answer tasks based on the summarized information:

# Call AI21 API to query the context of a specific URL for Q&A
ai21.api_key = "<YOUR_API_KEY>"
url_external_source = "<your_source_url>"
response_url = ai21.Summarize.execute(
source=url_external_source,
sourceType="URL" )
context = "<concate_document_and_response_url>"
question = "<query>"
response = ai21.Answer.execute(
context=context,
question=question,
sm_endpoint=endpoint_name,
maxTokens=100,
)

For additional details and code examples, refer to the LangChain LLM integration document as well as the task-specific API documents provided by AI21.

Task automation using BabyAGI

The task automation mechanism allows the system to process complex queries and generate relevant responses, which greatly improves the validity and authenticity of document processing. LangCain’s BabyAGI is a powerful AI-powered task management system that can autonomously create, prioritize, and run tasks. One of the key features is its ability to interface with external sources of information, such as the web, databases, and APIs. One way to use this feature is to integrate BabyAGI with Serpapi, a search engine API that provides access to search engines. This integration allows BabyAGI to search the web for information related to tasks, allowing BabyAGI to access a wealth of information beyond the input documents.

BabyAGI’s autonomous tasking capacity is fueled by an LLM, a vector search database, an API wrapper to external links, and the LangChain framework, allowing it to run a broad spectrum of tasks across various domains. This enables the system to proactively carry out tasks based on user interactions, streamlining the document processing pipeline that incorporates external sources and creating a more efficient, smooth experience. The following diagram illustrates the task automation process.

This process includes the following components:

Memory – The memory stores all the information that BabyAGI needs to complete its tasks. This includes the task itself, as well as any intermediate results or data that BabyAGI has generated.
Execution agent – The execution agent is responsible for carrying out the tasks that are stored in the memory. It does this by accessing the memory, retrieving the relevant information, and then taking the necessary steps to complete the task.
Task creation agent – The task creation agent is responsible for generating new tasks for BabyAGI to complete. It does this by analyzing the current state of the memory and identifying any gaps in knowledge or understanding. When a gap has been identified, the task creation agent generates a new task that will help BabyAGI fill that gap.
Task queue – The task queue is a list of all of the tasks that BabyAGI has been assigned. The tasks are added to the queue in the order in which they were received.
Task prioritization agent – The task prioritization agent is responsible for determining the order in which BabyAGI should complete its tasks. It does this by analyzing the tasks in the queue and identifying the ones that are most important or urgent. The tasks that are most important are placed at the front of the queue, and the tasks that are least important are placed at the back of the queue.

See the following code:

from babyagi import BabyAGI
from langchain.docstore import InMemoryDocstore
import faiss
# Set temperatur=0 to generate the most frequent words, instead of more “poetically free” behavior.
new_query = """
What happened to the First Republic Bank? Will the FED take the same action as it did on SVB's failure?
"""
# Enable verbose logging and use a fixed embedding size.
verbose = True
embedding_size = 1536

# Using FAISS vector cluster for vectore store
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(llm_embeddings.embed_query, index, InMemoryDocstore({}), {})

# Choose 1 iteration for demo and 1>N>10 for real. If None, it will loop indefinitely
max_iterations: Optional[int] = 2

# Call bayagi class for task automation
baby_agi = BabyAGI.from_llm(
llm=llm_embedding, vectorstore=vectorstore, verbose=verbose, max_iterations=max_iterations<br />)

response = baby_agi({"objective": new_query})

Let’s examine the tasks gathered and their outcomes from a single iteration, used for demonstration purposes, to accomplish the objective in response to the user’s inquiry. BabyAGI operates through a continuous cycle of the following steps:

A task creation agent formulates a new task.
The new task is incorporated into the task queue.
The task prioritization agent establishes the sequence in which tasks should be tackled.
The run agent accomplishes the task.
The task outcome is saved in the memory.
The cycle repeats.

This iterative process enables BabyAGI to learn and develop over time. As it successfully completes an increasing number of tasks, BabyAGI’s knowledge and comprehension of the world expand. Consequently, it can generate and efficiently run more intricate and advanced tasks. By augmenting the iteration count, you can enhance the outcome through the reorganization of the task list, which focuses on the goal and the outcome of preceding tasks, ultimately yielding more pertinent and precise responses.

See the following example output:

*****TASK LIST*****
1: Make a todo list
*****NEXT TASK*****
1: Make a todo list
> Entering new AgentExecutor chain...
Thought: I need to research the First Republic Bank and the FED's action
Action: Search
Action Input: What happened to the First Republic Bank? Will the FED take the same 
action as it did on SVB's failure?
Observation: First Republic's failure was due to a run on deposits following the 
collapse of Silicon Valley Bank and Signature Bank. JPMorgan Chase acquired
First Republic Bank on May 1, 2023. If you had insured or uninsured money in First Republic accounts, your funds are safe and now managed by JPMorgan Chase.
Thought: I need to research the FED's action on SVB's failure
Action: Search
Action Input: What action did the FED take on SVB's failure?
Observation: The Hutchins Center explains how the Federal Reserve has responded to the 
March 2023 failures of Silicon Valley Bank and Signature Bank.
Thought: I now know the final answer
Final Answer: The FED responded to the March 2023 failures of Silicon Valley Bank and <br />Signature Bank by providing liquidity to the banking system. JPMorgan 
Chase acquired First Republic Bank on May 1, 2023, and if you had insured 
or uninsured money in First Republic accounts, your funds are safe and 
now managed by JPMorgan Chase.
> Finished chain.
*****TASK RESULT*****
The Federal Reserve responded to the March 2023 failures of Silicon Valley Bank and Signature Bank by providing liquidity to the banking system. It is unclear what action the FED will take in response to the failure of First Republic Bank.

***TASK LIST***

2: Research the timeline of First Republic Bank's failure.
3: Analyze the Federal Reserve's response to the failure of Silicon Valley Bank and Signature Bank.
4: Compare the Federal Reserve's response to the failure of Silicon Valley Bank and Signature Bank to the Federal Reserve's response to the failure of First Republic Bank.
5: Investigate the potential implications of the Federal Reserve's response to the failure of First Republic Bank.
6: Identify any potential risks associated with the Federal Reserve's response to the failure of First Republic Bank.<br />*****NEXT TASK*****

2: Research the timeline of First Republic Bank's failure.

> Entering new AgentExecutor chain...
Will the FED take the same action as it did on SVB's failure?
Thought: I should search for information about the timeline of First Republic Bank's failure and the FED's action on SVB's failure.
Action: Search
Action Input: Timeline of First Republic Bank's failure and FED's action on SVB's failure
Observation: March 20: The FDIC decides to break up SVB and hold two separate auctions for its traditional deposits unit and its private bank after failing ...
Thought: I should look for more information about the FED's action on SVB's failure.
Action: Search
Action Input: FED's action on SVB's failure
Observation: The Fed blamed failures on mismanagement and supervisory missteps, compounded by a dose of social media frenzy.
Thought: I now know the final answer.
Final Answer: The FED is likely to take similar action on First Republic Bank's failure as it did on SVB's failure, which was to break up the bank and hold two separate auctions for its traditional deposits unit and its private bank.</p><p>&gt; Finished chain.

*****TASK RESULT*****
The FED responded to the March 2023 failures of ilicon Valley Bank and Signature Bank 
by providing liquidity to the banking system. JPMorgan Chase acquired First Republic 
Bank on May 1, 2023, and if you had insured or uninsured money in First Republic 
accounts, your funds are safe and now managed by JPMorgan Chase.*****TASK ENDING*****

With BabyAGI for task automation, the dialogue-guided IDP system showcased its effectiveness by going beyond the original document’s context to address the user’s query about the Federal Reserve’s potential actions concerning the First Republic Bank’s failure, which occurred in late April 2023, 1 month after the sample publication, in comparison to SVB’s failure. To achieve this, the system generated a to-do list and completed tasks sequentially. It investigated the circumstances surrounding the First Republic Bank’s failure, pinpointed potential risks tied to the Federal Reserve’s response, and compared it to the response to SVB’s failure.

Although BabyAGI remains a work in progress, it carries the promise of revolutionizing machine interactions, inventive thinking, and problem resolution. As BabyAGI’s learning and enhancement persist, it will be capable of producing more precise, insightful, and inventive responses. By empowering machines to learn and evolve autonomously, BabyAGI could facilitate their assistance in a broad spectrum of tasks, ranging from mundane chores to intricate problem-solving.

Constraints and limitations

Dialogue-guided IDP offers a promising approach to enhancing the efficiency and effectiveness of document analysis and extraction. However, we must acknowledge its current constraints and limitations, such as the need for data bias avoidance, hallucination mitigation, the challenge of handling complex and ambiguous language, and difficulties in understanding context or maintaining coherence in longer conversations.

Additionally, it’s important to consider confabulations and hallucinations in AI-generated responses, which may lead to the creation of inaccurate or fabricated information. To address these challenges, ongoing developments are focusing on refining LLMs with better natural language understanding capabilities, incorporating domain-specific knowledge and developing more robust context-aware models. Building an LLM from scratch can be costly and time-consuming; however, you can employ several strategies to improve existing models:

Fine-tuning a pre-trained LLM on specific domains for more accurate and relevant outputs
Integrating external data sources known to be safe during inference for enhanced contextual understanding
Designing better prompts to elicit more precise responses from the model
Using ensemble models to combine outputs from multiple LLMs, averaging out errors and minimizing hallucination chances
Building guardrails to prevent models from veering off into undesired areas while ensuring apps respond with accurate and appropriate information
Conducting supervised fine-tuning with human feedback, iteratively refining the model for increased accuracy and reduced hallucination.

By adopting these approaches, AI-generated responses can be made more reliable and valuable.

The task-driven autonomous agent offers significant potential across various applications, but it is vital to consider key risks before adopting the technology. These risks include:

Data privacy and security breaches due to reliance on the selected LLM provider and vectorDB
Ethical concerns arising from biased or harmful content generation
Dependence on model accuracy, which may lead to ineffective task completion or undesired results
System overload and scalability issues if task generation outpaces completion, requiring proper task sequencing and parallel management
Misinterpretation of task prioritization based on the LLM’s understanding of task importance
The authenticity of the data it received from the web

Addressing these risks is crucial for responsible and successful application, allowing us to maximize the benefits of AI-powered language models while minimizing potential risks.

Conclusions

The dialogue-guided solution for IDP presents a groundbreaking approach to document processing by integrating OCR, automatic speech recognition, LLMs, task automation, and external data sources. This comprehensive solution enables businesses to streamline their document processing workflows, making them more efficient and intuitive. By incorporating these cutting-edge technologies, organizations can not only revolutionize their document management processes, but also bolster decision-making capabilities and considerably boost overall productivity. The solution offers a transformative and innovative means for businesses to unlock the full potential of their document workflows, ultimately driving growth and success in the era of generative AI. Refer to SageMaker Jumpstart for other solutions and Amazon Bedrock for additional generative AI models.

The authors would like to sincerely express their appreciation to Ryan Kilpatrick, Ashish Lal, and Kristine Pearce for their valuable inputs and contributions to this work. They also acknowledge Clay Elmore for the code sample provided on Github.

About the authors

Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, mentoring college students for entrepreneurship, and spending time with friends and families.