Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker

Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker

In the rapidly evolving landscape of artificial intelligence (AI), the rise of generative AI models has ushered in a new era of personalized and intelligent experiences. Organizations are increasingly using the power of these language models to drive innovation and enhance their services, from natural language processing to content generation and beyond.

Using generative AI models in the enterprise environment, however, requires taming their intrinsic power and enhancing their skills to address specific customer needs. In cases where an out-of-the-box model is missing knowledge of domain- or organization-specific terminologies, a custom fine-tuned model, also called a domain-specific large language model (LLM), might be an option for performing standard tasks in that domain or micro-domain. BloombergGPT is an example of LLM that was trained from scratch to have a better understanding of highly specialized vocabulary found in the financial domain. In the same sense, domain specificity can be addressed through fine-tuning at a smaller scale. Customers are fine-tuning generative AI models based on domains including finance, sales, marketing, travel, IT, HR, finance, procurement, healthcare and life sciences, customer service, and many more. Additionally, independent software vendors (ISVs) are building secure, managed, multi-tenant, end-to-end generative AI platforms with models that are customized and personalized based on their customer’s datasets and domains. For example, Forethought introduced SupportGPT, a generative AI platform for customer support.

As the demands for personalized and specialized AI solutions grow, businesses often find themselves grappling with the challenge of efficiently managing and serving a multitude of fine-tuned models across diverse use cases and customer segments. With the need to serve a wide range of AI-powered use cases, from resume parsing and job skill matching, domain-specific to email generation and natural language understanding, these businesses are often left with the daunting task of managing hundreds of fine-tuned models, each tailored to specific customer needs or use cases. The complexities of this challenge are compounded by the inherent scalability and cost-effectiveness concerns that come with deploying and maintaining such a diverse model ecosystem. Traditional approaches to model serving can quickly become unwieldy and resource intensive, leading to increased infrastructure costs, operational overhead, and potential performance bottlenecks.

Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage and switching cost for hosting independent instances for different tasks. LoRA (Low-Rank Adaptation) is an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows for quick task switching when deployed as a service by sharing the vast majority of the model parameters.

In this post, we explore a solution that addresses these challenges head-on using LoRA serving with Amazon SageMaker. By using the new performance optimizations of LoRA techniques in SageMaker large model inference (LMI) containers along with inference components, we demonstrate how organizations can efficiently manage and serve their growing portfolio of fine-tuned models, while optimizing costs and providing seamless performance for their customers.

The latest SageMaker LMI container offers unmerged-LoRA inference, sped up with our LMI-Dist inference engine and OpenAI style chat schema. To learn more about LMI, refer to LMI Starting Guide, LMI handlers Inference API Schema, and Chat Completions API Schema.

New LMI features for serving LoRA adapters at scale on SageMaker

There are two kinds of LoRA that can be put onto various engines:

  • Merged LoRA – This applies the adapter by modifying the base model in place. It has zero added latency while running, but has a cost to apply or unapply the merge. It works best for cases with only a few adapters. It is best for single-adapter batches, and doesn’t support multi-adapter batches.
  • Unmerged LoRA – This alters the model operators to factor in the adapters without changing the base model. It has a higher inference latency for the additional adapter operations. However, it does support multi-adapter batches. It works best for use cases with a large number of adapters.

The new LMI container offers out-of-box integration and abstraction with SageMaker for hosting multiple unmerged LoRA adapters with higher performance (low latency and high throughput) using the vLLM backend LMI-Dist backend that uses vLLM, which in-turn uses S-LORA and Punica. The LMI container offers two backends for serving LoRA adapters: the LMI-Dist backend (recommended) and the vLLM Backend. Both backends are based on the open source vLLM library for serving LoRA adapters, but the LMI-Dist backend provides additional optimized continuous (rolling) batching implementation. You are not required to configure these libraries separately; the LMI container provides the higher-level abstraction through the vLLM and LMI-Dist backends. We recommend you start with the LMI-Dist backend because it has additional performance optimizations related to continuous (rolling) batching.

S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes unified paging. Unified paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead.

Punica is designed to efficiently serve multiple LoRA models on a shared GPU cluster. It achieves this by following three design guidelines:

  • Consolidating multi-tenant LoRA serving workloads to a small number of GPUs to increase overall GPU utilization
  • Enabling batching for different LoRA models to improve performance and GPU utilization
  • Focusing on the decode stage performance, which is the predominant factor in the cost of model serving

Punica uses a new CUDA kernel design called Segmented Gather Matrix-Vector Multiplication (SGMV) to batch GPU operations for concurrent runs of multiple LoRA models, significantly improving GPU efficiency in terms of memory and computation. Punica also implements a scheduler that routes requests to active GPUs and migrates requests for consolidation, optimizing GPU resource allocation. Overall, Punica achieves high throughput and low latency in serving multi-tenant LoRA models on a shared GPU cluster. For more information, read the Punica whitepaper.

The following figure shows the multi LoRA adapter serving stack of the LMI container on SageMaker.

As shown in the preceding figure, the LMI container provides the higher-level abstraction through the vLLM and LMI-Dist backends to serve LoRA adapters at scale on SageMaker. As a result, you’re not required to configure the underlying libraries (S-LORA, Punica, or vLLM) separately. However, there might be cases where you want to control some of the performance driving parameters depending on your use case and application performance requirements. The following are the common configuration options the LMI container provides to tune LoRA serving. For more details on configuration options specific to each backend, refer to vLLM Engine User Guide and LMI-Dist Engine User Guide.

option.enable_lora: This config enables support for LoRA adapters.
option.max_loras: This config determines the maximum number of LoRA adapters that can be run at once. Allocates GPU memory for those number adapters.
option.max_lora_rank: This config determines the maximum rank allowed for a LoRA adapter. Set this value to maximum rank of your adapters. Setting a larger value will enable more adapters at a greater memory usage cost.
option.lora_extra_vocab_size: This config determines the maximum additional vocabulary that can be added through a LoRA adapter.
option.max_cpu_loras: This config determines the maximum number of LoRA adapters to cache in memory. All others will be evicted to disk.

Design patterns for serving fine-tuned LLMs at scale

Enterprises grappling with the complexities of managing generative AI models often encounter scenarios where a robust and flexible design pattern is crucial. One common use case involves a single base model with multiple LoRA adapters, each tailored to specific customer needs or use cases. This approach allows organizations to use a foundational language model while maintaining the agility to fine-tune and deploy customized versions for their diverse customer base.

Single-base model with multiple fine-tuned LoRA adapters

An enterprise offering a resume parsing and job skill matching service may use a single high-performance base model, such as Mistral 7B. The Mistral 7B base model is particularly well-suited for job-related content generation tasks, such as creating personalized job descriptions and tailored email communications. Mistral’s strong performance in natural language generation and its ability to capture industry-specific terminology and writing styles make it a valuable asset for such an enterprise’s customers in the HR and recruitment space. By fine-tuning Mistral 7B with LoRA adapters, enterprises can make sure the generated content aligns with the unique branding, tone, and requirements of each customer, delivering a highly personalized experience.

Multi-base models with multiple fine-tuned LoRA adapters

On the other hand, the same enterprise may use the Llama 3 base model for more general natural language processing tasks, such as resume parsing, skills extraction, and candidate matching. Llama 3’s broad knowledge base and robust language understanding capabilities enable it to handle a wide range of documents and formats, making sure their services can effectively process and analyze candidate information, regardless of the source. By fine-tuning Llama 3 with LoRA adapters, such enterprises can tailor the model’s performance to specific customer requirements, such as regional dialects, industry-specific terminology, or unique data formats. By employing a multi-base model, multi-adapter design pattern, enterprises can take advantage of the unique strengths of each language model to deliver a comprehensive and highly personalized job profile to a candidate resume matching service. This approach allows enterprises to cater to the diverse needs of their customers, making sure each client receives tailored AI-powered solutions that enhance their recruitment and talent management processes.

Effectively implementing and managing these design patterns, where multiple base models are coupled with numerous LoRA adapters, is a key challenge that enterprises must address to unlock the full potential of their generative AI investments. A well-designed and scalable approach to model serving is crucial in delivering cost-effective, high-performance, and personalized experiences to customers.

Solution overview

The following sections outline the coding steps to deploy a base LLM, TheBloke/Llama-2-7B-Chat-fp16, with LoRA adapters on SageMaker. It involves preparing a compressed archive with the base model files and LoRA adapter files, uploading it to Amazon Simple Storage Service (Amazon S3), selecting and configuring the SageMaker LMI container to enable LoRA support, creating a SageMaker endpoint configuration and endpoint, defining an inference component for the model, and sending inference requests specifying different LoRA adapters like Spanish (“es”) and French (“fr”) in the request payload to use those fine-tuned language capabilities. For more information on deploying models using SageMaker inference components, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

To showcase multi-base models with their LoRA adapters, we add another base model, mistralai/Mistral-7B-v0.1, and its LoRA adapter to the same SageMaker endpoint, as shown in the following diagram.

Prerequisites

You need to complete some prerequisites before you can run the notebook:

Upload your LoRA adapters to Amazon S3

To prepare the LoRA adapters, create a adapters.tar.gz compressed archive containing the LoRA adapters directory. The adapters directory should contain subdirectories for each of the LoRA adapters, with each adapter subdirectory containing the adapter_model.bin file (the adapter weights) and the adapter_config.json file (the adapter configuration). We typically obtain these adapter files by using the PeftModel.save_pretrained() method from the Peft library. After you assemble the adapters directory with the adapter files, you compress it into a adapters.tar.gz archive and upload it to an S3 bucket for deployment or sharing. We include the LoRA adapters in the adapters directory as follows:

|- model_dir
    |- adapters/
        |--- <adapter_1>/
        |--- <adapter_2>/
        |--- ...
        |--- <adapter_n>/

Download LoRA adapters, compress them, and upload the compressed file to Amazon S3:

snapshot_download("UnderstandLing/llama-2-7b-chat-es", local_dir="llama-lora-multi-adapter/adapters/es", local_dir_use_symlinks=False)
snapshot_download("UnderstandLing/llama-2-7b-chat-fr", local_dir="llama-lora-multi-adapter/adapters/fr", local_dir_use_symlinks=False)
snapshot_download("UnderstandLing/llama-2-7b-chat-ru", local_dir="llama-lora-multi-adapter/adapters/ru", local_dir_use_symlinks=False)
!tar czvf adapters.tar.gz -C llama-lora-multi-adapter .
s3_code_artifact_accelerate = sess.upload_data("adapters.tar.gz", model_bucket, s3_code_prefix)

Select and LMI container and configure LMI to enable LoRA

SageMaker provides optimized containers for LMI that support different frameworks for model parallelism, allowing the deployment of LLMs across multiple GPUs. For this post, we employ the DeepSpeed container, which encompasses frameworks such as DeepSpeed and vLLM, among others. See the following code:

deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)

env_generation = {"OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "2",
                  "OPTION_ROLLING_BATCH": "lmi-dist",
                  "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
                  "OPTION_DTYPE": "fp16",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

Create a SageMaker endpoint configuration

Create an endpoint configuration using the appropriate instance type. Set ContainerStartupHealthCheckTimeoutInSeconds to account for the time taken to download the LLM weights from Amazon S3 or the model hub, and the time taken to load the model on the GPUs:

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        },
    ],
)

Create a SageMaker endpoint

Create a SageMaker endpoint based on the endpoint configuration defined in the previous step. You use this endpoint for hosting the inference component (model) inference and make invocations.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)

Create a SageMaker inference component (model)

Now that you have created a SageMaker endpoint, let’s create our model as an inference component. The SageMaker inference component enables you to deploy one or more foundation models (FMs) on the same SageMaker endpoint and control how many accelerators and how much memory is reserved for each FM. See the following code:

model_name = sagemaker.utils.name_from_base("lmi-llama2-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env_generation,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)

prefix = sagemaker.utils.unique_name_from_base("lmi-llama2-7b")
inference_component_name = f"{prefix}-inference-component"

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 1200,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Make inference requests using different LoRA adapters

With the endpoint and inference model ready, you can now send requests to the endpoint using the LoRA adapters you fine-tuned for Spanish and French languages. The specific LoRA adapter is specified in the request payload under the "adapters" field. We use "es" for the Spanish language adapter and "fr" for the French language adapter, as shown in the following code:

# Testing Spanish (es) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Piensa en una excusa creativa para decir que no necesito ir a la fiesta."],
                     "adapters": ["es"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

# Testing French (fr) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

# Testing Russian (ru) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Придумайте креативное "],
                     "parameters": params,
                     "adapters": ["ru"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Add another base model and inference component and its LoRA adapter

Let’s add another base model and its LoRA adapter to the same SageMaker endpoint for multi-base models with multiple fine-tuned LoRA adapters. The code is very similar to the previous code for creating the Llama base model and its LoRA adapter.

Configure the SageMaker LMI container to host the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)

my_hf_token = "<YOUR_HuggingFacePersonalAccessToken_HERE>"

env_generation = {"HF_TOKEN": my_hf_token,
                  "OPTION_MODEL_ID": "mistralai/Mistral-7B-v0.1",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "2",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

Create a new SageMaker model and inference component for the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

model_name2 = sagemaker.utils.name_from_base("lmi-mistral-7b")

create_model_response = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name2,
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Invoke the same SageMaker endpoint for the newly created inference component for the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

# Testing French (fr) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Clean up

Delete the SageMaker inference components, models, endpoint configuration, and endpoint to avoid incurring unnecessary costs:

sm_client.delete_inference_component(InferenceComponentName=inference_component_name)
sm_client.delete_inference_component(InferenceComponentName=inference_component_name2)
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
sm_client.delete_model(ModelName=model_name2)

Conclusion

The ability to efficiently manage and serve a diverse portfolio of fine-tuned generative AI models is paramount if you want your organization to deliver personalized and intelligent experiences at scale in today’s rapidly evolving AI landscape. With the inference capabilities of SageMaker LMI coupled with the performance optimizations of LoRA techniques, you can overcome the challenges of multi-tenant fine-tuned LLM serving. This solution enables you to consolidate AI workloads, batch operations across multiple models, and optimize resource utilization for cost-effective, high-performance delivery of tailored AI solutions to your customers. As demand for specialized AI experiences continues to grow, we’ve shown how the scalable infrastructure and cutting-edge model serving techniques of SageMaker position AWS as a powerful platform for unlocking generative AI’s full potential. To start exploring the benefits of this solution for yourself, we encourage you to use the code example and resources we’ve provided in this post.


About the authors

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Dhawal PatelDhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Vivek Gangasani is a AI/ML Startup Solutions Architect for Generative AI startups at AWS. He helps emerging GenAI startups build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Read More

Mixtral 8x22B is now available in Amazon SageMaker JumpStart

Mixtral 8x22B is now available in Amazon SageMaker JumpStart

Today, we are excited to announce the Mixtral-8x22B large language model (LLM), developed by Mistral AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Mixtral-8x22B model.

What is Mixtral 8x22B

Mixtral 8x22B is Mistral AI’s latest open-weights model and sets a new standard for performance and efficiency of available foundation models, as measured by Mistral AI across standard industry benchmarks. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39 billion active parameters out of 141 billion, offering cost-efficiency for its size. Continuing with Mistral AI’s belief in the power of publicly available models and broad distribution to promote innovation and collaboration, Mixtral 8x22B is released under Apache 2.0, making the model available for exploring, testing, and deploying. Mixtral 8x22B is an attractive option for customers selecting between publicly available models and prioritizing quality, and for those wanting a higher quality from mid-sized models, such as Mixtral 8x7B and GPT 3.5 Turbo, while maintaining high throughput.

Mixtral 8x22B provides the following strengths:

  • Multilingual native capabilities in English, French, Italian, German, and Spanish languages
  • Strong mathematics and coding capabilities
  • Capable of function calling that enables application development and tech stack modernization at scale
  • 64,000-token context window that allows precise information recall from large documents

About Mistral AI

Mistral AI is a Paris-based company founded by seasoned researchers from Meta and Google DeepMind. During his time at DeepMind, Arthur Mensch (Mistral CEO) was a lead contributor on key LLM projects such as Flamingo and Chinchilla, while Guillaume Lample (Mistral Chief Scientist) and Timothée Lacroix (Mistral CTO) led the development of LLaMa LLMs during their time at Meta. The trio are part of a new breed of founders who combine deep technical expertise and operating experience working on state-of-the-art ML technology at the largest research labs. Mistral AI has championed small foundational models with superior performance and commitment to model development. They continue to push the frontier of artificial intelligence (AI) and make it accessible to everyone with models that offer unmatched cost-efficiency for their respective sizes, delivering an attractive performance-to-cost ratio. Mixtral 8x22B is a natural continuation of Mistral AI’s family of publicly available models that include Mistral 7B and Mixtral 8x7B, also available on SageMaker JumpStart. More recently, Mistral launched commercial enterprise-grade models, with Mistral Large delivering top-tier performance and outperforming other popular models with native proficiency across multiple languages.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment. You can now discover and deploy Mixtral-8x22B with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, providing data encryption at rest and in-transit.

SageMaker also adheres to standard security frameworks such as ISO27001 and SOC1/2/3 in addition to complying with various regulatory requirements. Compliance frameworks like General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), and Payment Card Industry Data Security Standard (PCI DSS) are supported to make sure data handling, storing, and process meet stringent security standards.

SageMaker JumpStart availability is dependent on the model; Mixtral-8x22B v0.1 is currently supported in the US East (N. Virginia) and US West (Oregon) AWS Regions.

Discover models

You can access Mixtral-8x22B foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

From the SageMaker JumpStart landing page, you can search for “Mixtral” in the search box. You will see search results showing Mixtral 8x22B Instruct, various Mixtral 8x7B models, and Dolphin 2.5 and 2.7 models.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find the Deploy button, which you can use to deploy the model and create an endpoint.

SageMaker has seamless logging, monitoring, and auditing enabled for deployed models with native integrations with services like AWS CloudTrail for logging and monitoring to provide insights into API calls and Amazon CloudWatch to collect metrics, logs, and event data to provide information into the model’s resource utilization.

Deploy a model

Deployment starts when you choose Deploy. After deployment finishes, an endpoint has been created. You can test the endpoint by passing a sample inference request payload or selecting your testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in your preferred notebook editor in SageMaker Studio. This will require an AWS Identity and Access Management (IAM) role and policy attached to it to restrict model access. Additionally, if you choose to deploy the model endpoint within SageMaker Studio, you will be prompted to choose an instance type, initial instance count, and maximum instance count. The ml.p4d.24xlarge and ml.p4de.24xlarge instance types are the only instance types currently supported for Mixtral 8x22B Instruct v0.1.

To deploy using the SDK, we start by selecting the Mixtral-8x22b model, specified by the model_id with value huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1. You can deploy any of the selected models on SageMaker with the following code. Similarly, you can deploy Mixtral-8x22B instruct using its own model ID.

from sagemaker.jumpstart.model import JumpStartModel model = JumpStartModel(model_id=""huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1") predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel.

After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {"inputs": "Hello!"} 
predictor.predict(payload)

Example prompts

You can interact with a Mixtral-8x22B model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide example prompts.

Mixtral-8x22b Instruct

The instruction-tuned version of Mixtral-8x22B accepts formatted instructions where conversation roles must start with a user prompt and alternate between user instruction and assistant (model answer). The instruction format must be strictly respected, otherwise the model will generate sub-optimal outputs. The template used to build a prompt for the Instruct model is defined as follows:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]]

<s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS), whereas [INST] and [/INST] are regular strings.

The following code shows how you can format the prompt in instruction format:

from typing import Dict, List

def format_instructions(instructions: List[Dict[str, str]]) -> List[str]:
    """Format instructions where conversation roles must alternate user/assistant/user/assistant/..."""
    prompt: List[str] = []
    for user, answer in zip(instructions[::2], instructions[1::2]):
        prompt.extend(["<s>", "[INST] ", (user["content"]).strip(), " [/INST] ", (answer["content"]).strip(), "</s>"])
    prompt.extend(["<s>", "[INST] ", (instructions[-1]["content"]).strip(), " [/INST] ","</s>"])
    return "".join(prompt)


def print_instructions(prompt: str, response: str) -> None:
    bold, unbold = '33[1m', '33[0m'
    print(f"{bold}> Input{unbold}n{prompt}nn{bold}> Output{unbold}n{response[0]['generated_text']}n")

Summarization prompt

You can use the following code to get a response for a summarization:

instructions = [{"role": "user", "content": """Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression - To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.
"""}]
prompt = format_instructions(instructions)
payload = {
"inputs": prompt,
"parameters": {"max_new_tokens": 1500}
}
response=predictor.predict(payload)
print_instructions(prompt, response)

The following is an example of the expected output:

> > Input
<s>[INST] Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression - To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction. [/INST] </s>
> Output
<s>[INST] Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression - To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction. [/INST] </s>Contextual compression is a technique used to manage the issue of context overflow in information retrieval. This method involves compressing and filtering the retrieved documents to align with the query's context, ensuring that only relevant information is processed. This is achieved through a two-step process: a base retriever fetches the initial set of documents, and a document compressor refines these documents by either reducing their content or excluding them entirely based on relevance.

The contextual compression retriever significantly improves the efficiency of Retrieval-Augmented Generation (RAG) applications by extracting and utilizing only the essential information from a large pool of data. This approach addresses the problem of information overload and irrelevant data processing, leading to improved response quality, cost-effective operations, and a smoother retrieval process. In essence, contextual compression acts as a filter that tailors the information to the specific query, making it an indispensable tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.

Multilingual translation prompt

You can use the following code to get a response for a multilingual translation:

Prompt

instructions = [{"role": "user", "content": """
<You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]

1.	Customer: "I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?"
2.	Customer: "I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop's performance isn't as expected. Can I have a technician look into it, or should I consider returning it?"
3.	Customer: "My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?"
4.	Customer: "I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven't received it. Could you assist in determining where my package might be?"
5.	Customer: "I'm trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?" 
"""}]
prompt = format_instructions(instructions)
payload = {
"inputs": prompt,
"parameters": {"max_new_tokens": 2000, "do_sample": True}
}
response=predictor.predict(payload)
print_instructions(prompt, response)

The following is an example of the expected output:

> Input
<s>[INST] <You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]


1. Customer: "I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?"
2. Customer: "I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop's performance isn't as expected. Can I have a technician look into it, or should I consider returning it?"
3. Customer: "My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?"
4. Customer: "I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven't received it. Could you assist in determining where my package might be?"
5. Customer: "I'm trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?" [/INST] </s>
> Output
<s>[INST] <You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]


1. Customer: "I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?"
2. Customer: "I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop's performance isn't as expected. Can I have a technician look into it, or should I consider returning it?"
3. Customer: "My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?"
4. Customer: "I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven't received it. Could you assist in determining where my package might be?"
5. Customer: "I'm trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?" [/INST] </s>

French:

1. Client : "J'ai récemment commandé un casque audio sans fil, mais j'ai reçu un modèle différent. Quelles sont les étapes à suivre pour recevoir le produit correct que j'ai commandé ?"
2. Client : "J'ai acheté un ordinateur portable personnalisable le mois dernier et j'ai opté pour des mises à niveau spécifiques. Cependant, les performances de l'ordinateur portable ne sont pas à la hauteur de mes attentes. Puis-je avoir un technicien qui vérifie cela, ou devrais-je envisager de le retourner ?"
3. Client : "Ma commande pour un sac à main de designer devait inclure un portefeuille assorti dans le cadre d'une offre promotionnelle, mais le portefeuille ne se trouvait pas dans le paquet. Comment puis-je résoudre ce problème ?"
4. Client : "Je vois que les informations de suivi de ma commande de batterie de cuisine en céramique indiquent qu'elle a été livrée, mais je ne l'ai pas reçue. Pourriez-vous m'aider à déterminer où se trouve mon colis ?"
5. Client : "J'essaie d'acheter un miroir antique de votre collection vintage, mais le site continue de me donner une erreur lorsque j'essaie de passer à la caisse. Existe-t-il un autre moyen de finaliser mon achat ?"

German:

1. Kunde: "Ich habe kürzlich ein Set kabelloser Kopfhörer bestellt, aber ich habe ein anderes Modell erhalten. Welche Schritte sollte ich unternehmen, um das richtige Produkt zu erhalten, das ich bestellt habe?"
2. Kunde: "Ich habe letzten Monat einen anpassbaren Laptop gekauft und habe mich für spezifische Upgrades entschieden. Allerdings entspricht die Leistung des Laptops nicht meinen Erwartungen. Kann ich einen Techniker hinzuziehen lassen oder sollte ich eine Rückgabe in Erwägung ziehen?"
3. Kunde: "Meine Bestellung für eine Designer-Handtasche sollte inklusive eines passenden Portemonnaies als Teil einer Werbeaktion sein, aber das Portemonnaie war nicht im Paket. Wie kann dieses Problem gelöst werden?"
4. Kunde: "Ich sehe, dass die Sendungsverfolgungsinformationen für meine Bestellung von Keramik-Kochgeschirr anzeigen, dass es geliefert wurde, aber ich habe es nicht erhalten. Könnten Sie mir dabei helfen, festzustellen, wo sich mein Paket befindet?"
5. Kunde: "Ich versuche ein Antikspiegel aus Ihrer Vintage-Kollektion zu kaufen, aber die Website gibt mir jedes Mal einen Fehler, wenn ich versuche, den Bestellvorgang abzuschließen. Gibt es einen anderen Weg, meinen Kauf abzuschließen?"

Spanish:

1. Cliente: "Recientemente ordené un conjunto de audífonos inalámbricos, pero recibí un modelo diferente. ¿Cuáles son los pasos que debo seguir para recibir el producto correcto que ordené?"
2. Cliente: "Compré una computadora personalizable el mes pasado y opté por actualizaciones específicas. Sin embargo, el rendimiento de la computadora no está a la altura de mis expectativas. ¿Puedo tener un técnico que revise esto, o debería considerar devolverla?"
3. Cliente: "Mi pedido de un bolso de diseñador debería haber incluido una billetera a juego como parte de una oferta promocional, pero la billetera no estaba en el paquete. ¿Cómo puedo resolver este problema?"
4. Cliente: "Veo que la información de seguimiento de mi pedido de utensilios de cocina de cerámica indica que ha sido entregado, pero aún no lo he recibido. ¿Podría ayudarme a determinar dónde se encuentra mi paquete?"
5. Cliente: "Estoy tratando de comprar un espejo antiguo de su colección de vintage, pero el sitio sigue dándome un error cada vez que trato de realizar el pago. ¿Hay otra forma de completar mi compra?"

Code generation

You can use the following code to get a response for code generation:

instructions = [{"role": "user", "content": "Write a program on Dijstra's algorithm using a priority queue in python:"}]
prompt = format_instructions(instructions)
payload = {
"inputs": prompt,
"parameters": {"max_new_tokens": 1000}
}
response=predictor.predict(payload)
print_instructions(prompt, response)

You get the following output:

> Input
<s>[INST] Write a program on Dijstra's algorithm using a priority queue in python: [/INST] </s>

> Output
<s>[INST] Write a program on Dijstra's algorithm using a priority queue in python: [/INST] </s>Here is a Python implementation of Dijkstra's algorithm using a priority queue (heapq module):
```python
import heapq

def dijkstra(graph, start):
    # Initialize the priority queue with the starting vertex and its distance (0)
    priority_queue = [(0, start)]
    # Initialize the distances dictionary with infinite distances for all vertices except the starting vertex
    distances = {vertex: float('inf') for vertex in graph}
    distances[start] = 0
    # Initialize the previous dictionary to keep track of the shortest path
    previous = {vertex: None for vertex in graph}

    while priority_queue:
        # Get the vertex with the smallest distance from the priority queue
        current_distance, current_vertex = heapq.heappop(priority_queue)

        # If the current distance is greater than the stored distance, skip this vertex
        if current_distance > distances[current_vertex]:
            continue

        # Update the distances of the adjacent vertices
        for neighbor, weight in graph[current_vertex].items():
            distance = current_distance + weight
            # If the new distance is smaller than the stored distance, update the distance and previous vertex
            if distance < distances[neighbor]:
                distances[neighbor] = distance
                previous[neighbor] = current_vertex
                # Add the neighbor to the priority queue
                heapq.heappush(priority_queue, (distance, neighbor))

    return distances, previous

# Example usage:
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'A': 1, 'C': 2, 'D': 5},
    'C': {'A': 4, 'B': 2, 'D': 1},
    'D': {'B': 5, 'C': 1}
}

distances, previous = dijkstra(graph, 'A')
print(distances)  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```
In this implementation, the `graph` parameter is a dictionary where the keys are the vertices and the values are dictionaries that map adjacent vertices to their weights. The `start` parameter is the starting vertex for the algorithm. The function returns two dictionaries: `distances` and `previous`. The `distances` dictionary maps each vertex to its shortest distance from the starting vertex, and the `previous` dictionary maps each vertex to its previous vertex in the shortest path.

Reasoning and math

You can use the following code to get a response for reasoning and math:


instructions = [{"role": "user", "content": """Sarah went to a car shop to buy a car from Fred for $20,000 in 2024. She plans to sell the car but it depreciates by 2% each year after she has bought it. She went to a dealer in which that dealer told her that the car has only depreciated by 1.4% each year. After 7 years of using the car, Sarah decides to sell it directly to another person.

How much did Sarah sell the car for and what year is it? Explain the steps before answering. It's ok to make some assumptions as you come to your answer.

"""}]
prompt = format_instructions(instructions)
payload = {
"inputs": prompt,
"parameters": {"max_new_tokens": 2000, "do_sample": True}
}
response=predictor.predict(payload)
print_instructions(prompt, response)

You get the following output:

<s>[INST] Sarah went to a car shop to buy a car from Fred for $20,000 in 2024. She plans to sell the car but it depreciates by 2% each year after she has bought it. She went to a dealer in which that dealer told her that the car has only depreciated by 1.4% each year. After 7 years of using the car, Sarah decides to sell it directly to another person. 

How much did Sarah sell the car for and what year is it? Explain the steps before answering. It's ok to make some assumptions as you come to your answer. [/INST] </s>To solve this problem, we need to calculate the value of the car after 7 years, assuming two different annual depreciation rates: 2% and 1.4%.

First, let's calculate the value of the car after 7 years assuming a 2% annual depreciation rate:

Year 1: $20,000 - ($20,000 * 2%) = $19,600
Year 2: $19,600 - ($19,600 * 2%) = $19,208
Year 3: $19,208 - ($19,208 * 2%) = $18,821.60
Year 4: $18,821.60 - ($18,821.60 * 2%) = $18,442.22
Year 5: $18,442.22 - ($18,442.22 * 2%) = $18,071.62
Year 6: $18,071.62 - ($18,071.62 * 2%) = $17,709.76
Year 7: $17,709.76 - ($17,709.76 * 2%) = $17,356.58

So, assuming a 2% annual depreciation rate, the value of the car after 7 years would be approximately $17,356.58.

Next, let's calculate the value of the car after 7 years assuming a 1.4% annual depreciation rate:

Year 1: $20,000 - ($20,000 * 1.4%) = $19,728
Year 2: $19,728 - ($19,728 * 1.4%) = $19,463.23
Year 3: $19,463.23 - ($19,463.23 * 1.4%) = $19,205.80
Year 4: $19,205.80 - ($19,205.80 * 1.4%) = $18,955.66
Year 5: $18,955.66 - ($18,955.66 * 1.4%) = $18,712.73
Year 6: $18,712.73 - ($18,712.73 * 1.4%) = $18,476.95
Year 7: $18,476.95 - ($18,476.95 * 1.4%) = $18,248.33

So, assuming a 1.4% annual depreciation rate, the value of the car after 7 years would be approximately $18,248.33.

Now, since Sarah sells the car directly to another person, we can assume that she sold it for the higher of the two values. Therefore, she sold the car for $18,248.33.

As for the year in which she sold the car, since we know that Sarah purchased the car in 2024 and used it for 7 years, we can calculate the year as follows:

2024 + 7 = 2031

Therefore, Sarah sold the car in the year 2031.

So, Sarah sold the car for $18,248.33 in the year 2031.

Clean up

After you’re done running the notebook, delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Mixtral-8x22B in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

Now that you are aware of Mistral AI and their Mixtral 8x22B models, we encourage you to deploy an endpoint on SageMaker to perform inference testing and try out responses for yourself. Refer to the following resources for more information:


About the Authors

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers and acquisitions. Marco is based in Seattle, WA, and enjoys writing, reading, exercising, and building applications in his free time.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping application and last mile delivery.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services including model offerings from top tier foundation model providers.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his master’s from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Read More

Building Generative AI prompt chaining workflows with human in the loop

Building Generative AI prompt chaining workflows with human in the loop

Generative AI is a type of artificial intelligence (AI) that can be used to create new content, including conversations, stories, images, videos, and music. Like all AI, generative AI works by using machine learning models—very large models that are pretrained on vast amounts of data called foundation models (FMs). FMs are trained on a broad spectrum of generalized and unlabeled data. They’re capable of performing a wide variety of general tasks with a high degree of accuracy based on input prompts. Large language models (LLMs) are one class of FMs. LLMs are specifically focused on language-based tasks such as summarization, text generation, classification, open-ended conversation, and information extraction.

FMs and LLMs, even though they’re pre-trained, can continue to learn from data inputs or prompts during inference. This means that you can develop comprehensive outputs through carefully curated prompts. A prompt is the information you pass into an LLM to elicit a response. This includes task context, data that you pass to the model, conversation and action history, instructions, and even examples. The process of designing and refining prompts to get specific responses from these models is called prompt engineering.

While LLMs are good at following instructions in the prompt, as a task gets complex, they’re known to drop tasks or perform a task not at the desired accuracy. LLMs can handle complex tasks better when you break them down into smaller subtasks. This technique of breaking down a complex task into subtasks is called prompt chaining. With prompt chaining, you construct a set of smaller subtasks as individual prompts. Together, these subtasks make up the overall complex task. To accomplish the overall task, your application feeds each subtask prompt to the LLM in a pre-defined order or according to a set of rules.

While Generative AI can create highly realistic content, including text, images, and videos, it can also generate outputs that appear plausible but are verifiably incorrect. Incorporating human judgment is crucial, especially in complex and high-risk decision-making scenarios. This involves building a human-in-the-loop process where humans play an active role in decision making alongside the AI system.

In this blog post, you will learn about prompt chaining, how to break a complex task into multiple tasks to use prompt chaining with an LLM in a specific order, and how to involve a human to review the response generated by the LLM.

Example overview

To illustrate this example, consider a retail company that allows purchasers to post product reviews on their website. By responding promptly to those reviews, the company demonstrates its commitments to customers and strengthens customer relationships.

Figure 1: Customer review and response

The example application in this post automates the process of responding to customer reviews. For most reviews, the system auto-generates a reply using an LLM. However, if the review or LLM-generated response contains uncertainty around toxicity or tone, the system flags it for a human reviewer. The human reviewer then assesses the flagged content to make the final decision about the toxicity or tone.

The application uses event-driven architecture (EDA), a powerful software design pattern that you can use to build decoupled systems by communicating through events. As soon as the product review is created, the review receiving system uses Amazon EventBridge to send an event that a product review is posted, along with the actual review content. The event starts an AWS Step Functions workflow. The workflow runs through a series of steps including generating content using an LLM and involving human decision making.

Figure 2: Review workflow

The process of generating a review response includes evaluating the toxicity of the review content, identifying sentiment, generating a response, and involving a human approver. This naturally fits into a workflow type of application because it’s a single process containing multiple sequential steps along with the need to manage state between steps. Hence the example uses Step Functions for workflow orchestration. Here are the steps in the review response workflow.

  1. Detect if the review content has any harmful information using the Amazon Comprehend DetectToxicContent API. The API responds with the toxicity score that represents the overall confidence score of detection between 0 and 1 with score closer to 1 indicating high toxicity.
  2. If toxicity of the review is in the range of 0.4 – 0.6, send the review to a human reviewer to make the decision.
  3. If the toxicity of the review is greater than 0.6 or the reviewer finds the review harmful, publish HARMFUL_CONTENT_DETECTED message.
  4. If the toxicity of the review is less than 0.4 or reviewer approves the review, find the sentiment of the review first and then generate the response to the review comment. Both tasks are achieved using a generative AI model.
  5. Repeat the toxicity detection through the Comprehend API for the LLM generated response.
  6. If the toxicity of the LLM generated response is in the range of 0.4 – 0.6, send the LLM generated response to a human reviewer.
  7. If the LLM generated response is found to be non-toxic, publish NEW_REVIEW_RESPONSE_CREATED event.
  8. If the LLM generated response is found to be toxic, publish RESPONSE_GENERATION_FAILED event.

Figure 3: product review evaluation and response workflow

Getting started

Use the instructions in the GitHub repository to deploy and run the application.

Prompt chaining

Prompt chaining simplifies the problem for the LLM by dividing single, detailed, and monolithic tasks into smaller, more manageable tasks. Some, but not all, LLMs are good at following all the instructions in a single prompt. The simplification results in writing focused prompts for the LLM, leading to a more consistent and accurate response. The following is a sample ineffective single prompt.

Read the below customer review, filter for harmful content and provide your thoughts on the overall sentiment in JSON format. Then construct an email response based on the sentiment you determine and enclose the email in JSON format. Based on the sentiment, write a report on how the product can be improved.

To make it more effective, you can split the prompt into multiple subtasks:

  1. Filter for harmful content
  2. Get the sentiment
  3. Generate the email response
  4. Write a report

You can even run some of the tasks in parallel. By breaking down to focused prompts, you achieve the following benefits:

  • You speed up the entire process. You can handle tasks in parallel, use different models for different tasks, and send response back to the user rather than waiting for the model to process a larger prompt for considerably longer time.
  • Better prompts provide better output. With focused prompts, you can engineer the prompts by adding additional relevant context thus improving the overall reliability of the output.
  • You spend less time developing. Prompt engineering is an iterative process. Both debugging LLM calls for detailed prompt and refining the larger prompt for accuracy require significant time and effort. Smaller tasks enable you to experiment and refine through successive iterations.

Step Functions is a natural fit to build prompt chaining because it offers multiple different ways to chain prompts: sequentially, in parallel, and iteratively by passing the state data from one state to another. Consider the situation where you have built the product review response prompt chaining workflow and now want to evaluate the responses from different LLMs to find the best fit using an evaluation test suite. The evaluation test suite consists of hundreds of test product reviews, a reference response to the review, and a set of rules to evaluate the LLM response against the reference response. You can automate the evaluation activity using a Step Functions workflow. The first task in the workflow asks the LLM to generate a review response for the product review. The second task then asks the LLM to compare the generated response to the reference response using the rules and generate an evaluation score. Based on the evaluation score for each review, you can decide if the LLM passes your evaluation criteria or not. You can use the map state in Step Functions to run the evaluations for each review in your evaluation test suite in parallel. See this repository for more prompt chaining examples.

Human in the loop

Involving human decision making in the example allows you to improve the accuracy of the system when the toxicity of the content cannot be determined to be either safe or harmful. You can implement human review within the Step Functions workflow using Wait for a Callback with the Task Token integration. When you use this integration with any supported AWS SDK API, the workflow task generates a unique token and then pauses until the token is returned. You can use this integration to include human decision making, call a legacy on-premises system, wait for completion of long running tasks, and so on.

"Wait for human approval for product review": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:{region}:{account}:function:human-approval-helper-product-review-response-automation-stage",
        "Payload": {
          "review_text.$": "$$.Execution.Input.review_text",
          "token.$": "$$.Task.Token",
          "api_url": "https://{apiID}.execute-api.{region}.amazonaws.com/dev"
}

In the sample application, the send email for approval task includes a wait for the callback token. It invokes an AWS Lambda function with a token and waits for the token. The Lambda function builds an email message along with the link to an Amazon API Gateway URL. Lambda then uses Amazon Simple Notification Service (Amazon SNS) to send an email to a human reviewer. The reviewer reviews the content and either accepts or rejects the message by selecting the appropriate link in the email. This action invokes the Step Functions SendTaskSuccess API. The API sends back the task token and a status message of whether to accept or reject the review. Step Functions receives the token, resumes the send email for approval task and then passes control to the choice state. The choice state decides whether to go through acceptance or rejection of the review based on the status message.

Figure 4: Human-in-the-loop workflow

Event-driven architecture

EDA enables building extensible architectures. You can add consumers at any time by subscribing to the event. For example, consider moderating images and videos attached to a product review in addition to the text content. You also need to write code to delete the images and videos if they are found harmful. You can add a consumer, the image moderation system, to the NEW_REVIEW_POSTED event without making any code changes to the existing event consumers or producers. Development of the image moderation system and the review response system to delete harmful images can proceed in parallel which in turn improves development velocity.

When the image moderation workflow finds toxic content, it publishes a HARMFULL_CONTENT_DETECTED event. The event can be processed by a review response system that decides what to do with the event. By decoupling systems through events, you gain many advantages including improved development velocity, variable scaling, and fault tolerance.

Figure 5: Event-driven workflow

Cleanup

Use the instructions in the GitHub repository to delete the sample application.

Conclusion

In this blog post, you learned how to build a generative AI application with prompt chaining and a human-review process. You learned how both techniques improve the accuracy and safety of a generative AI application. You also learned how event-driven architectures along with workflows can integrate existing applications with generative AI applications.

Visit Serverless Land for more Step Functions workflows.


About the authors

Veda Raman is a Senior Specialist Solutions Architect for Generative AI and machine learning based at AWS. Veda works with customers to help them architect efficient, secure and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon Sagemaker.

Uma Ramadoss is a Principal Solutions Architect at Amazon Web Services, focused on the Serverless and Integration Services. She is responsible for helping customers design and operate event-driven cloud-native applications using services like Lambda, API Gateway, EventBridge, Step Functions, and SQS. Uma has a hands on experience leading enterprise-scale serverless delivery projects and possesses strong working knowledge of event-driven, micro service and cloud architecture.

Read More

How LotteON built a personalized recommendation system using Amazon SageMaker and MLOps

How LotteON built a personalized recommendation system using Amazon SageMaker and MLOps

This post is co-written with HyeKyung Yang, Jieun Lim, and SeungBum Shim from LotteON.

LotteON aims to be a platform that not only sells products, but also provides a personalized recommendation experience tailored to your preferred lifestyle. LotteON operates various specialty stores, including fashion, beauty, luxury, and kids, and strives to provide a personalized shopping experience across all aspects of customers’ lifestyles.

To enhance the shopping experience of LotteON’s customers, the recommendation service development team is continuously improving the recommendation service to provide customers with the products they are looking for or may be interested in at the right time.

In this post, we share how LotteON improved their recommendation service using Amazon SageMaker and machine learning operations (MLOps).

Problem definition

Traditionally, the recommendation service was mainly provided by identifying the relationship between products and providing products that were highly relevant to the product selected by the customer. However, it was necessary to upgrade the recommendation service to analyze each customer’s taste and meet their needs. Therefore, we decided to introduce a deep learning-based recommendation algorithm that can identify not only linear relationships in the data, but also more complex relationships. For this reason, we built the MLOps architecture to manage the created models and provide real-time services.

Another requirement was to build a continuous integration and continuous delivery (CI/CD) pipeline that can be integrated with GitLab, a code repository used by existing recommendation platforms, to add newly developed recommendation models and create a structure that can continuously improve the quality of recommendation services through periodic retraining and redistribution of models.

In the following sections, we introduce the MLOps platform that we built to provide high-quality recommendations to our customers and the overall process of inferring a deep learning-based recommendation algorithm (Neural Collaborative Filtering) in real time and introducing it to LotteON.

Solution architecture

The following diagram illustrates the solution architecture for serving Neural Collaborative Filtering (NCF) algorithm-based recommendation models as MLOps. The main AWS services used are SageMaker, Amazon EMR, AWS CodeBuild, Amazon Simple Storage Service (Amazon S3), Amazon EventBridge, AWS Lambda, and Amazon API Gateway. We’ve combined several AWS services using Amazon SageMaker Pipelines and designed the architecture with the following components in mind:

  • Data preprocessing
  • Automated model training and deployment
  • Real-time inference through model serving
  • CI/CD structure

MLOps Architecture

The preceding architecture shows the MLOps data flow, which consists of three decoupled passes:

  • Code preparation and data preprocessing (blue)
  • Training pipeline and model deployment (green)
  • Real-time recommendation inference (brown)

Code preparation and data preprocessing

The preparation and preprocessing phase consists of the following steps:

  1. The data scientist publishes the deployment code containing the model and the training pipeline to GitLab, which is used by LotteON, and Jenkins uploads the code to Amazon S3.
  2. The EMR preprocessing batch runs through Airflow according to the specified schedule. The preprocessing data is loaded into MongoDB, which is used as a feature store along with Amazon S3.

Training pipeline and model deployment

The model training and deployment phase consists of the following steps:

  1. After the training data is uploaded to Amazon S3, CodeBuild runs based on the rules specified in EventBridge.
  2. The SageMaker pipeline predefined in CodeBuild runs, and sequentially runs steps such as preprocessing including provisioning, model training, and model registration.
  3. When training is complete (through the Lambda step), the deployed model is updated to the SageMaker endpoint.

Real-time recommendation inference

The inference phase consists of the following steps:

  1. The client application makes an inference request to the API gateway.
  2. The API gateway sends the request to Lambda, which makes an inference request to the model in the SageMaker endpoint to request a list of recommendations.
  3. Lambda receives the list of recommendations and provides them to the API gateway.
  4. The API gateway provides the list of recommendations to the client application using the Recommendation API.

Recommendation model using NCF

NCF is an algorithm based on a paper presented at the International World Wide Web Conference in 2017. It is an algorithm that covers the limitations of linear matrix factorization, which is often used in existing recommendation systems, with collaborative filtering based on the neural net. By adding non-linearity through the neural net, the authors were able to model a more complex relationship between users and items. The data for NCF is interaction data where users react to items, and the overall structure of the model is shown in the following figure (source: https://arxiv.org/abs/1708.05031).

NCF Model

Although NCF has a simple model architecture, it has shown a good performance, which is why we chose it to be the prototype for our MLOps platform. For more information about the model, refer to the paper Neural Collaborative Filtering.

In the following sections, we discuss how this solution helped us build the aforementioned MLOps components:

  • Data preprocessing
  • Automating model training and deployment
  • Real-time inference through model serving
  • CI/CD structure

MLOps component 1: Data preprocessing

For NCF, we used user-item interaction data, which requires significant resources to process the raw data collected at the application and transform it into a form suitable for learning. With Amazon EMR, which provides fully managed environments like Apache Hadoop and Spark, we were able to process data faster.

The data preprocessing batches were created by writing a shell script to run Amazon EMR through AWS Command Line Interface (AWS CLI) commands, which we registered to Airflow to run at specific intervals. When the preprocessing batch was complete, the training/test data needed for training was partitioned based on runtime and stored in Amazon S3. The following is an example of the AWS CLI command to run Amazon EMR:

aws emr create-cluster --release-label emr-6.0.0 
    --name "CLUSTER_NAME" 
    --applications Name=Hadoop Name=Hive Name=Spark 
    --tags 'Name=EMR-DATA-PREP' 'Owner=MODEL' 'Service=LOTTEON' 
    --ec2-attributes '{"KeyName":"keyname","InstanceProfile":"DefaultRole","ServiceAccessSecurityGroup":"sg-xxxxxxxxxxxxxx","SubnetId":"subnet- xxxxxxxxxxxxxx ","EmrManagedSlaveSecurityGroup":"sg- xxxxxxxxxxxxxx ","EmrManagedMasterSecurityGroup":"sg-xxxxxxxxxxxxxx "}' 
--instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"r5.xlarge","Name":"Master Instance Group"},{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"r5.xlarge","Name":"Core Instance Group"},{"InstanceCount":2,"BidPrice":"OnDemandPrice","InstanceGroupType":"TASK","InstanceType":"r5.xlarge","Name":"Task Instance Group"}]' 
    --service-role EMR_DefaultRole 
    --region ap-northeast-2 
    --steps Type=CUSTOM_JAR,Name=DATA_PREP,ActionOnFailure=CONTINUE,Jar=s3://ap-northeast-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://bucket/prefix/data_prep_batch.sh"]  
    --auto-terminate

MLOps component 2: Automated training and deployment of models

In this section, we discuss the components of the model training and deployment pipeline.

Event-based pipeline automation

After the preprocessing batch was complete and the training/test data was stored in Amazon S3, this event invoked CodeBuild and ran the training pipeline in SageMaker. In the process, the version of the result file of the preprocessing batch was recorded, enabling dynamic control of the version and management of the pipeline run history. We used EventBridge, Lambda, and CodeBuild to connect the data preprocessing steps run by Amazon EMR and the SageMaker learning pipeline on an event-based basis.

EventBridge is a serverless service that implements rules to receive events and direct them to destinations, based on the event patterns and destinations you establish. The initial role of EventBridge in our configuration was to invoke a Lambda function on the S3 object creation event when the preprocessing batch stored the training dataset in Amazon S3. The Lambda function dynamically modified the buildspec.yml file, which is indispensable when CodeBuild runs. These modifications encompassed the path, version, and partition information of the data that needed training, which is crucial for carrying out the training pipeline. The subsequent role of EventBridge was to dispatch events, instigated by the alteration of the buildspec.yml file, leading to running CodeBuild.

CodeBuild was responsible for building the source code where the SageMaker pipeline was defined. Throughout this process, it referred to the buildspec.yml file and ran processes such as cloning the source code and installing the libraries needed to build from the path defined in the file. The Project Build tab on the CodeBuild console allowed us to review the build’s success and failure history, along with a real-time log of the SageMaker pipeline’s performance.

SageMaker pipeline for training

SageMaker Pipelines helps you define the steps required for ML services, such as preprocessing, training, and deployment, using the SDK. Each step is visualized within SageMaker Studio, which is very helpful for managing models, and you can also manage the history of trained models and endpoints that can serve the models. You can also set up steps by attaching conditional statements to the results of the steps, so you can adopt only models with good retraining results or prepare for learning failures. Our pipeline contained the following high-level steps:

  • Model training
  • Model registration
  • Model creation
  • Model deployment

Each step is visualized in the pipeline in Amazon SageMaker Studio, and you can also see the results or progress of each step in real time, as shown in the following screenshot.

SageMaker Pipeline

Let’s walk through the steps from model training to deployment, using some code examples.

Train the model

First, you define a PyTorch Estimator to use for training and a training step. This requires you to have the training code (for example, train.py) ready in advance and pass the location of the code as an argument of the source_dir. The training step runs the training code you pass as an argument of the entry_point. By default, the training is done by launching the container in the instance you specify, so you’ll need to pass in the path to the training Docker image for the training environment you’ve developed. However, if you specify the framework for your estimator here, you can pass in the version of the framework and Python version to use, and it will automatically fetch the version-appropriate container image from Amazon ECR.

When you’re done defining your PyTorch Estimator, you need to define the steps involved in training it. You can do this by passing the PyTorch Estimator you defined earlier as an argument and the location of the input data. When you pass in the location of the input data, the SageMaker training job will download the train and test data to a specific path in the container using the format /opt/ml/input/data/<channel_name> (for example, /opt/ml/input/data/train).

In addition, when defining a PyTorch Estimator, you can use metric definitions to monitor the learning metrics generated while the model is being trained with Amazon CloudWatch. You can also specify the path where the results of the model artifacts after training are stored by specifying estimator_output_path, and you can use the parameters required for model training by specifying model_hyperparameters. See the following code:

from sagemaker.pytorch import PyTorch
metric_definitions=[
        {'Name': 'HR', 'Regex': 'HR=(.*?);'},
        {'Name': 'NDCG', 'Regex': 'NDCG=(.*?);'},
        {'Name': 'Loss', 'Regex': 'Loss=(.*?);'}
    ]
estimator_output_path = f's3://{bucket}/{prefix}'
model_hyperparameter = {'epochs': 10, 
                    'lr': 0.001,
                    'batch_size': 256,
                    'top_k' : 10,
                    'dropout' : 0.3,
                    'factor_num' : 32,
                    'num_layers' : 3
                }  
s3_code_uri = 's3://code_location/source.tar.gz'

host_estimator = PyTorch(
    entry_point="train.py", 
    source_dir = s3_code_uri, 
    output_path = estimator_output_path, 
    role=aws_role,
framework_version='1.8.1',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    session = pipeline_session,
    hyperparameters=model_hyperparameter,
    metric_definitions = metric_definitions
)

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
data_loc = f's3://{bucket}/{prefix}'
step_train = TrainingStep(
    name= "NCF-Training",
    estimator=host_estimator,
    inputs={
        "train": TrainingInput(s3_data=data_loc),
        "test": TrainingInput(s3_data=data_loc),        
    }
)

Create a model package group

The next step is to create a model package group to manage your trained models. By registering trained models in model packages, you can manage them by version, as shown in the following screenshot. This information allows you to reference previous versions of your models at any time. This process only needs to be done one time when you first train a model, and you can continue to add and update models as long as they declare the same group name.

Model Packages

See the following code:

import boto3
model_package_group_name = 'NCF'
sm_client = boto3.client("sagemaker")
model_package_group_input_dict = {
    "ModelPackageGroupName" : model_package_group_name,
    "ModelPackageGroupDescription" : "Model Package Group"
}
response = sm_client.list_model_package_groups(NameContains=model_package_group_name)
if len(response['ModelPackageGroupSummaryList']) == 0:
create_model_pacakge_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)

Add a trained model to a model package group

The next step is to add a trained model to the model package group you created. In the following code, when you declare the Model class, you get the result of the previous model training step, which creates a dependency between the steps. A step with a declared dependency can only be run if the previous step succeeds. However, you can use the DependsOn option to declare a dependency between steps even if the data is not causally related.

After the trained model is registered in the model package group, you can use this information to manage and track future model versions, create a real-time SageMaker endpoint, run a batch transform job, and more.

from sagemaker.workflow.model_step import ModelStep
from sagemaker.model import Model

inference_image_uri = '763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-inference:1.8.1-gpu-py3'
model = Model(
    image_uri=inference_image_uri,
    model_data = step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=pipeline_session,
)

register_model_step_args = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name=model_package_group_name,
    approval_status='Approved',        
)

step_model_registration = ModelStep(
    name="RegisterModel",
    step_args=register_model_step_args
)

Create a SageMaker model

To create a real-time endpoint, an endpoint configuration and model is required. To create a model, you need two basic elements: an S3 address where the model’s artifacts are stored, and the path to the inference Docker image that will run the model’s artifacts.

When creating a SageMaker model, you must pay attention to the following steps:

  • Provide the result of the model training step, step_train.properties.ModelArtifacts.S3ModelArtifacts, which will be converted to the S3 path where the model artifact is stored, as an argument of the model_data.
  • Because you specified the PyTorchModel class, framework_version, and py_version, you use this information to get the path to the inference Docker image through Amazon ECR. This is the inference Docker image that is used for model deployment. Make sure to enter the same PyTorch framework, Python version, and other details that you used to train the model. This means keeping the same PyTorch and Python versions for training and inference.
  • Provide the inference.py as the entry point script to handle invocations.

This step will set a dependency on the model package registration step you defined via the DependsOn option.

from sagemaker.pytorch.model import PyTorchModel
from sagemaker.workflow.model_step import ModelStep

model_name = 'NCF-MODEL'
s3_code_uri = 's3://code_location/source.tar.gz'

model_inference = PyTorchModel(
        name = model_name,
        model_data = step_train.properties.ModelArtifacts.S3ModelArtifacts, 
image_uri= image_uri,
        role=role,
        entry_point= 'inference.py',
        source_dir = s3_code_uri,
        framework_version='1.8.1',
        py_version='py3',
        model_server_workers=1,
        sagemaker_session=pipeline_session
                            )
step_model_create = ModelStep(
    name="ModelCreation",
    step_args=model_inference.create(instance_type = 'ml.p3.2xlarge'),
    depends_on=step_model_registration
)

Create a SageMaker endpoint

Now you need to define an endpoint configuration based on the created model, which will create an endpoint when deployed. Because the SageMaker Python SDK doesn’t support the step related to deployment (as of this writing), you can use Lambda to register that step. Pass the necessary arguments to Lambda, such as instance_type, and use that information to create the endpoint configuration first. Because you’re calling the endpoint based on endpoint_name, you need to make sure that variable is defined with a unique name. In the following Lambda function code, based on the endpoint_name, you update the model if the endpoint exists, and deploy a new one if it doesn’t:

# lambda_deploy_model.py
import json
import boto3
def lambda_handler(event, context):
    sm_client = boto3.client("sagemaker")
    model_name = event["model_name"]
    endpoint_config_name = event["endpoint_config_name"]
    endpoint_name = event["endpoint_name"]
    instance_type = event["instance_type"]
 
    create_endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "InstanceType": instance_type,
                "InitialVariantWeight": 1,
                "InitialInstanceCount": 1,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
            }
        ],
    )
    print(f"create_endpoint_config_response: {create_endpoint_config_response}")
    existing_endpoints = sm_client.list_endpoints(NameContains=endpoint_name)['Endpoints']
    if len(existing_endpoints["Endpoints"]) > 0:
        sm_client.update_endpoint(
            EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
        )
    else:
        sm_client.create_endpoint(
            EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
        )
    return {"statusCode": 200, "body": json.dumps("Endpoint Created Successfully")}

To get the Lambda function into a step in the SageMaker pipeline, you can use the SDK associated with the Lambda function. By passing the location of the Lambda function source as an argument of the function, you can automatically register and use the function. In conjunction with this, you can define LambdaStep and pass it the required arguments. See the following code:

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import (LambdaStep, LambdaOutput, LambdaOutputTypeEnum)
endpoint_name = 'NCF-ENDPOINT'
endpoint_config_name = 'NCF-CONF'
deploy_script_path = 's3://code_location/lambda_deploy_model.py'
deploy_model_func = Lambda(
    function_name='lambda-deploy-step',
    execution_role_arn=role,
    script=deploy_script_path,
    handler="lambda_deploy_model.lambda_handler"
)
output_param_1 = LambdaOutput(output_name="statusCode", output_type=LambdaOutputTypeEnum.String)
output_param_2 = LambdaOutput(output_name="body", output_type=LambdaOutputTypeEnum.String)

step_deploy_lambda = LambdaStep(
    name="LambdaDeployStep",
    lambda_func=deploy_model_func,
    inputs={
        "model_name": step_model_create.properties.ModelName,
        "endpoint_config_name": endpoint_config_name,
        "endpoint_name": endpoint_name,
        "instance_type": 'ml.p3.2xlarge',       
    },
    outputs=[output_param_1, output_param_2]
)

Create a SageMaker pipeline

Now you can create a pipeline using the steps you defined. You can do this by defining a name for the pipeline and passing in the steps to be used in the pipeline as arguments. After that, you can run the defined pipeline through the start function. See the following code:

from sagemaker.workflow.pipeline import Pipeline
pipeline_name = 'NCF-pipeline'
pipeline = Pipeline(
    name=pipeline_name,
    steps=[step_train, step_model_registration, step_model_create, step_deploy_lambda],
    sagemaker_session=pipeline_session,
)

pipeline.start()

After this process is complete, an endpoint is created with the trained model and is ready for use based on the deep learning-based model.

MLOps component 3: Real-time inference with model serving

Now let’s see how to invoke the model in real time from the created endpoint, which can also be accessed using the SageMaker SDK. The following code is an example of getting real-time inference values for input values from an endpoint deployed via the invoke_endpoint function. The features you pass as arguments to the body are passed as input to the endpoint, which returns the inference results in real time.

import boto3
sagemaker_runtime = boto3.client("sagemaker-runtime")
endpoint_name='NCF-ENDPOINT'
 
response = sagemaker_runtime.invoke_endpoint(
                    EndpointName=endpoint_name, 
                    Body=bytes("'features': '{"user": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], "item": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]}'}")
)
print(response['Body'].read())

When we configured the inference function, we had it return the items in the order that the user is most likely to like among the items passed in. The preceding example returns items from 1–25 in order of likelihood of being liked by the user at index 0.

We added business logic to the feature, configured it in Lambda, and connected it with an API gateway to implement the API’s ability to return recommended items in real time. We then conducted performance testing of the online service. We load tested it with Locust using five g4dn.2xlarge instances and found that it could be reliably served in an environment with 1,000 TPS.

MLOps component 4: CI/CD structure

A CI/CD structure is a fundamental part of DevOps, and is also an important part of organizing an MLOps environment. AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline collectively provide all the functionality you need for CI/CD, from code shaping to deployment, build, and batch management. The services are not only linked to the same code series, but also to other services such as GitHub and Jenkins, so if you have an existing CI/CD structure, you can use them separately to fill in the gaps. Therefore, we expanded our CI/CD structure by linking only the CodeBuild configuration described earlier to our existing CI/CD pipeline.

We linked our SageMaker notebooks with GitLab for code management, and when we were done, we replicated them to Amazon S3 via Jenkins. After that, we set the S3 path to the default repository path of the NCF CodeBuild project as described earlier, so that we could build the project with CodeBuild.

Conclusion

So far, we’ve seen the end-to-end process of configuring an MLOps environment using AWS services and providing real-time inference services based on deep learning models. By configuring an MLOps environment, we’ve created a foundation for providing high-quality services based on various algorithms to our customers. We’ve also created an environment where we can quickly proceed with prototype development and deployment. The NCF we developed with the prototyping algorithm was also able to achieve good results when it was put into service. In the future, the MLOps platform can help us quickly develop and experiment with models that match LotteON data to provide our customers with a progressively higher-quality recommendation experience.

Using SageMaker in conjunction with various AWS services has given us many advantages in developing and operating our services. As model developers, we didn’t have to worry about configuring the environment settings for frequently used packages and deep learning-related frameworks because the environment settings were configured for each library, and we felt that the connectivity and scalability between AWS services using AWS CLI commands and related SDKs were great. Additionally, as a service operator, it was good to track and monitor the services we were running because CloudWatch connected the logging and monitoring of each service.

You can also check out the NCF and MLOps configuration for hands-on practice on our GitHub repo (Korean).

We hope this post will help you configure your MLOps environment and provide real-time services using AWS services.


About the Authors

SeungBum Shim is a data engineer in the Lotte E-commerce Recommendation Platform Development Team, responsible for discovering ways to use and improve recommendation-related products through LotteON data analysis, and developing MLOps pipelines and ML/DL recommendation models.

HyeKyung Yang is a research engineer in the Lotte E-commerce Recommendation Platform Development Team and is in charge of developing ML/DL recommendation models by analyzing and utilizing various data and developing a dynamic A/B test environment.

Jieun Lim is a data engineer in the Lotte E-commerce Recommendation Platform Development Team and is in charge of operating LotteON’s personalized recommendation system and developing personalized recommendation models and dynamic A/B test environments.

Jesam Kim is an AWS Solutions Architect and helps enterprise customers adopt and troubleshoot cloud technologies and provides architectural design and technical support to address their business needs and challenges, especially in AIML areas such as recommendation services and generative AI.

Gonsoo Moon is an AWS AI/ML Specialist Solutions Architect and provides AI/ML technical support. His main role is to collaborate with customers to solve their AI/ML problems based on various use cases and production experience in AI/ML.

Read More

Build a serverless exam generator application from your own lecture content using Amazon Bedrock

Build a serverless exam generator application from your own lecture content using Amazon Bedrock

Crafting new questions for exams and quizzes can be tedious and time-consuming for educators. The time required varies based on factors like subject matter, question types, experience level, and class level. Multiple-choice questions require substantial time to generate quality distractors and ensure a single unambiguous answer, and composing effective true-false questions demands careful effort to avoid vagueness and assess deeper understanding. Creating high-quality assessment questions of any format necessitates meticulous attention to detail from educators in order to produce fair and valid student evaluations. To streamline this cumbersome process, we propose an automated exam generation solution based on Amazon Bedrock.

In this post, we explore how to build an application that generates tests tailored to your own lecture content. We cover the technical implementation using the Anthropic Claude large language model (LLM) on Amazon Bedrock and AWS Lambda deployed with the AWS Serverless Application Model (AWS SAM). This solution enables educators to instantly create curriculum-aligned assessments with minimal effort. Students can take personalized quizzes and get immediate feedback on their performance. This solution simplifies the exam creation process while benefiting both teachers and learners.

Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon using a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. In this post, we focus on a text generation use case, and can choose from Amazon Titan Text G1 and other models on Amazon Bedrock, including Anthropic Claude, AI21 Labs Jurassic, Meta Llama 2, and Cohere Command.

With the ability to scale up to 200,000-token context windows, Anthropic Claude v2.1 on Amazon Bedrock is our preferred choice for this post. It is typically helpful when working with lengthy documents such as entire books. When we talk about tokens, we refer to the smallest individual “atoms” of a language model, and can varyingly correspond to words, subwords, characters, or even bytes (in the case of Unicode). For Anthropic Claude on Amazon Bedrock, the average token is about 3.5 English characters. The 200,000 tokens supported by Anthropic Claude v2.1 on Amazon Bedrock would be equivalent to roughly 150,000 words or over 500 pages of documents.

This post demonstrates how to use advanced prompt engineering to control an LLM’s behavior and responses. It shows how to randomly generate questions and answers from lecture files, implemented as a simple serverless application.

Solution overview

The following diagram illustrates the application architecture. We distinguish two paths: the educator path (1) and the learner path (2).

As first-time users, both educator and learner need to complete the sign-up process, which is done by two separate Amazon Cognito user pools. For the educator, when the sign-up is complete, Amazon Cognito invokes the Lambda function called CognitoPostSignupFn to subscribe the educator to an Amazon Simple Notification Service (Amazon SNS) topic. The educator must approve the subscription to this topic in order to be notified by email with the scorecard of each learner who will be taking the generated exam.

Figure 1: Architectural diagram of the exam generator application

The workflow includes the following steps:

  1. The educator opens the landing page for generating an exam under the domain gen-exam.<your-domain-name> through Amazon Route 53, which redirects the request to the Application Load Balancer (ALB).

1.1 The ALB communicates with Amazon Cognito to authenticate the educator on the educator user pool.

1.2 The educator uploads a lecture as a PDF file into the exam generation front-end.

1.3 The Amazon Elastic Container Service (Amazon ECS) container running on AWS Fargate uploads the file to Amazon Simple Storage Service (Amazon S3) in the Examgen bucket under the prefix exams.

1.4 The S3 bucket is configured using event notification. Whenever a new file is uploaded, a PutObject is activated to send the file to the ExamGenFn Lambda function.

1.5 The Lambda function ExamGenFn invokes the Anthropic Claude v2.1 model on Amazon Bedrock to generate exam questions and answers as a JSON file.

1.6 The Amazon Bedrock API returns the output Q&A JSON file to the Lambda function.

1.7 The ExamGenFn Lambda function saves the output file to the same S3 bucket under the prefix Questions-bank. (You can choose to save it to a different S3 bucket.)

1.8 The ExamGenFn Lambda function sends an email notification to the educator through the SNS topic to notify that the exam has been generated.

  1. The learner opens the landing page to take the exam under the domain take-exam.<your-domain-name> through Route 53, which redirects the request to the ALB.

2.1 The ALB communicates with Amazon Cognito to authenticate the learner on the learner user pool.

2.2 The learner accesses the frontend and selects a test to take.

2.3 The container image sends the REST API request to Amazon API Gateway (using the GET method).

2.4 API Gateway communicates with the TakeExamFn Lambda function as a proxy.

2.5 The Lambda TakeExamFn function retrieves from S3 bucket under the prefix Questions-bank the available exam in JSON format.

2.6 The JSON file is returned to API Gateway.

2.7 API Gateway transmits the JSON file to the ECS container in the front-end.

2.8 The container presents the exam as a UI using the Streamlit framework. The learner then takes the exams. When the learner is finished and submits their answers, the ECS container performs a comparison between the answers provided and the correct answers, and then shows the score results to the learner.

2.9 The ECS container stores the scorecard in an Amazon DynamoDB table.

2.10 The Lambda DynamoDBTriggerFn function detects the new scorecard record on the DynamoDB table and sends an email notification to the educator with the learner’s scorecard.

This is an event-driven architecture made up of individual AWS services that are loosely integrated with each other, with each service handling a specific function. It uses AWS serverless technologies, allowing you build and run your application without having to manage your own servers. All server management is done by AWS, providing many benefits such as automatic scaling and built-in high availability, letting you take your idea to production quickly.

Prerequisites

In this section, we go through the prerequisite steps to complete before you can set up this solution.

Enable model access through Amazon Bedrock

You can add access to a model from the Amazon Bedrock console. For this walkthrough, you need to request access to the Anthropic Claude model on Amazon Bedrock. For more information, see Model access.

Install the necessary packages

You need to install the following:

Register a DNS domain and create certificates

If you don’t already have a DNS domain registered, you need to create one in order to not expose the DNS of your ALB. For instructions, refer to Registering a new domain.

You also need to request two public certificates, one for each front-end: gen-exam.<your-domain-name> and take-exam.<your-domain-name>. Refer to Requesting a public certificate to request a public certificate on AWS Certificate Manager.

Save the values for genCertificateArn and takeCertificateArn.

If you want to build the app in a development environment without using your own domain, you can uncomment the following section in the sam template:

# un-comment if you need to test with HTTP traffic and no certifcate
#  ExamGenALBHTTPListener:
#    Type: AWS::ElasticLoadBalancingV2::Listener
#    Properties:
#      LoadBalancerArn: !Ref ExamGenALB
#      Protocol: HTTP
#      Port: 80
#      DefaultActions:
#        - Type: forward
#          TargetGroupArn: !Ref ExamGenTG

Chain-of-Thought (CoT) Prompting

Before we embark on constructing the app, let’s delve into prompt engineering. We use Chain-of-Thought (CoT) Prompting, which allows the model to break down complex reasoning into smaller, more manageable steps. By providing the AI with intermediate prompts that guide its reasoning process step by step, CoT prompting enables the model to tackle sophisticated reasoning tasks. Guiding the AI through an analytical chain of thought in this way allows it to develop complex reasoning capabilities that would otherwise be beyond its unaided abilities.

In the ExamGenFn Lambda function, we use the following prompt to guide the model through reasoning steps. You can change the prompt and give it different personas and instructions, and see how it behaves.

template_instruction = f"""Human: 
You are a teacher during examination time and you are responsible for creating exam questions from the student study book.
Before creating the questions
- Analyze the book found between <exam_book> </exam_book> tags, to identify distinct chapters, sections, or themes for question generation.
- For true/false questions, select statements that can be clearly identified as true or false based on the book's content.
- For MCQs, develop questions that challenge the understanding of the material, ensuring one correct answer and {n_mcq_options-1} distractors that are relevant but incorrect.
- Randomize the selection of pages or topics for each run to generate a new set of questions, ensuring no two sets are identical.
Please provide the questions in this format exactly for MCQ:
- The output should be like     
"question": "What is the colour of the car in the book?",
"options": ["Blue", "Green", "Yellow", "Grey"],
"correct_answer": "Yellow"
For True/False:
- the output should be like     
"question": "is the sky Blue?",
"options": ["True", "False"],
"correct_answer": "True"
                               
Generate {n_tfq} true/false and {n_mcq} multiple-choice questions (MCQs) ensuring each question pertains to different pages or topics within the book. For MCQs, provide [n_mcq_options] options for each question. Focus on creating unique questions that cover a broad spectrum of the book's content, avoiding repetition and ensuring a diverse examination of the material. Use the following guidelines:
                               
1. True/False Questions:
- Craft each true/false question based on factual statements or key concepts from the book.
- Ensure each question spans a wide range of topics to cover the book comprehensively.
                               
                               
2. Multiple-Choice Questions (MCQs):
- Formulate each MCQ to assess understanding of significant themes, events, or facts.
- Include {n_mcq_options} options per MCQ, making sure one is correct and the others are plausible but incorrect.
- Diversify the content areas and pages/topics for each MCQ to avoid overlap and repetition. 
""" 

Build the exam generator application

The application presented in this post is available in the following GitHub repo with the building blocks code. Let’s start with a git pull on the repo.

We recommend using temporary credentials with the AWS CLI to make programmatic requests for AWS resources using the AWS CLI.

Build the front-end using Streamlit and Docker

You build two containers, one for generating exams and one for taking exams. Let’s start with building the generating exam Docker image:

  1. Go to the following path in the repo and build your Docker image:
user@exam-gen ~ % cd exam-gen-ai-blog/frontend/generate-exam-fe

user@exam-gen generate-exam-fe % docker build -t <your-image-name>:tag .
  1. Authenticate the Docker CLI to Amazon Elastic Container Registry (Amazon ECR):
aws ecr get-login-password --region <your-region> | docker login --username AWS --password-stdin <your-account-id>.dkr.ecr.<your-region>.amazonaws.com
  1. Create a new repository in Amazon ECR:
aws ecr create-repository --repository-name <your-repository-name>
  1. Tag your Docker image with the ECR repository URI:
docker tag <your-image-name>:tag your-account-id.dkr.ecr.<your-region>.amazonaws.com/<your-ecr-repository>:tag
  1. Push your tagged Docker image to your ECR repository:
docker push <your-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-ecr-repository>:tag
  1. Navigate to this path in the repo to build your Docker image for taking the exam:
user@exam-gen ~ % cd exam-gen-ai-blog/frontend/take-exam-fe
  1. Because the authentication and the ECR repo are already done, run directly the following command:
user@exam-gen take-exam-fe % docker build -t <your-image-name>:tag .

docker tag <your-image-name>:tag your-account-id.dkr.ecr.<your-region>.amazonaws.com/<your-ecr-repository>:tag

docker push <your-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-ecr-repository>:tag
  1. Copy the values for GenExamImageUri and TakeExamImageUri.

Now that you have both containers ready to run, let’s build the rest of the components using AWS SAM.

Build solution components with AWS SAM

AWS SAM consists of two parts:

  • AWS SAM template specification – An open source framework that you can use to define your serverless application infrastructure on AWS
  • AWS SAM CLI – A command line tool that you can use with AWS SAM templates and supported third-party integrations to build and run your serverless applications

For further information, refer to Using the AWS Serverless Application Model (AWS SAM).

  1. Go to the home directory user@exam-gen ~ % cd exam-gen-ai-blog and run the sam build command.

Before you run sam deploy, be aware of the following:

  • The ECS containers are deployed on Fargate, which needs a VPC with two subnets in different Availability Zones. We use the default VPC for simplicity. You can create your own VPC or use an existing one in your AWS account and update the sam template. To list your VPC IDs and subnets within a selected VPC ID, run the following commands to extract your VpcId and your two SubnetId:
aws ec2 describe-vpcs
aws ec2 describe-subnets
  • GenExamCallbackURL (for generating exam) and TakeExamCallbackURL (for taking exam) are used by Amazon Cognito. They are URLs where the user is redirected to after a successful sign-in.
  1. Now let’s deploy the sam template:
sam deploy --stack-name <your-stack-name> --guided 
 --parameter-overrides 
 DefaultVPCID="your-default-vpc-id" 
 SubnetIdOne="your-subnet-one-id" 
 SubnetIdTwo="your-subnet-two-id" 
 genCertificateArn="arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-id>" 
 takeCertificateArn="arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-id>" 
 GenExamImageUri="<your-gen-image-uri>" 
 TakeExamImageUri="<your-take-image-uri>" 
 GenExamCallbackURL="gen-exam.<your-domain-name>" 
 TakeExamCallbackURL="take-exam.<your-domain-name>" 
 NotificationEmail="your-email-address@example.com" 
 --capabilities CAPABILITY_NAMED_IAM 
        #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
        Confirm changes before deploy [Y/n]: n
        #SAM needs permission to be able to create roles to connect to the resources in your template
        Allow SAM CLI IAM role creation [Y/n]: y
        #Preserves the state of previously provisioned resources when an operation fails
        Disable rollback [Y/n]: n
        Save arguments to configuration file [Y/n]: n

        Looking for resources needed for deployment:
        Creating the required resources...

        Successfully created!

You can follow the creation on the AWS CloudFormation console.

This following video demonstrates running the sam build and sam deploy commands.

Figure 2: SAM build and SAM deploy execution

  1. The final step is to get the DNS names for the deployed ALB, map them to the certificate domains names in Route 53, and add them as a CNAME record.

Test the solution

You can use your browser to test the solution.

  1. Navigate to gen-exam.<your-domain-name>.

You’ll receive an email with a confirmation code.

  1. Enter the verification code and choose Confirm account.

Once verified, you will land on a page to generate your quiz.

  1. Choose the amount of multiple choice and true/false questions you want to generate, then choose Browse files to upload an input file.

For this example, we use the whitepaper AWS Cloud Adoption Framework: Security Perspective as our input file. We generate four multiple-choice questions and one true/false question.

  1. Confirm your subscription to the SNS topic (you’ll receive an email).

Then you’ll receive an email confirming the exam has been generated.

  1. Switch to take-exam.<your-domain-name>, and you’ll find the exam on the dropdown menu.
  1. Choose the exam, then choose Load quiz.

  1. Then you can take the exam and choose Submit to display the results.

The educator will receive an email with the scorecard of the learner.

You have just built a simple application that randomly generates questions and answers from uploaded documents. Learners can take the generated exams and educators can receive scorecards via email when tests are complete. The integration with the DynamoDB table allows you to store the responses on a long-term basis.

Expanding the solution

There are many possibilities to build on top of this and create a fully featured learning and testing application. One area of expansion is uploading multiple documents at once. As of this writing, users can only upload one document at a time, but support for bulk uploads would improve efficiency and make it easier to work with large sets of source materials. Educators could be empowered to gather and upload content from various documents and websites as source material for questions. This provides greater flexibility compared to using a single document. Moreover, with a data store, they could view and analyze learner answers via a scorecard interface to track progress over time.

Clean up

It’s important to clean up your resources in the following order:

  1. On the Amazon S3 console, empty the bucket by deleting any files and folders.
  1. On the AWS CloudFormation console, delete the stack.

Conclusion

In this post, we showed how to build a generative AI application powered by Amazon Bedrock that creates exam questions using lecture documents as input to support educators with an automated tool to continuously modernize quiz material and improve learners’ skills. Learners will be able to take the freshly generated exam and get the score results. With the capabilities of Amazon Bedrock and the AWS SAM, you can increase educators’ productivity and foster student success.

For more information on working with generative AI on AWS for education use cases, refer to Generative AI in education: Building AI solutions using course lecture content.


About the Authors

Merieme Ezzaouia is a Solutions Architect at AWS dedicated to the public sector. She helps customers in education and sports turn their concepts into tangible solutions, develop new services, and foster innovation. Beyond work, Merieme’s passions include gardening, traveling the world, and reading.

Mohammed Reda is a Solutions Architect at Amazon Web Services. He helps UK schools, universities, and EdTech companies adopt cloud technologies, improve their educational offerings, and innovate on AWS. Outside of work, Mohammed enjoys running and watching cooking shows.

Read More

Accelerate NLP inference with ONNX Runtime on AWS Graviton processors

Accelerate NLP inference with ONNX Runtime on AWS Graviton processors

ONNX is an open source machine learning (ML) framework that provides interoperability across a wide range of frameworks, operating systems, and hardware platforms. ONNX Runtime is the runtime engine used for model inference and training with ONNX.

AWS Graviton3 processors are optimized for ML workloads, including support for bfloat16, Scalable Vector Extension (SVE), and Matrix Multiplication (MMLA) instructions. Bfloat16 accelerated SGEMM kernels and int8 MMLA accelerated Quantized GEMM (QGEMM) kernels in ONNX have improved inference performance by up to 65% for fp32 inference and up to 30% for int8 quantized inference for several natural language processing (NLP) models on AWS Graviton3-based Amazon Elastic Compute Cloud (Amazon EC2) instances. Starting version v1.17.0, the ONNX Runtime supports these optimized kernels.

In this post, we show how to run ONNX Runtime inference on AWS Graviton3-based EC2 instances and how to configure them to use optimized GEMM kernels. We also demonstrate the resulting speedup through benchmarking.

Optimized GEMM kernels

ONNX Runtime supports the Microsoft Linear Algebra Subroutine (MLAS) backend as the default Execution Provider (EP) for deep learning operators. AWS Graviton3-based EC2 instances (c7g, m7g, r7g, c7gn, and Hpc7g instances) support bfloat16 format and MMLA instructions for the deep learning operator acceleration. These instructions improve the SIMD hardware utilization and reduce the end-to-end inference latency by up to 1.65 times compared to the armv8 DOT product instruction-based kernels.

The AWS team implemented MLAS kernels for bfloat16 fast math and int8 quantized General Matrix Multiply (GEMM) using BFMMLA, SMMLA, and UMMLA instructions, which have higher matrix multiplication throughput compared to DOT instructions. The bfloat16 support allows efficient deployment of models trained using bfloat16, fp32, and automatic mixed precision (AMP) without the need for quantization. As shown in the following diagrams, the optimized GEMM kernels are integrated into the ONNX Runtime CPU EP as MLAS kernels.

The first figure illustrates the ONNX software stack, highlighting (in orange) the components optimized for inference performance improvement on the AWS Graviton3 platform.

onnx_highlevel_stack_graviton_kernels

The following diagram illustrates the ONNX Runtime EP flow, highlighting (in orange) the components optimized for inference performance improvement on the AWS Graviton3 platform.

onnxruntime_flow_Graviton_kernels

Enable the optimizations

The optimizations are part of the ONNX Runtime 1.17.0 release, and are available starting with onnxruntime-1.17.0 python wheels and conda-1.17.0 packages. Optimized int8 kernels are enabled by default, and will be picked up automatically for AWS Graviton3 Processors. Bfloat16 fast math kernels, on the other hand, are not enabled by default and need the following session options in ONNX Runtime to enable them:

# For C++ applications

SessionOptions so; 
so.config_options.AddConfigEntry( kOrtSessionOptionsMlasGemmFastMathArm64Bfloat16, "1");

# For Python applications

sess_options = onnxruntime.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

Benchmark results

We started with measuring the inference throughput, in queries per second, for the fp32 model without any of our optimizations (using ONNX Runtime 1.16.0), which is marked at 1.0 with the red dotted line in the following graph. Then we compared the improvements from bfloat16 fast math kernels from ONNX Runtime 1.17.1 for the same fp32 model inference. The normalized results are plotted in the graph. You can see that for the BERT, RoBERTa, and GPT2 models, the throughput improvement is up to 65%. Similar improvements are observed for the inference latency.

fp32_perf_improvement_onnx

Similar to the preceding fp32 inference comparison graph, we started with measuring the inference throughput, in queries per second, for the int8 quantized model without any of our optimizations (using ONNX Runtime 1.16.0), which is marked at 1.0 with the red dotted line in the following graph. Then we compared the improvements from the optimized MMLA kernels from ONNX Runtime 1.17.1 for the same model inference. The normalized results are plotted in the graph. You can see that for the BERT, RoBERTa, and GPT2 models, the throughput improvement is up to 30%. Similar improvements are observed for the inference latency.

int8_perf_improvement_onnx

Benchmark setup

We used an AWS Graviton3-based c7g.4xl EC2 instance with Ubuntu 22.04 based AMI to demonstrate the performance improvements with the optimized GEMM kernels from ONNX Runtime. The instance and the AMI details are mentioned in the following snippet:

Instance: c7g.4xl instance
Region: us-west-2
AMI: ami-0a24e6e101933d294 (Ubuntu 22.04/Jammy with 6.5.0-1014-aws kernel)

The ONNX Runtime repo provides inference benchmarking scripts for transformers-based language models. The scripts support a wide range of models, frameworks, and formats. We picked PyTorch-based BERT, RoBERTa, and GPT models to cover the common language tasks like text classification, sentiment analysis, and predicting the masked word. The models cover both encoder and decoder transformers architecture.

The following code lists the steps to run inference for the fp32 model with bfloat16 fast math mode and int8 quantized mode using the ONNX Runtime benchmarking script. The script downloads the models, exports them to ONNX format, quantizes them into int8 for int8 inference, and runs inference for different sequence lengths and batch sizes. Upon successful completion of the script, it will print the inference throughput in queries/sec (QPS) and latency in msec along with the system configuration. Refer to the ONNX Runtime Benchmarking script for more details.

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install onnx and onnx runtime
# NOTE: We used 1.17.1 instead of 1.17.0 as it was the latest
# version available while collecting data for this post
python3 -m pip install onnx==1.15.0 onnxruntime==1.17.1

# Install the dependencies
python3 -m pip install transformers==4.38.1 torch==2.2.1 psutil==5.9.8

# Clone onnxruntime repo to get the benchmarking scripts
git clone --recursive https://github.com/microsoft/onnxruntime.git
cd onnxruntime
git checkout 430a086f22684ad0020819dc3e7712f36fe9f016
cd onnxruntime/python/tools/transformers

# To run bert-large fp32 inference with bfloat16 fast math mode
python3 benchmark.py -m bert-large-uncased -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run bert-base  fp32 inference with bfloat16 fast math mode
python3 benchmark.py -m bert-base-cased -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run roberta-base  fp32 inference with bfloat16 fast math mode
python3 benchmark.py -m roberta-base -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run gpt2  fp32 inference with bfloat16 fast math mode
python3 benchmark.py -m gpt2 -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run bert-large int8 quantized inference
python3 benchmark.py -m bert-large-uncased -p int8

# To run bert-base int8 quantized inference
python3 benchmark.py -m bert-base-cased -p int8

# To run roberta-base int8 quantized inference
python3 benchmark.py -m roberta-base -p int8

# To run gpt2 int8 quantized inference
python3 benchmark.py -m gpt2 -p int8

Conclusion

In this post, we discussed how to run ONNX Runtime inference on an AWS Graviton3-based EC2 instance and how to configure the instance to use optimized GEMM kernels. We also demonstrated the resulting speedups. We hope that you will give it a try!

If you find use cases where similar performance gains are not observed on AWS Graviton, please open an issue on the AWS Graviton Technical Guide GitHub to let us know about it.


About the Author

Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for Machine Learning and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions with Arm SoCs.

Read More

Learn how Amazon Ads created a generative AI-powered image generation capability using Amazon SageMaker

Learn how Amazon Ads created a generative AI-powered image generation capability using Amazon SageMaker

Amazon Ads helps advertisers and brands achieve their business goals by developing innovative solutions that reach millions of Amazon customers at every stage of their journey. At Amazon Ads, we believe that what makes advertising effective is delivering relevant ads in the right context and at the right moment within the consumer buying journey. With that goal, Amazon Ads has used artificial intelligence (AI), applied science, and analytics to help its customers drive desired business outcomes for nearly two decades.

In a March 2023 survey, Amazon Ads found that among advertisers who were unable to build successful campaigns, nearly 75 percent cited building the creative content as one of their biggest challenges. To help advertisers more seamlessly address this challenge, Amazon Ads rolled out an image generation capability that quickly and easily develops lifestyle imagery, which helps advertisers bring their brand stories to life. This blog post shares more about how generative AI solutions from Amazon Ads help brands create more visually rich consumer experiences.

In this blog post, we describe the architectural and operational details of how Amazon Ads implemented its generative AI-powered image creation solution on AWS. Before diving deeper into the solution, we start by highlighting the creative experience of an advertiser enabled by generative AI. Next, we present the solution architecture and process flows for machine learning (ML) model building, deployment, and inferencing. We end with lessons learned.

Advertiser creative experience

When building ad creative, advertisers prefer to customize the creative in a way that makes it relevant to their desired audiences. For example, an advertiser might have static images of their product against a white background. From an advertiser point of view, the process is handled in three steps:

  1. Image generation converts product-only images into rich, contextually relevant images using generative AI. The approach preserves the original product features, requiring no technical expertise.
  2. Anyone with access to the Amazon Ads console can create custom brand images without needing technical or design expertise.
  3. Advertisers can create multiple contextually relevant and engaging product images with no additional cost.

A benefit of the image-generation solution is the automatic creation of relevant product images based on product selection only, with no additional input required from the advertisers. While there are options to enhance background imagery such as prompts, themes, and custom product images, they are not necessary to generate compelling creative. If advertisers do not supply this information, the model will infer it based on information from their product listing on amazon.com.

An example screenshot from Amazon Ads generator where a product with various background.

Figure 1. An example from the image generation solution showing a hydro flask with various backgrounds.

Solution overview

Figure 2 shows a simplified solution architecture for inferencing and model deployment. The steps for the model development and deployment are shown in blue circles and depicted by roman-numerals (i,ii, … iv.) whereas inferencing steps are in orange with Hindu-Arabic numbers (1,2,… 8.).

AWS solution architecture showing the architecture for the Amazon Ads solution.

Figure 2. Solution architecture for inferencing and model deployment.

Amazon SageMaker is at the center of model development and deployment. The team used Amazon SageMaker JumpStart to rapidly prototype and iterate under their desired conditions (step i). Acting as a model hub, JumpStart provided a large selection of foundation models and the team quickly ran their benchmarks on candidate models. After selecting candidate large language models (LLMs), the science teams can proceed with the remaining steps by adding more customization. Amazon Ads applied scientists use SageMaker Studio as the web-based interface to work with SageMaker (step ii). SageMaker has the appropriate access policies to view some intermediary model results, which can be used for further experimentation (step iii).

The Amazon Ads team manually reviewed images at scale through a human-in-the-loop process where the team ensured that the application provides high quality and responsible images. To do that, the team deployed testing endpoints using SageMaker and generated a large number of images spanning various scenarios and conditions (step iv). Here, Amazon SageMaker Ground Truth allowed ML engineers to easily build the human-in-the-loop workflow (step v). The workflow allowed the Amazon Ads team to experiment with different foundation models and configurations through blind A/B testing to ensure that feedback to the generated images is unbiased. After the chosen model is ready to be moved into production, the model is deployed (step vi) using the team’s own in-house Model Lifecycle Manager tool. Under the hood, this tool uses artifacts generated by SageMaker (step vii) which is then deployed into the production AWS account (step viii), using SageMaker SDKs .

Regarding the inference, customers using Amazon Ads now have a new API to receive these generated images. The Amazon API Gateway receives the PUT request (step 1). The request is then processed by AWS Lambda, which uses AWS Step Functions to orchestrate the process (step 2). The product image is fetched from an image repository, which is a part of an existing solution predating this creative feature. The next step is to process customer text prompts and customize the image through content ingestion guardrails. Amazon Comprehend is used to detect undesired context in the text prompt, whereas Amazon Rekognition processes images for content moderation purposes (step 3). If the inputs pass the inspection, then the text continues as a prompt, while the image is processed by removing the background (step 4). Then, the deployed text-to-image model is used for image generation using the prompt and the processed image (step 5). The image is then uploaded into an Amazon Simple Storage Services (Amazon S3) bucket for images and the metadata about the image is stored in an Amazon DynamoDB table (step 6). This whole process starting from step 2 is orchestrated by AWS Step Functions. Finally, the Lambda function receives the image and meta-data (step 7) which are then sent to the Amazon Ads client service through the API Gateway (step 8).

Conclusion

This post presented the technical solution for the Amazon Ads generative AI-powered image generation solution, which advertisers can use to create customized brand images without needing a dedicated design team. Advertisers have a series of features to generate and customize images such as writing text prompts, selecting different themes, swapping the featured product, or uploading a new image of the product from their device or asset library allowing them to create impactful images for advertising their products.

The architecture uses modular microservices with separate components for model development, registry, model lifecycle management (which is an orchestration and step function-based solution to process advertiser inputs), select the appropriate model, and track the job throughout the service, and a customer facing API. Here, Amazon SageMaker is at the center of the solution, starting from JumpStart to final SageMaker deployment.

If you plan to build your generative AI application on Amazon SageMaker, the fastest way is with SageMaker JumpStart. Watch this presentation to learn how you can start your project with JumpStart.


About the Authors

Anita Lacea is the Single-Threaded Leader of generative AI image ads at Amazon, enabling advertisers to create visually stunning ads with the click of a button. Anita pairs her broad expertise across the hardware and software industry with the latest innovations in generative AI to develop performant and cost-optimized solutions for her customers, revolutionizing the way businesses connect with their audiences. She is passionate about traditional visual arts and is an exhibiting printmaker.

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.

Christopher de Beer is a senior software development engineer at Amazon located in Edinburgh, UK. With a background in visual design. He works on creative building products for advertising, focusing on video generation, helping advertisers to reach their customers through visual communication. Building products that automate creative production, using traditional as well as generative techniques, to reduce friction and delight customers. Outside of his work as an engineer Christopher is passionate about Human-Computer Interaction (HCI) and interface design.

Yashal Shakti Kanungo is an Applied Scientist III at Amazon Ads. His focus is on generative foundational models that take a variety of user inputs and generate text, images, and videos. It’s a blend of research and applied science, constantly pushing the boundaries of what’s possible in generative AI. Over the years, he has researched and deployed a variety of these models in production across the online advertising spectrum ranging from ad sourcing, click-prediction, headline generation, image generation, and more.

Sravan Sripada is a Senior Applied Scientist at Amazon located in Seattle, WA. His primary focus lies in developing generative AI models that enable advertisers to create engaging ad creatives (images, video, etc.) with minimal effort. Previously, he worked on utilizing machine learning for preventing fraud and abuse on the Amazon store platform. When not at work, He is passionate about engaging in outdoor activities and dedicating time to meditation.

Cathy Willcock is a Principal Technical Business Development Manager located in Seattle, WA. Cathy leads the AWS technical account team  supporting Amazon Ads adoption of AWS cloud technologies. Her team works across Amazon Ads enabling discovery, testing, design, analysis, and deployments of AWS services at scale, with a particular focus on innovation to shape the landscape across the AdTech and MarTech industry. Cathy has led engineering,  product, and marketing  teams and is an inventor of ground-to-air calling (1-800-RINGSKY).

Read More

RAG architecture with Voyage AI embedding models on Amazon SageMaker JumpStart and Anthropic Claude 3 models

RAG architecture with Voyage AI embedding models on Amazon SageMaker JumpStart and Anthropic Claude 3 models

This post is a guest post co-written with Tengyu Ma and Wen Phan from Voyage AI.

Organizations today have access to vast amounts of data, much of it proprietary, which holds the potential to unlock valuable insights when used effectively in generative artificial intelligence (AI) applications. Retrieval Augmented Generation (RAG) is a powerful technique designed to tap into this reservoir of information. By dynamically pulling relevant data from these extensive databases during the response generation process, RAG enables AI models to produce more accurate, relevant, and contextually rich outputs.

Embedding models are crucial components in the RAG architecture, serving as the foundation for effectively identifying and retrieving the most relevant information from a large dataset. These models convert large volumes of text into compact, numerical representations, allowing the system to quickly sift through and match query-related data with unprecedented precision. By facilitating a more efficient and accurate retrieval process, embedding models make sure that the generative component of RAG is fed with the most pertinent information.

In this post, we provide an overview of the state-of-the-art embedding models by Voyage AI and show a RAG implementation with Voyage AI’s text embedding model on Amazon SageMaker Jumpstart, Anthropic’s Claude 3 model on Amazon Bedrock, and Amazon OpenSearch Service. Voyage AI’s embedding models are the preferred embedding models for Anthropic. In addition to general-purpose embedding models, Voyage AI offers domain-specific embedding models that are tuned to a particular domain.

RAG architecture and embedding models

RAG is the predominant design pattern for enterprise chatbots where a retrieval system fetches validated sources and documents that are pertinent to the query and inputs them to a large language model (LLM) to generate a response. It combines the generative capabilities of models with the informational breadth found in vast databases, enabling the model to pull relevant external documents to enhance its responses. This results in outputs that are not only contextually rich but also factually accurate, significantly boosting the reliability and utility of LLMs across diverse applications.

Let’s briefly review RAG using the following figure.

RAG systems are empowered by semantic search using dense-vector representations of the documents called embeddings. These vectors are stored in a vector store, where they can be efficiently retrieved later. At query time, a query is also converted into a vector and then used to find and retrieve similar documents stored in the vector store via a k-nearest neighbor (k-NN) search against the document vector representations. Finally, the retrieved documents along with the query are used to prompt the generative model, often resulting in higher-quality responses and fewer hallucinations.

Embedding models are neural network models that transform queries and documents into embeddings. The retrieval quality is solely decided by how the data is represented as vectors, and the effectiveness of embedding models is evaluated based on their accuracy in retrieving relevant information. Therefore, the retrieval quality of the embedding models is highly correlated with the quality of the RAG system responses—to make your RAG more successful, you should consider improving your embeddings. Check out this blog for a detailed explanation.

Voyage AI’s general-purpose and domain-specific embedding models

Voyage AI develops cutting-edge embedding models with state-of-the-art retrieval accuracy. voyage-large-2 is Voyage’s most powerful generalist embedding model, outperforming popular competing models. Voyage also offers voyage-2, a base generalist embedding model optimized for latency and quality. The following table summarizes the Voyage embedding models currently available on SageMaker JumpStart.

Voyage AI Model SageMaker JumpStart Model ID Description
voyage-2 voyage-2-embedding General-purpose embedding model optimized for a balance between cost, latency, and retrieval quality
voyage-large-2 voyage-large-2-embedding General-purpose embedding model optimized for retrieval quality
voyage-code-2 voyage-code-2-embedding Domain-specific embedding model optimized for code retrieval (17% better than alternatives)

In addition to general-purpose embedding models, Voyage AI offers domain-specific ones that are tuned to a particular domain. These domain-specific embedding models are trained on massive domain-specific datasets, allowing them to deeply understand and excel in that domain. For example, Voyage’s code embedding model (voyage-code-2) outperforms general-purpose embedding models on code-related data documents, achieving about a 15% improvement over the next best model. This performance gap over the next best general-purpose embedding improves even more for datasets requiring deeper code understanding. See voyage-code-2: Elevate Your Code Retrieval for voyage-code-2 details. More recently, Voyage released a legal embedding model (voyage-law-2) that is optimized for legal retrieval and tops the MTEB leaderboard for legal retrieval. See Domain-Specific Embeddings and Retrieval: Legal Edition (voyage-law-2) for voyage-law-2 details. Voyage AI plans to continue releasing additional domain-specific embedding models in the near future, including finance, healthcare, and multi-language. For a list of all available Voyage AI embedding models, see Embeddings.

Voyage AI offers API endpoints for embedding models, making it seamless to integrate with other components of your RAG stack. The Voyage AI embedding models are available on AWS Marketplace and deployable as Amazon SageMaker endpoints within your account and VPC, eliminating security and compliance concerns. As part of SageMaker JumpStart, you can deploy Voyage AI embedding models with a few clicks and start running your RAG stack on AWS.

Solution overview

In this RAG solution, we use Voyage AI embedding models deployed with SageMaker JumpStart to demonstrate an example using the Apple 2022 annual report (SEC Form 10-K) as the corpus to retrieve from. Specifically, we deploy the SageMaker model package of the voyage-large-2 model. For the LLM, we use the Anthropic Claude 3 Sonnet model on Amazon Bedrock. We use OpenSearch Service as the vector store. You can also follow along with the notebook. The following diagram illustrates the solution architecture.

SageMaker JumpStart is the machine learning (ML) hub of SageMaker that offers one-click access to over 350 open source and third-party models. These models can be discovered and deployed through the Amazon SageMaker Studio UI or using the SageMaker Python SDK. SageMaker JumpStart provides notebooks to customize and deploy foundation models into your VPC.

Anthropic’s Claude 3 models are the next generation of state-of-the-art models from Anthropic. For the vast majority of workloads, Sonnet is faster on inputs and outputs than Anthropic’s Claude 2 and 2.1 models, with higher levels of intelligence. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like Anthropic through an API, making it straightforward to build generative AI applications. To follow along, be sure to request model access to Anthropic Claude 3 Sonnet on Amazon Bedrock.

Amazon OpenSearch Service is a managed service that makes it straightforward to deploy, operate, and scale OpenSearch, a popular open source, distributed search analytics suite derived from Elasticsearch. OpenSearch provides the ability to do vector search via the k-NN search.

Prerequisites

To follow along, you need to create an OpenSearch Service domain. For the purposes of this walkthrough, the Easy create option is fine. Keep the Enable fine-grained access control option selected. Select Create master user and provide a user name and password. After the domain has been created, the domain details will have the domain endpoint, which you’ll need—along with the user name and password—to access your OpenSearch instance. You don’t need to worry about creating an index or inserting data. We use the OpenSearch Python client to work with our vector store in the walkthrough.

Deploy Embedding model endpoint

To use voyage-large-2, you need to subscribe to the SageMaker model package in AWS Marketplace. For instructions, see Subscribe to the model package. Choosing the model card in the SageMaker JumpStart UI will also bring you to the model listing page on AWS Marketplace.

After you’re subscribed, you can initialize and deploy the embedding model as a SageMaker endpoint as follows:

# Set embedding endpoint configuration
(embedding_model_id, embedding_model_version, embedding_instance_type) = (
    "voyage-large-2-embedding",
    "*",
    "ml.g5.xlarge",  # See AWS Marketplace model package for supported instance types
)

# Instantiate embedding model from JumpStart
from sagemaker.jumpstart.model import JumpStartModel

embedding_model = JumpStartModel(
    model_id=embedding_model_id,
    model_version=embedding_model_version,
    instance_type=embedding_instance_type,
)

# Deploy model as inference endpoint. This can take several minutes to deploy (5 to 10 minutes)
embedding_endpoint = embedding_model.deploy()

Vectorize Documents

With the embedding endpoint deployed, you can index your documents for retrieval.

Transform and chunk documents

You need a list of strings to invoke the deployed voyage-large-2 model. For many documents, like our example annual report, each string is a semantically meaningful chunk of text. There are several ways you can load and chunk documents for vectorization. The code in this section is just one example; feel free to use what suits your data source and files.

In this walkthrough, we load and chunk the source PDF file with the LangChain PyPDFLoader (which uses pypdf) and recursive character text splitter:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("apple-10k-2022.pdf")
document_chunks = loader.load_and_split(
    RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
        is_separator_regex=False,
    )
)

In practice, selecting the text splitting chunk size and overlap requires some experimentation. The are many techniques for appropriately chunking documents for high-quality retrieval, but that is beyond the scope of this post.

Generate document embeddings

You can now vectorize your documents—or more precisely, your document chunks. See the following code:

# Set batch size
BATCH_SIZE = 45
In [ ]:
# Vectorize chunks in batches
index_list = []
for i in range(0, len(chunk_list), BATCH_SIZE):
    docs_playload = {
        "input": chunk_list[i:i + BATCH_SIZE],
        "input_type": "document",
        "truncation": "true",
    }

    embed_docs_response = embedding_endpoint.predict(json.dumps(docs_playload))

    doc_embeddings_list = [d["embedding"] for d in embed_docs_response["data"]]
    index_list += [
        {"document": document, "embedding": embedding} 
        for document, embedding in zip(chunk_list[i:i + BATCH_SIZE], doc_embeddings_list)
    ]

Create a vector store index

The next step is to populate your OpenSearch vector search index with the document embeddings using the OpenSearch Python client:

# Populate index with document, embedding, and ID
for id, i in zip(range(0, len(index_list)), index_list):
    index_response = opensearch_client.index(
        index=INDEX_NAME_OPENSEARCH,
        body={
            "document": i["document"],
            "embedding": i["embedding"],
        },
        id=id,
        refresh=True,
    )

Retrieve relevant documents

With your indexed vector store, you can now use embeddings to find relevant documents to your query:

# Set number of documents to retrieve
TOP_K = 3
In [ ]:
# Set vector search payload
vector_search_payload = {
    "size": TOP_K,
    "query": {"knn": {"embedding": {"vector": query_embedding, "k": TOP_K}}},
}
In [ ]:
vector_search_response = opensearch_client.search(
    index=INDEX_NAME_OPENSEARCH,
    body=vector_search_payload,
)

The following is a formatted semantic search result of the top three most-relevant document chunks, indicating the index ID, similarity score, and the first several characters of the chunk:

ID: 4
Score: 0.7956404
Document: under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. ☒
Indicate by check mark whether the Registrant is a shell company (as defined in Rule 12b-2 of the Act).
Yes  ☐ 	No  ☒
The aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022, the last business day of the Registrant’s most recently completed second fiscal quarter, was approximately $2,830,067,000,000. Solely for purposes of this disclosure, shares of common stock held by executive officers and directors of the Registrant as of such date have been excluded because such persons may be deemed to be affiliates. This determination of executive officers and directors as affiliates is not necessarily a conclusive determination for any other purposes.  15,908,118,000 shares of common stock were issued and outstanding as of October 14, 2022.
 
ID: 5
Score: 0.7367379
Document: 15,908,118,000 shares of common stock were issued and outstanding as of October 14, 2022.
DOCUMENTS INCORPORATED BY  REFERENCE
Portions of the Registrant’s definitive proxy statement relating to its 2023 annual meeting of shareholders are incorporated by reference into Part III of this Annual Report on Form 10-K where indicated. The Registrant’s definitive proxy statement will be filed with the U.S. Securities and Exchange Commission within 120 days after the end of the fiscal year to which this report relates.
 
ID: 178
Score: 0.7263324
Document: Note 3 – Financial Instruments
Cash, Cash Equivalents and Marketable Securities
The following tables show the Company’ s cash, cash equivalents and marketable securities by significant investment category as of September 24, 2022 and September 25, 2021 (in millions):
2022
Adjusted Cost
Unrealized Gains
Unrealized Losses
Fair Value
Cash and Cash Equivalents
Current Marketable Securities
Non-Current Marketable Securities
Cash $ 18,546 $ — $ — $ 18,546 $ 18,546 $ — $ —
Level 1 :
Money market funds 2,929 — — 2,929 2,929 — —
Mutual funds 274 — (47) 227 — 227 —
Subtotal 3,203 — (47) 3,156 2,929 227 —
Level 2 :
U.S. Treasury securities 25,134 — (1,725) 23,409 338 5,091 17,980
U.S. agency securities 5,823 — (655) 5,168 — 240 4,928
Non-U.S. government securities 16,948 2 (1,201) 15,749 — 8,806 6,943  	Certificates of deposit and time deposits 2,067 — — 2,067 1,805 262 —
Commercial paper 718 — — 718 28 690 —
Corporate debt securities 87,148 9 (7,707) 79,450 — 9,023 70,427

The top retrieved document chunk (ID 4 with a score of 0.7956404) contains a statement that provides a direct answer to our query:

The aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022, the last business day of the Registrant’s most recently completed second fiscal quarter, was approximately $2,830,067,000,000.

This additional context will enable Claude to provide a response that answers your query.

Generate a retrieval augmented response

You can now prompt Claude to use the retrieved documents to answer your query:

# Create retrieval-augmented prompt
rag_prompt = f"""Human:

INSTRUCTIONS:
Answer the QUERY using the CONTEXT text provided below. Keep your answer
grounded in the facts of the CONTEXT. If the CONTEXT doesn’t contain the
facts to answer the QUERY just respond with "I do not have enough context
to respond to this query.".

QUERY: {query}

CONTEXT: {context}

Assistant:
"""

Next initialize the Amazon Bedrock client to invoke Anthropic’s Claude3 Sonnet model in us-east-1.

# List available LLMs on Amazon Bedrock
bedrock_client = boto3.client('bedrock', region_name='us-east-1')
bedrock_fm = bedrock_client.list_foundation_models()
print([(m["modelId"], m["modelName"]) for m in bedrock_fm["modelSummaries"]])
In [ ]:
# Set Bedrock LLM model ID
LLM_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
In [ ]:
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name='us-east-1')
In [ ]:
# Invoke LLM inference
try:
    llm_rag_response = bedrock_runtime_client.invoke_model(
        modelId=LLM_MODEL_ID,
        body=json.dumps(rag_payload),
        accept="application/json",
        contentType="application/json",
    )
except botocore.exceptions.ClientError as error:
    print("Client error")
    raise error

Claude produces a correct response that answers the query using the context provided:

According to the context provided, "The aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022, the last business day of the Registrant's most recently completed second fiscal quarter, was approximately $2,830,067,000,000."

Generate a response without retrieved documents

For comparison, let’s see how Claude will respond to a query directly—without any additional context. Specifically, we ask “What is the aggregate market value of the voting and non-voting stock held by non-affiliates of the Registrant, as of March 25, 2022?”

The following is Claude’s response:

Unfortunately, without access to the specific financial filings or reports for the company you are asking about, I do not have enough information to determine the aggregate market value of the voting and non-voting stock held by non-affiliates as of March 25, 2022.

Public companies are required to disclose this information in their annual reports and other SEC filings, but I do not have access to those documents for the particular company or "Registrant" you are referring to. The aggregate market value calculation takes into account things like the number of outstanding shares, recent stock prices, and which shareholders are considered affiliates versus non-affiliates.

Unless you are able to provide the company's name and the specific report containing this data point, I cannot lookup or calculate that aggregate market value figure. My role is to provide helpful information to you, but I do not have a way to access each public company's proprietary financial data directly. Please let me know if you can share more details about the company in question.

Claude indicates it does not have enough context, especially because the query didn’t explicitly indicate the registrant (the company).

Clean up

If you’ve been following along, don’t forget to delete the SageMaker endpoint and delete the OpenSearch Service domain when you’re done so you don’t incur unnecessary costs:

embedding_endpoint.delete_model()
embedding_endpoint.delete_endpoint()

Conclusion

Embeddings are at the heart of a RAG system, and Voyage AI offers the best general-purpose and domain-specific embedding models today. Get started or level up your existing RAG stack on AWS today with Voyage AI embedding models—it’s seamless with SageMaker JumpStart. You can try the notebook in this post yourself. Learn more about Voyage AI and follow them on X (Twitter) or LinkedIn for updates!


About the Authors

Tengyu Ma is CEO and Co-Founder of Voyage AI and an assistant professor of computer science at Stanford University. His research interests broadly include topics in machine learning, algorithms and their theory, such as deep learning, (deep) reinforcement learning, pre-training / foundation models, robustness, non-convex optimization, distributed optimization, and high-dimensional statistics. Tengyu earned his PhD from Princeton University and has worked at Facebook and Google as visiting scientists.

Wen Phan is Head of Product at Voyage AI and has spent the last decade developing and commercializing AI and data products for enterprises. He has worked with hundreds of users and organizations around the world to apply AI and data to their use cases in financial services, healthcare, defense, and technology, to name a few. Wen holds a B.S. in electrical engineering and M.S. in analytics and decision sciences. Personally, he enjoys spinning hip-hop records, dining out, and spending time with his wife and two kids — oh, and guzzling cookies and cream milkshakes, too!

Vivek Gangasani is an AI/ML Solutions Architect working with Generative AI startups on AWS. He helps world leading AI startups train, host and operationalize LLMs to build innovative Generative AI solutions. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance at scale for LLMs. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Read More

Incorporate offline and online human – machine workflows into your generative AI applications on AWS

Incorporate offline and online human – machine workflows into your generative AI applications on AWS

Recent advances in artificial intelligence have led to the emergence of generative AI that can produce human-like novel content such as images, text, and audio. These models are pre-trained on massive datasets and, to sometimes fine-tuned with smaller sets of more task specific data. An important aspect of developing effective generative AI application is Reinforcement Learning from Human Feedback (RLHF). RLHF is a technique that combines rewards and comparisons, with human feedback to pre-train or fine-tune a machine learning (ML) model. Using evaluations and critiques of its outputs, a generative model can continue to refine and improve its performance. The interplay between Generative AI and human input paves the way for more accurate and responsible applications. You can learn how to improve your LLMs with RLHF on Amazon SageMaker, see Improving your LLMs with RLHF on Amazon SageMaker.

Athough RLHF is the predominant technique for incorporating human involvement, it is not the only available human in the loop technique. RLHF is an offline, asynchronous technique, where humans provide feedback on the generated outputs, based on input prompts. Humans can also add value by intervening into an existing communication happening between generative AI and users. For instance, as decided by AI or desired by the user, a human can be called into an existing conversation and take over the discussion.

In this post, we introduce a solution for integrating a “near-real-time human workflow” where humans are prompted by the generative AI system to take action when a situation or issue arises. This can also be a ruled-based method that can determine where, when and how your expert teams can be part of generative AI – user conversations. The entire conversation in this use case, starting with generative AI and then bringing in human agents who take over, is logged so that the interaction can be used as part of the knowledge base. Together with RLHF, near-real-time human-in-the-loop methods enable the development of responsible and effective generative AI applications.

This blog post uses RLHF as an offline human-in-the-loop approach and the near-real-time human intervention as an online approach. We present the solution and provide an example by simulating a case where the tier one AWS experts are notified to help customers using a chat-bot. We use an Amazon Titan model on Amazon Bedrock to find the sentiment of the customer using a Q&A bot and then notifying about negative sentiment to a human to take the appropriate actions. We also have another expert group providing feedback using Amazon SageMaker GroundTruth on completion quality for the RLHF based training. We used this feedback to finetune the model deployed on Amazon Bedrock to power the chat-bot. We provide LangChain and AWS SDK code-snippets, architecture and discussions to guide you on this important topic.

SageMaker GroudTruth

SageMaker Ground Truth offers the most comprehensive set of human-in-the-loop capabilities, allowing you to harness the power of human feedback across the ML lifecycle to improve the accuracy and relevancy of models. You can complete a variety of human-in-the-loop tasks with SageMaker Ground Truth, from data generation and annotation to model review, customization, and evaluation, through either a self-service or an AWS-managed offering.

Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon with a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that run tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

Example use-case

In this use case, we work with a generative AI powered Q&A bot, which answers questions about SageMaker. We built the RAG solution as detailed in the following GitHub repo and used SageMaker documentation as the knowledge base. You can build such chatbots following the same process. The interface of the Q&A looks like the following screenshot. Amazon SageMaker Sample and used Amazon SageMaker documentation as the knowledge base. You can easily build such chatbots following the same process. Eventually, the interface of the Q&A looks like in Figure 1.

UI and the Chatbot example application to test human-workflow scenario.

Figure 1. UI and the Chatbot example application to test human-workflow scenario.

In this scenario, we incorporate two human workflows to increase customer satisfaction. The first is to send the interactions to human experts to assess and provide scores. This is an offline process that is part of the RLHF. A second real-time human workflow is initiated as decided by the LLM. We use a simple notification workflow in this post, but you can use any real-time human workflow to take over the AI-human conversation.

Solution overview

The solution consists of three main modules:

  • Near real-time human engagement workflow
  • Offline human feedback workflow for RLHF
  • Fine-tuning and deployment for RLHF

The RLHF and real-time human engagement workflows are independent. Therefore, you can use either or both based on your needs. In both scenarios, fine-tuning is a common final step to incorporate these learnings into LLMs. In the following sections, we provide the details about incorporating these steps one by one and divide the solution into related sections for you to choose and deploy.

The following diagram illustrates the solution architecture and workflow.

Solutions architecture for human-machine workflow modules

Figure 2. Solutions architecture for human-machine workflow modules

Implementation

Prerequisites

Our solution is an add-on to an existing Generative AI application. In our example, we used a Q&A chatbot for SageMaker as explained in the previous section. However, you can also bring your own application. The blog post assumes that you have expert teams or workforce who performs reviews or join workflows.

Build a near real-time human engagement workflow workflow

This section presents how an LLM can invoke a human workflow to perform a predefined activity. We use AWS Step Functions which is a serverless workflow orchestration service that you can use for human-machine workflows. In our case, we call the human experts into action, in real time, but you can build any workflow following the tutorial Deploying an Example Human Approval Project.

Decision workflow to trigger real time human engagement

In this scenario, the customer interacts with the Q&A bot (Step-1 in the previous architecture diagram), and if the interaction shows strong negative sentiment, it will invoke a pre-existing human workflow (Step-2 in Figure 2). In our case, it is a simple email notification (Step-3 in Figure 2) but you can extend this interaction such as including the experts into the chat-zone to take over the conversation and more (Step-4 in Figure 2).

Before we dive deep into the solution, it is important to discuss the workflow logic. The following figure shows the details of the decision workflow. The interaction starts with a customer communication. Here, before the LLM provides an answer to the customer request, the prompt-chain starts with an internal prompt asking the LLM to go over the customer response and look for clear negative sentiment. This prompt and internal sentiment analysis are not visible to customer. This is an internal chain before proceeding with the next steps of which responses may be reflected to the customer based on your preference. If the sentiment is negative, the next step is to trigger a pre-built engagement human-workflow while the chatbot informs the customer about the extra support coming to help. Otherwise, if the sentiment is neutral or positive, the normal response to the customer request will be provided.

This workflow is a demonstrative example and you can add to or modify it as you prefer. For example, you can make any other decision check, not limited to sentiment. You can also prepare your own response to the customer with the right prompting the chain so that you can implement your designed customer experience. Here, our simple example demonstrates how you can easily build such prompt in chains and engage external existing workflows, in our case, it is a human-workflow using Amazon Bedrock. We also use the same LLM to respond to this internal sentiment prompt check for simplicity. However, you can include different LLMs, which might have been fine-tuned for specific tasks, such as sentiment analysis, so that you rely on a different LLM for the Q&A chatbot experience. Adding more serial steps into chains increases the latency because now the customer query or request is being processed more than once.

Real-time (online) human workflow triggered by LLM.

Figure 3. Real-time (online) human workflow triggered by LLM.

Implementing the decision workflow with Amazon Bedrock

To implement the decision workflow, we used Amazon Bedrock and its LangChain integrations. The prompt chain is run through SequentialChain from LangChain. Because our human workflow is orchestrated with Step Functions, we also use LangChain’s StepFunction library.

  1. First, define the LLM and prompt template:
    prompt = PromptTemplate(
    input_variables=["text"],
    template="{text}",)
    llm = Bedrock(model_id="amazon.titan-tg1-large")
    llmchain_toxic = LLMChain(llm=llm, prompt=prompt,output_key="response")

  2. Then you feed the response from the first LLM to the next LLM through an LLM chain, where the second instruct is to find the sentiment of the response. We also instruct the LLM to provide 0 as positive and 1 as negative response.
    templateResponseSentiment="""Find the sentiment of below sentence, respond 0 if positive and respond 1 if negative
    {response} """
    
    prompt_sentiment= PromptTemplate( input_variables=["response"], template = templateResponseSentiment)
    llmchain_sentiment= LLMChain(llm=llm, prompt=prompt_sentiment,output_key="sentiment")
    
    from langchain.chains import SequentialChain
    overall_chain = SequentialChain(chains=[llmchain_toxic, llmchain_sentiment], input_variables=["text"],output_variables=["response", "sentiment"],verbose=True)

  3. Run a sequential chain to find the sentiment:
    response= overall_chain({ "text": "Can you code for me for SageMaker" })
    print("response payload " + str(response))
    print("n response sentiment: " + response['sentiment'])

  4. If the sentiment is negative, the model doesn’t provide the response back to customer, instead it invokes a workflow that will notify a human in loop:
    if "1" in response_sentiment['sentiment'] : # 1 represents negative sentiment
    print('triggered workflow, check email of the human on notification and add to workflow anything else you may want')
    lambda_client = boto3.client('lambda')
    #create input - send the response from LLM and detected sentiment
    lambda_payload1="{"response": "" + response['text'] +"","response_sentiment": " + ""1"}"
    lambda_client.invoke(FunctionName='triggerWorkflow', InvocationType='Event', Payload=lambda_payload1)

If you choose to have your human experts join a chat with the users, you can add these interactions of your expert teams to your knowledge base. This way, when the same or similar issue is raised, the chatbot can use these in their answers. In this post, we did not show this method, but you can create a knowledge base in Amazon Bedrock to use these human-to-human interactions for future conversations in your chatbot.

Build an offline human feedback workflow

In this scenario, we assume that the chat transcripts are stored in an Amazon Simple Storage Service (Amazon S3) bucket in JSON format, a typical chat transcript format, for the human experts to provide annotations and labels on each LLM response. The transcripts are sent for a labeling task performed by a labeling workforce using Amazon SageMaker Ground Truth. However, in some cases, it’s impossible to label all the transcripts due to resource limitations. In these cases, you may want to randomly sample the transcripts or use a pattern that can be sent to the labeling workforce based on your business case.

Pre-annotation Lambda function
The process starts with an AWS Lambda function. The pre-annotation Lambda function is invoked based on chron job or based on an event or on-demand. Here, we use the on-demand option. SageMaker Ground Truth sends the Lambda function a JSON-formatted request to provide details about the labeling job and the data object. More information can be found here. Following is the code snippet for the pre-processing Lambda function:

import json
def lambda_handler(event, context):
return {
"taskInput": event['dataObject']
}

# JSON formatted request

{
"version": "2018-10-16",
"labelingJobArn": <labelingJobArn>
"dataObject" : {
"source-ref": <s3Uri where dataset containing the chabot responses are stored>
}
}

Custom workflow for SageMaker Ground Truth
The remaining part of sending the examples, UI, and storing the results of the feedback are performed by SageMaker Ground Truth and invoked by the pre-annotation Lambda function. We use the labeling job with the custom template option in SageMaker Ground Truth. The workflow allows labelers to rate the relevance of an answer to a question from 1–5, with 5 being the most relevant. Here, we assumed a conventional RLHF workflow where the labeling workforce provides the score based on their expectation from the LLM in this situation. The following code shows an example:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
<crowd-classifier
name="relevance"
categories="['1', '2', '3', '4', '5']"
header="How relevant is the below answer to the question: {{ task.input.source }}"
>
<classification-target>
{{ task.input.source }}
</classification-target>
<full-instructions header="Conversation Relevance Instructions">
<h2>How relevant is the below answer to the given question?</h2>
</full-instructions>
<short-instructions>
How relevant is the below answer to the question: {{ task.input.source }}
</short-instructions>
</crowd-classifier>
</crowd-form>

In our scenario, we used the following UI for our labeling workers to score the complete response given for the prompt. This provides feedback on the answer to a question given by the chatbot, marking it as 1–5, with 5 being most the relevant answer to the question.

Two examples from RLHF feedback UI.Two examples from RLHF feedback UI.

Figure 4. Two examples from RLHF feedback UI.

Post annotation Lambda function
When all workers complete the labeling task, SageMaker Ground Truth invokes the post-annotation Lambda function with a pointer to the dataset object and the workers’ annotations. This post-processing Lambda function is generally used for annotation consolidation, which has SageMaker Ground Truth create a  manifest file and uploads it to an S3 bucket for persistently storing consolidated annotations. The following code shows the postprocessing Lambda function:

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
consolidated_labels = []

parsed_url = urlparse(event['payload']['s3Uri']);
s3 = boto3.client('s3')
textFile = s3.get_object(Bucket = parsed_url.netloc, Key = parsed_url.path[1:])
filecont = textFile['Body'].read()
annotations = json.loads(filecont);

for dataset in annotations:
for annotation in dataset['annotations']:
new_annotation = json.loads(annotation['annotationData']['content'])
label = {
'datasetObjectId': dataset['datasetObjectId'],
'consolidatedAnnotation' : {
'content': {
event['labelAttributeName']: {
'workerId': annotation['workerId'],
'result': new_annotation,
'labeledContent': dataset['dataObject']
}
}
}
}
consolidated_labels.append(label)

return consolidated_labels

You can use the output manifest file to further fine-tune your LLM model, as detailed in the next section. The following code is a snippet of the created manifest file:

JSON:

{"source":"what is amazon SageMaker?,AWS SageMaker is a machine learning service that allows you to train and deploy machine learning models in the cloud.","RHLF-custom-feedback":{"workerId":"private.us-east-1.8c185c045aed3bef","result":{"relevance":{"label":"5 - Highly Relevant"}},"labeledContent":{"content":"what is amazon SageMaker?,AWS SageMaker is a machine learning service that allows you to train and deploy machine learning models in the cloud."}},"RHLF-custom-feedback-metadata":{"type":"groundtruth/custom","job-name":"rhlf-custom-feedback","human-annotated":"yes","creation-date":"2023-08-09T02:46:05.852000"}}

Fine-tune the LLM using RLHF

To demonstrate RLHF in both near real-time and offline workflows, we collected 50 human-annotated samples using SageMaker Ground Truth. The data is used for RLHF training on a Flan-T5 XL model by PEFT/LoRA with 8-bit quantization:

from peft import LoraConfig

lora_config = LoraConfig(
r=32,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q", "v"],
bias="none",
task_type="SEQ_2_SEQ_LM",
)

The training uses the learning rate 1e-5 for 10 epochs, and the batch size = 1 to use one sample at a time.

learning_rate=1e-5
max_ppo_epochs=10
mini_batch_size=1
batch_size=1

config = PPOConfig(
model_name=model,
learning_rate=learning_rate,
ppo_epochs=max_ppo_epochs,
mini_batch_size=mini_batch_size,
batch_size=batch_size,
)

ppo_trainer = PPOTrainer(config=config,
model=ppo_model,
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset["train"],
data_collator=collator,
)

Because there are only 50 human-annotated samples collected from SageMaker Ground Truth, it is not sufficient to train a reward model for reinforcement learning. Therefore, we decided to take the annotated evaluation score for each sample and use them as the reward value in the reinforcement learning process. This should be close enough to the reward value generated from a reward model. Our experiment showed that this method is effective for a small training set. You can see the curve of the training process in the following chart.

reward/mean chart

Figure 5. Reward/mean chart

After the training, we replaced the Flan-T5 foundation model in the AWS support chatbot with the RLHF trained model. In the following examples, you can observe that the response quality after RLHF is improved and the answers are more comprehensive and contain more useful information:

  • Question: How does SageMaker protect my data?
    Response before RLHF: SageMaker stores code in ML storage volumes
    Response after RLHF: SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
  • Question: What is Amazon SageMaker?
    Response before RLHF: AWS SageMaker is a machine learning service that allows you to train and deploy machine learning models in the cloud.
    Response after RLHF: A fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.

Clean up

To clean up your resources, first start by stopping and deactivating any active human workflow or fine-tuning jobs. Removing the prompt chaining is a good start for de-coupling the workflows from your existing application. Then, continue by deleting the resources for the real-time human workflow manually. Finally, delete the RLHF resources. If you created a new Q&A chatbot application, then first stop and then delete the resources used for the Q&A chatbot part of the blogpost.

Conclusion

This post presented solutions for incorporating both offline and online human workflows into generative AI applications on AWS. The offline human feedback workflow uses SageMaker Ground Truth to collect human evaluations on chatbot responses. These evaluations are used to provide reward signals for fine-tuning the chatbot’s underlying language model with RLHF. The online human workflow uses LangChain and Step Functions to invoke real-time human intervention based on sentiment analysis of the chatbot responses. This allows human experts to seamlessly take over or step into conversations when the AI reaches its limits. This capability is important for implementations that require using your existing expert teams in critical, sensitive, or determined topics and themes. Together, these human-in-the-loop techniques, offline RLHF workflows, and online real-time workflows enable you to develop responsible and robust generative AI applications.

The provided solutions integrate multiple AWS services, like Amazon Bedrock, SageMaker, SageMaker Ground Truth, Lambda, Amazon S3, and Step Functions. By following the architectures, code snippets, and examples discussed in this post, you can start incorporating human oversight into your own generative AI applications on AWS. This paves the way towards higher-quality completions and building trustworthy AI solutions that complement and collaborate with human intelligence.

Building generative AI applications is effortless with Amazon Bedrock. We recommend starting your experiments following this Quick Start with Bedrock.


About the Authors

Tulip Gupta is a Senior Solutions Architect at Amazon Web Services. She works with Amazon media and entertainment (M&E) customers to design, build, and deploy technology solutions on AWS, and has a particular interest in Gen AI and machine learning focussed on M&E. She assists customers in adopting best practices while deploying solutions in AWS. Linkedin

BurakBurak Gozluku is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.

YunfeiYunfei bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

RachnaRachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in future and bring economical and social prosperity. In her spare time, Rachna likes spending time with her family, hiking and listening to music.

Read More