Enable single sign-on access of Amazon SageMaker Canvas using AWS IAM Identity Center: Part 2

Amazon SageMaker Canvas allows you to use machine learning (ML) to generate predictions without having to write any code. It does so by covering the end-to-end ML workflow: whether you’re looking for powerful data preparation and AutoML, managed endpoint deployment, simplified MLOps capabilities, or the ability to configure foundation models for generative AI, SageMaker Canvas can help you achieve your goals.

To enable agility for your users while ensuring secure environments, you can adopt single sign-on (SSO) using AWS IAM Identity Center, which is the recommended AWS service for managing user access to AWS resources. With IAM Identity Center, you can create or connect workforce users and centrally manage their access across all their AWS accounts and applications.

Part 1 of this series describes the necessary steps to configure SSO for SageMaker Canvas using IAM Identity Center for Amazon SageMaker Studio Classic.

In this post, we walk you through the necessary steps to configure SSO for SageMaker Canvas using IAM Identity Center for the updated Amazon SageMaker Studio. Your users can seamlessly access SageMaker Canvas with their credentials from IAM Identity Center without having to first go through the AWS Management Console. We also demonstrate how you can streamline user management with IAM Identity Center.

Solution overview

To configure SSO from IAM Identity Center, you need to complete the following steps:

  1. Enable IAM Identity Center using AWS Organizations
  2. Create a SageMaker Studio domain that uses IAM Identity Center for user authentication
  3. Create users or groups in IAM Identity Center
  4. Add users or groups to the SageMaker Studio domain

We will also show how to rename the SageMaker Studio application to clearly identify it as SageMaker Canvas, and how to access it using IAM Identity Center.

Enable IAM Identity Center

Follow these steps to connect SageMaker Canvas to IAM Identity Center:

  1. On the IAM Identity Center console, choose Enable.
  2. Choose Enable with AWS Organizations.
  3. Choose Edit to add an instance name.
  4. Enter a name for your instance (for this post, canvas-app).
  5. Choose Save changes.

Create the SageMaker Studio domain

In this section, we create a SageMaker Studio domain and configure IAM Identity Center as its authentication method. Complete the following steps:

  1. On the SageMaker console, choose Domains.
  2. Choose Create domain.
  3. Choose Set up for organizations.
  4. Choose Set up.
  5. Enter a domain name of your choice (for this post, canvas-domain).
  6. Choose Next.
  7. Select AWS Identity Center.
  8. Choose Create a new role.
  9. Select the SageMaker Canvas permissions that you want to grant.

For more details about permissions, see Users and ML Activities.

  1. Specify one or more Amazon Simple Storage Service (Amazon S3) buckets.
  2. Choose Next.
  3. Select SageMaker Studio – New.
  4. Choose Next.

Next, you can provide VPC details for your network configuration.

  1. For this post, we select Public internet access.
  2. Choose your VPC, subnets, and security groups.
  3. Choose Next.
  4. Keep default storage configuration and choose Next.
  5. Choose Submit.

Wait for the SageMaker domain status to change to InService.
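If you prefer to script the wait instead of watching the console, the following is a minimal sketch that polls the domain status. It assumes boto3 is installed and that your credentials allow sagemaker:DescribeDomain; the helper names, the polling interval, and the retry limit are illustrative choices, not part of the original walkthrough.

```python
import time

# Statuses that indicate the domain will not reach InService on its own.
FAILED_STATUSES = ("Failed", "Update_Failed", "Delete_Failed")


def wait_for_domain(describe_domain, domain_id, poll_seconds=30, max_polls=40):
    """Poll until the domain reports InService and return that status.

    describe_domain is any callable with the same keyword interface as the
    boto3 SageMaker client's describe_domain method.
    """
    status = None
    for _ in range(max_polls):
        status = describe_domain(DomainId=domain_id)["Status"]
        if status == "InService":
            return status
        if status in FAILED_STATUSES:
            raise RuntimeError(f"Domain {domain_id} entered status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Domain {domain_id} is still {status} after {max_polls} polls")


def wait_with_boto3(domain_id):
    """Convenience wrapper (hypothetical name) that uses a real boto3 client."""
    import boto3  # imported lazily so the polling logic has no hard dependency

    client = boto3.client("sagemaker")
    return wait_for_domain(client.describe_domain, domain_id)
```

Call wait_with_boto3 with the domain ID shown on the Domains page once you have submitted the domain.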

Rename the SageMaker Studio application

Before we create a user, let’s rename the SageMaker Studio application. This allows users to quickly identify the SageMaker Canvas application when they log in through IAM Identity Center, where they may have access to multiple applications.

  1. On the IAM Identity Center console, choose Applications.
  2. Choose the SageMaker Studio application on the AWS managed tab.
  3. Choose Edit details on the Actions menu.
  4. For Display name, enter a name (for this post, Canvas).
  5. For Description, enter a description.
  6. Choose Save changes.

Create a user in IAM Identity Center

Now you can create users, and optionally, groups, that will be given access to SageMaker Canvas. For this post, we create a single user to demonstrate the process to provide access. However, groups are typically preferred for better user management, and to provision access in organizations.

A user group is a collection of users. Groups let you specify permissions for multiple users, which can make it more straightforward to manage the permissions for those users. For example, you could have a user group called business analysts and give that user group permission to SageMaker Canvas; all users in that group will have SageMaker Canvas access. If a new user joins your organization and needs access to SageMaker Canvas, you can add the user to the business analysts group. If a person changes jobs in your organization, instead of editing that user’s permissions, you can remove them from the old user groups and add them to the appropriate new user groups.

Complete the following steps to create a user in IAM Identity Center to test the SageMaker Canvas application access:

  1. On the IAM Identity Center console, choose Users in the navigation pane.
  2. Choose Add user.
  3. Provide required details such as the user name, email address, first name, and last name.
  4. Choose Next.
  5. Choose Add user.

You see a message confirming that the user has been added.
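The console steps above can also be approximated in code. The following is a sketch only, assuming the boto3 identitystore client and an identity store ID taken from your IAM Identity Center settings; the function names and the "work" email type are illustrative. It creates a group, creates a user, and adds the user to the group, which matches the group-based approach recommended earlier.

```python
def build_user_request(identity_store_id, user_name, given_name, family_name, email):
    """Build keyword arguments for the identitystore create_user call."""
    return {
        "IdentityStoreId": identity_store_id,
        "UserName": user_name,
        "DisplayName": f"{given_name} {family_name}",
        "Name": {"GivenName": given_name, "FamilyName": family_name},
        "Emails": [{"Value": email, "Type": "work", "Primary": True}],
    }


def create_user_in_group(client, identity_store_id, group_name, **user_fields):
    """Create a group and a user, then make the user a member of the group.

    client is expected to behave like boto3.client("identitystore").
    Returns the (group_id, user_id) pair reported by the service.
    """
    group_id = client.create_group(
        IdentityStoreId=identity_store_id, DisplayName=group_name
    )["GroupId"]
    user_id = client.create_user(
        **build_user_request(identity_store_id, **user_fields)
    )["UserId"]
    client.create_group_membership(
        IdentityStoreId=identity_store_id,
        GroupId=group_id,
        MemberId={"UserId": user_id},
    )
    return group_id, user_id
```

Users created through the API may not receive the invitation email described later in this post, so you may still need to trigger password setup from the console.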

Add users to the SageMaker Studio domain

You need to add this user to the SageMaker domain you created. If you’re using groups, then you add the group, not just a single user.

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain you created.
  3. Choose Assign users and groups.
  4. On the Users tab, select the user you created.
  5. Choose Assign users and groups.

Access the SageMaker Canvas application from IAM Identity Center

The user will receive an email with a link to set up a password and instructions to connect to the AWS access portal. The link will be valid for up to 7 days.

When the user receives the email, they must complete the following steps to gain access to SageMaker Canvas:

  1. Choose Accept invitation from the email.
  2. Set a new password to access SageMaker Canvas in the specified account and domain.

After authentication has been performed, the user has three options to log in to SageMaker Canvas:

  • Option 1 – Access from SageMaker Studio through the IAM Identity Center portal
  • Option 2 – Access from SageMaker Canvas through the IAM Identity Center portal, bypassing SageMaker Studio
  • Option 3 – Use the IAM Identity Center portal link in IAM Identity Center to access SageMaker Canvas

We go through each of these options in this section.

Option 1

In the first option, the user first accesses SageMaker Studio to access SageMaker Canvas. This option is appropriate for users that should be able to access all relevant applications from SageMaker Studio, including SageMaker Canvas.

  1. Navigate to the AWS access portal URL from your email.

  1. Log in with the credentials you set for the user.

You will see the application name you configured earlier.

  1. Choose the SageMaker Canvas application.

You’re redirected to SageMaker Studio.

  1. Choose Run Canvas.
  2. Choose Open Canvas.

You’re redirected to SageMaker Canvas.

Option 2

In this option, the user still goes through the IAM Identity Center portal, but bypasses SageMaker Studio to go directly into SageMaker Canvas. Use this option when access to SageMaker Studio is not needed, because the user’s SageMaker login will always take them directly to SageMaker Canvas.

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Note down the SageMaker domain ID.
  3. Open AWS CloudShell or any other CLI and run the following command, providing your domain ID. This command updates the default landing application for the SageMaker domain from SageMaker Studio to SageMaker Canvas:
    aws sagemaker update-domain --domain-id <SAGEMAKER DOMAIN ID> --default-user-settings '{"DefaultLandingUri":"app:Canvas:models","StudioWebPortal":"DISABLED"}'

If the command runs successfully, the response contains the Amazon Resource Name (ARN) of the updated domain.

  1. Navigate to the AWS access portal URL from your email.
  2. Log in with the credentials you set for the user.
  3. Choose the SageMaker Canvas application.

This time you’re redirected to SageMaker Canvas, bypassing SageMaker Studio.
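The update-domain step in this option can also be run from Python. The following sketch mirrors the CLI command shown earlier and assumes a boto3 SageMaker client; the function and constant names are illustrative.

```python
# Same settings as the CLI command: land on Canvas and hide the Studio portal.
CANVAS_LANDING_SETTINGS = {
    "DefaultLandingUri": "app:Canvas:models",
    "StudioWebPortal": "DISABLED",
}


def set_canvas_as_landing_app(client, domain_id):
    """Make SageMaker Canvas the default landing application for the domain.

    client is expected to behave like boto3.client("sagemaker").
    Returns the domain ARN reported by the update call.
    """
    response = client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings=CANVAS_LANDING_SETTINGS,
    )
    return response["DomainArn"]
```

For example, set_canvas_as_landing_app(boto3.client("sagemaker"), "<SAGEMAKER DOMAIN ID>") applies the change to the domain ID you noted in step 2.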

Option 3

If the default landing application for the SageMaker domain has been updated from SageMaker Studio to SageMaker Canvas in Option 2, a user can also use the IAM Identity Center portal link to access SageMaker Canvas. To do so, choose the AWS access portal URL shown in the identity source on the IAM Identity Center console. You can save this URL as a browser bookmark or integrate it with your custom application for direct SageMaker Canvas access.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

Conclusion

In this post, we discussed how users can securely access SageMaker Canvas using SSO. To do this, we configured IAM Identity Center and linked it to the SageMaker domain where SageMaker Canvas is used. Users are now one click away from using SageMaker Canvas and solving new challenges with no-code ML. This approach supports the secure environment requirements of cloud engineering and security teams, while allowing for the agility and independence of development teams.

To learn more about SageMaker Canvas, check out Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts. SageMaker Canvas also enables collaboration with data science teams. To learn more, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For IT administrators, we suggest checking out Setting up and managing Amazon SageMaker Canvas (for IT administrators).


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Dan Sinnreich is a Senior Product Manager at AWS, helping democratize ML with low-code/no-code innovations. Previous to AWS, Dan built and commercialized SaaS platforms and time series risk models used by institutional investors to manage risk and optimize investment portfolios. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.

Solar models from Upstage are now available in Amazon SageMaker JumpStart

This blog post is co-written with Hwalsuk Lee at Upstage.

Today, we’re excited to announce that the Solar foundation model developed by Upstage is now available for customers using Amazon SageMaker JumpStart. Solar is a large language model (LLM), 100% pre-trained with Amazon SageMaker, that uses its compact size and strong track record to specialize in purpose-training, making it versatile across languages, domains, and tasks.

You can now use the Solar Mini Chat and Solar Mini Chat – Quant pretrained models within SageMaker JumpStart. SageMaker JumpStart is the machine learning (ML) hub of SageMaker that provides access to foundation models in addition to built-in algorithms to help you quickly get started with ML.

In this post, we walk through how to discover and deploy the Solar model via SageMaker JumpStart.

What’s the Solar model?

Solar is a compact and powerful model for English and Korean languages. It’s specifically fine-tuned for multi-turn chat purposes, demonstrating enhanced performance across a wide range of natural language processing tasks.

The Solar Mini Chat model is based on Solar 10.7B, with a 32-layer Llama 2 structure, and initialized with pre-trained weights from Mistral 7B compatible with the Llama 2 architecture. This fine-tuning equips it with the ability to handle extended conversations more effectively, making it particularly adept for interactive applications. It employs a scaling method called depth up-scaling (DUS), which is comprised of depth-wise scaling and continued pretraining. DUS allows for a much more straightforward and efficient enlargement of smaller models than other scaling methods such as mixture of experts (MoE).

In December 2023, the Solar 10.7B model made waves by reaching the pinnacle of the Open LLM Leaderboard of Hugging Face. Using notably fewer parameters, Solar 10.7B delivers responses comparable to GPT-3.5, but is 2.5 times faster. Along with topping the Open LLM Leaderboard, Solar 10.7B outperforms GPT-4 with purpose-trained models on certain domains and tasks.

The following figure illustrates some of these metrics:

With SageMaker JumpStart, you can deploy Solar 10.7B based pre-trained models: Solar Mini Chat and a quantized version of Solar Mini Chat, optimized for chat applications in English and Korean. The Solar Mini Chat model provides an advanced grasp of Korean language nuances, which significantly elevates user interactions in chat environments. It provides precise responses to user inputs, ensuring clearer communication and more efficient problem resolution in English and Korean chat applications.

Get started with Solar models in SageMaker JumpStart

To get started with Solar models, you can use SageMaker JumpStart, a fully managed ML hub service to deploy pre-built ML models into a production-ready hosted environment. You can access Solar models through SageMaker JumpStart in Amazon SageMaker Studio, a web-based integrated development environment (IDE) where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models.

On the SageMaker Studio console, choose JumpStart in the navigation pane. You can enter “solar” in the search bar to find Upstage’s Solar models.

Figure - Search Solar model in Amazon SageMaker JumpStart

Let’s deploy the Solar Mini Chat – Quant model. Choose the model card to view details about the model such as the license, data used to train, and how to use the model. You will also find a Deploy option, which takes you to a landing page where you can test inference with an example payload.

Figure - How to deploy the Solar model in SageMaker JumpStart

This model requires an AWS Marketplace subscription. If you have already subscribed to this model, and have been approved to use the product, you can deploy the model directly.

Figure - How to subscribe to the Solar model in AWS Marketplace

If you have not subscribed to this model, choose Subscribe, go to AWS Marketplace, review the pricing terms and End User License Agreement (EULA), and choose Accept offer.

Figure - Accept offer of Solar model in AWS Marketplace

After you’re subscribed to the model, you can deploy your model to a SageMaker endpoint by selecting the deployment resources, such as instance type and initial instance count. Choose Deploy and wait for an endpoint to be created for model inference. You can select an ml.g5.2xlarge instance as a cheaper option for inference with the Solar model.

Figure - Deploy SageMaker Inference endpoint

When your SageMaker endpoint is successfully created, you can test it through the various SageMaker application environments.

Run your code for Solar models in SageMaker Studio JupyterLab

SageMaker Studio supports various application development environments, including JupyterLab, a set of capabilities that augment the fully managed notebook offering. It includes kernels that start in seconds, a preconfigured runtime with popular data science and ML frameworks, and high-performance private block storage. For more information, see SageMaker JupyterLab.

Create a JupyterLab space within SageMaker Studio that manages the storage and compute resources needed to run the JupyterLab application.

Figure - Create a JupyterLab in SageMaker Studio

You can find the code showing the deployment of Solar models on SageMaker JumpStart and an example of how to use the deployed model in the GitHub repo. You can now deploy the model using SageMaker JumpStart. The following code uses the default instance ml.g5.2xlarge for the Solar Mini Chat – Quant model inference endpoint.

Solar models support a request/response payload compatible with OpenAI’s Chat completion endpoint. You can test single-turn or multi-turn chat examples with Python.

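import json

import boto3
import sagemaker

# Create a SageMaker runtime client and set the endpoint name
# (model_name is defined in the deployment code; use the name the endpoint was deployed with)
sagemaker_runtime = boto3.client("sagemaker-runtime")
endpoint_name = sagemaker.utils.name_from_base(model_name)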

# Multi-turn chat prompt example
input = {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you provide a Python script to merge two sorted lists?"
      },
      {
        "role": "assistant",
        "content": """Sure, here is a Python script to merge two sorted lists:

                    ```python
                    def merge_lists(list1, list2):
                        return sorted(list1 + list2)
                    ```
                    """
      },
      {
        "role": "user",
        "content": "Can you provide an example of how to use this function?"
      }
    ]
}

# Get response from the model
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(input),
)
result = json.loads(response["Body"].read().decode())
print(result)

You have successfully performed real-time inference with the Solar Mini Chat model.
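To continue a multi-turn conversation, you need to pull the assistant's reply out of the response and append it to the message list. The following sketch assumes the response body follows OpenAI's chat completion layout, with the reply in choices[0].message.content. That layout is an assumption based on the stated compatibility, so check the actual response from your endpoint. Both helper names are illustrative.

```python
import json


def extract_reply(raw_body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    result = json.loads(raw_body)
    # Assumption: the reply sits in choices[0].message.content.
    return result["choices"][0]["message"]["content"]


def append_turn(messages, reply, next_question):
    """Return a new message list extended with the reply and a follow-up question."""
    return messages + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": next_question},
    ]
```

You can pass the decoded body of the invoke_endpoint response to extract_reply, then send the list returned by append_turn as the "messages" value of the next request.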

Clean up

After you have tested the endpoint, delete the SageMaker inference endpoint and delete the model to avoid incurring charges.

Figure - Delete the endpoint of SageMaker

You can also run the following code to delete the endpoint and model in a SageMaker Studio JupyterLab notebook:

# Delete the endpoint 
model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

# Delete the model
model.delete_model()

For more information, see Delete Endpoints and Resources. Additionally, you can shut down the SageMaker Studio resources that are no longer required.

Conclusion

In this post, we showed you how to get started with Upstage’s Solar models in SageMaker Studio and deploy the model for inference. We also showed you how you can run your Python sample code on SageMaker Studio JupyterLab.

Because Solar models are already pre-trained, they can help lower training and infrastructure costs and enable customization for your generative AI applications.

Try it out on the SageMaker JumpStart console or SageMaker Studio console! You can also watch the following video, Try ‘Solar’ with Amazon SageMaker.

This guidance is for informational purposes only. You should still perform your own independent assessment, and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses, and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance, and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.


About the Authors

Channy Yun is a Principal Developer Advocate at AWS, and is passionate about helping developers build modern applications on the latest AWS services. He is a pragmatic developer and blogger at heart, and he loves community-driven learning and sharing of technology.

Hwalsuk Lee is Chief Technology Officer (CTO) at Upstage. He has worked for Samsung Techwin, NCSOFT, and Naver as an AI Researcher. He is pursuing his PhD in Computer and Electrical Engineering at the Korea Advanced Institute of Science and Technology (KAIST).

Brandon Lee is a Senior Solutions Architect at AWS, and primarily helps large educational technology customers in the Public Sector. He has over 20 years of experience leading application development at global companies and large corporations.

Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Part 2

This is a guest post co-written with Meta’s PyTorch team and is a continuation of Part 1 of this series, where we demonstrate the performance and ease of running PyTorch 2.0 on AWS.

Machine learning (ML) research has proven that large language models (LLMs) trained with significantly large datasets result in better model quality. In the last few years, the size of current generation models has increased significantly, and they require modern tools and infrastructure to be trained efficiently and at scale. PyTorch Distributed Data Parallelism (DDP) helps process data at scale in a simple and robust manner, but it requires the model to fit on one GPU. The PyTorch Fully Sharded Data Parallel (FSDP) library breaks this barrier by enabling model sharding to train large models across data parallel workers.

Distributed model training requires a cluster of worker nodes that can scale. Amazon Elastic Kubernetes Service (Amazon EKS) is a popular Kubernetes-conformant service that greatly simplifies the process of running AI/ML workloads, making it more manageable and less time-consuming.

In this blog post, AWS collaborates with Meta’s PyTorch team to discuss how to use the PyTorch FSDP library to achieve linear scaling of deep learning models on AWS seamlessly using Amazon EKS and AWS Deep Learning Containers (DLCs). We demonstrate this through a step-by-step implementation of training 7B, 13B, and 70B Llama2 models using Amazon EKS with 16 Amazon Elastic Compute Cloud (Amazon EC2) p4de.24xlarge instances (each with 8 NVIDIA A100 Tensor Core GPUs and each GPU with 80 GB HBM2e memory) or 16 EC2 p5.48xlarge instances (each with 8 NVIDIA H100 Tensor Core GPUs and each GPU with 80 GB HBM3 memory), achieving near linear scaling in throughput and ultimately enabling faster training time.

The following scaling chart shows that the p5.48xlarge instances offer 87% scaling efficiency with FSDP Llama2 fine-tuning in a 16-node cluster configuration.
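Scaling efficiency is commonly computed as measured throughput divided by the throughput that perfect linear scaling would give. The following is a small sketch under that assumption. The post does not state its exact formula, and the throughput numbers in the example are made up purely to illustrate the arithmetic.

```python
def scaling_efficiency(throughput_n, throughput_base, nodes_n, nodes_base=1):
    """Ratio of measured throughput to ideal linear-scaling throughput."""
    ideal = throughput_base * nodes_n / nodes_base
    return throughput_n / ideal


# Hypothetical numbers: 100 samples/s on 1 node, 1,392 samples/s on 16 nodes.
efficiency = scaling_efficiency(1392.0, 100.0, nodes_n=16)
print(f"{efficiency:.0%}")
```

With these illustrative inputs, the ideal 16-node throughput is 1,600 samples/s, so the measured value corresponds to 87% efficiency.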

Challenges of training LLMs

Businesses are increasingly adopting LLMs for a range of tasks, including virtual assistants, translation, content creation, and computer vision, to enhance the efficiency and accuracy in a variety of applications.

However, training or fine-tuning these large models for a custom use case requires a large amount of data and compute power, which adds to the overall engineering complexity of the ML stack. This is also due to limited memory available on a single GPU, which restricts the size of the model that can be trained, and also limits the per-GPU batch size used during training.

To address this challenge, various model parallelism techniques such as DeepSpeed ZeRO and PyTorch FSDP were created to allow you to overcome this barrier of limited GPU memory. This is done by adopting a sharded data parallel technique, where each accelerator holds just a slice (a shard) of a model replica instead of the entire model replica, which dramatically reduces the memory footprint of the training job.

This post demonstrates how you can use PyTorch FSDP to fine-tune the Llama2 model using Amazon EKS. We achieve this by scaling out compute and GPU capacity to address the model requirements.

FSDP overview

In PyTorch DDP training, each GPU (referred to as a worker in the context of PyTorch) holds a complete copy of the model, including the model weights, gradients, and optimizer states. Each worker processes a batch of data and, at the end of the backward pass, uses an all-reduce operation to synchronize gradients across different workers.

Having a replica of the model on each GPU restricts the size of the model that can be accommodated in a DDP workflow. FSDP helps overcome this limitation by sharding model parameters, optimizer states, and gradients across data parallel workers while still preserving the simplicity of data parallelism.

This is demonstrated in the following diagram, where in the case of DDP, each GPU holds a complete copy of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M(OS + G + P). In FSDP, each GPU holds only a slice of the model state, including the optimizer state (OS), gradients (G), and parameters (P): M<partition number>(OS + G + P). Using FSDP results in a significantly smaller GPU memory footprint compared to DDP across all workers, enabling the training of very large models or using larger batch sizes for training jobs.
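A rough calculation makes the difference concrete. The sketch below assumes mixed-precision training with the Adam optimizer at about 16 bytes of model state per parameter (2 for parameters, 2 for gradients, 12 for optimizer states). That figure is an assumption, not a number from this post, and activations and temporary buffers are ignored.

```python
# Assumed bytes of model state per parameter: 2 (P) + 2 (G) + 12 (OS).
BYTES_PER_PARAM = 16


def model_state_gb(num_params, num_gpus=1, sharded=False):
    """Approximate model state held by each GPU, in GB (10^9 bytes)."""
    total_bytes = num_params * BYTES_PER_PARAM
    per_gpu_bytes = total_bytes / num_gpus if sharded else total_bytes
    return per_gpu_bytes / 1e9


# A 7-billion-parameter model on 16 GPUs.
ddp_gb = model_state_gb(7e9, num_gpus=16, sharded=False)  # full copy on every GPU
fsdp_gb = model_state_gb(7e9, num_gpus=16, sharded=True)  # one shard per GPU
print(f"DDP: {ddp_gb:.0f} GB per GPU, FSDP: {fsdp_gb:.0f} GB per GPU")
```

Under these assumptions, DDP would need about 112 GB of model state per GPU, more than the 80 GB available on the GPUs used in this post, while full sharding across 16 GPUs brings it down to about 7 GB per GPU.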

This, however, comes at the cost of increased communication overhead, which is mitigated through FSDP optimizations such as overlapping communication and computation processes with features like pre-fetching. For more detailed information, refer to Getting Started with Fully Sharded Data Parallel (FSDP).

FSDP offers various parameters that allow you to tune the performance and efficiency of your training jobs. Some of the key features and capabilities of FSDP include:

  • Transformer wrapping policy
  • Flexible mixed precision
  • Activation checkpointing
  • Various sharding strategies to suit different network speeds and cluster topologies:
    • FULL_SHARD – Shard model parameters, gradients, and optimizer states
    • HYBRID_SHARD – Full shard within a node, DDP across nodes; supports a flexible sharding group for a full replica of the model (HSDP)
    • SHARD_GRAD_OP – Shard only gradients and optimizer states
    • NO_SHARD – Similar to DDP

For more information about FSDP, refer to Efficient Large-Scale Training with Pytorch FSDP and AWS.
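To compare the sharding strategies listed above, the following sketch estimates per-GPU model state for each one. It follows the strategy descriptions in this post literally and assumes a per-parameter cost of 2 bytes for parameters, 2 for gradients, and 12 for optimizer states. It is a back-of-the-envelope model rather than a description of how PyTorch allocates memory, and all names are illustrative.

```python
# Assumed bytes per parameter for each piece of model state.
STATE_BYTES = {"P": 2, "G": 2, "OS": 12}

# Which pieces each strategy shards, following the list above.
SHARDED_STATE = {
    "FULL_SHARD": {"P", "G", "OS"},
    "HYBRID_SHARD": {"P", "G", "OS"},  # sharded only within a node
    "SHARD_GRAD_OP": {"G", "OS"},
    "NO_SHARD": set(),
}


def per_gpu_state_gb(strategy, num_params, num_nodes, gpus_per_node):
    """Approximate model state per GPU in GB for a given sharding strategy."""
    world_size = num_nodes * gpus_per_node
    # HYBRID_SHARD shards inside a node and replicates across nodes.
    shard_group = gpus_per_node if strategy == "HYBRID_SHARD" else world_size
    total_bytes = 0.0
    for name, nbytes in STATE_BYTES.items():
        divisor = shard_group if name in SHARDED_STATE[strategy] else 1
        total_bytes += num_params * nbytes / divisor
    return total_bytes / 1e9


# A 13-billion-parameter model on 2 nodes with 8 GPUs each.
for strategy in SHARDED_STATE:
    print(strategy, round(per_gpu_state_gb(strategy, 13e9, 2, 8), 3))
```

The estimate shows the trade-off the list describes: the more state a strategy shards, and the larger the group it shards across, the less memory each GPU holds, at the cost of more communication.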

The following figure shows how FSDP works for two data parallel processes.

Solution overview

In this post, we set up a compute cluster using Amazon EKS, which is a managed service to run Kubernetes in the AWS Cloud and on-premises data centers. Many customers are embracing Amazon EKS to run Kubernetes-based AI/ML workloads, taking advantage of its performance, scalability, reliability, and availability, as well as its integrations with AWS networking, security and other services.

For our FSDP use case, we use the Kubeflow Training Operator on Amazon EKS, which is a Kubernetes-native project that facilitates fine-tuning and scalable distributed training for ML models. It supports various ML frameworks, including PyTorch, which you can use to deploy and manage PyTorch training jobs at scale.

Using the PyTorchJob custom resource of the Kubeflow Training Operator, we run training jobs on Kubernetes with a configurable number of worker replicas, which allows us to optimize resource utilization.

The following are a few components of the training operator that play a role in our Llama2 fine-tuning use case:

  • A centralized Kubernetes controller that orchestrates distributed training jobs for PyTorch.
  • PyTorchJob, a Kubernetes custom resource for PyTorch, provided by the Kubeflow Training Operator, to define and deploy Llama2 training jobs on Kubernetes.
  • etcd, which is related to the implementation of the rendezvous mechanism for coordinating the distributed training of PyTorch models. This etcd server, as part of the rendezvous process, facilitates the coordination and synchronization of the participating workers during distributed training.

The following diagram illustrates the solution architecture.

Most of the details will be abstracted by the automation scripts that we use to run the Llama2 example.

We use the following code references in this use case:

What is Llama2?

Llama2 is an LLM pre-trained on 2 trillion tokens of text and code. It is one of the largest and most powerful LLMs available today. You can use Llama2 for a variety of tasks, including natural language processing (NLP), text generation, and translation. For more information, refer to Getting started with Llama.

Llama2 is available in three different model sizes:

  • Llama2-70b – This is the largest Llama2 model, with 70 billion parameters. It is the most powerful Llama2 model and can be used for the most demanding tasks.
  • Llama2-13b – This is a medium-sized Llama2 model, with 13 billion parameters. It is a good balance between performance and efficiency, and can be used for a variety of tasks.
  • Llama2-7b – This is the smallest Llama2 model, with 7 billion parameters. It is the most efficient Llama2 model, and can be used for tasks that don’t require the highest level of performance.

This post enables you to fine-tune all of these models on Amazon EKS. To provide a simple and reproducible experience of creating an EKS cluster and running FSDP jobs on it, we use the aws-do-eks project. The example will also work with a pre-existing EKS cluster.

A scripted walkthrough is available on GitHub for an out-of-the-box experience. In the following sections, we explain the end-to-end process in more detail.

Provision the solution infrastructure

For the experiments described in this post, we use clusters with p4de (A100 GPU) and p5 (H100 GPU) nodes.

Cluster with p4de.24xlarge nodes

For our cluster with p4de nodes, we use the following eks-gpu-p4de-odcr.yaml script:

export ODCR_ID=<your-capacityreservation-id>

cat > ./eks-gpu-p4de-odcr.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-p4de-odcr
  version: "1.28"
  region: us-east-1
  tags:
    karpenter.sh/discovery: do-eks-yaml-p4de-odcr
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
managedNodeGroups:
  - name: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
nodeGroups:
  - name: p4de-odcr
    instanceType: p4de.24xlarge
    instancePrefix: p4de-odcr
    privateNetworking: true
    availabilityZones:
      - us-east-1c
    efaEnabled: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 64
    volumeSize: 500
    capacityReservation:
      capacityReservationTarget:
        capacityReservationID: $ODCR_ID
    iam:
      withAddonPolicies:
        cloudWatch: true
        ebs: true
        fsx: true
iam:
  withOIDC: true
EOF

Using eksctl and the preceding cluster manifest, we create a cluster with p4de nodes:

eksctl create cluster -f ./eks-gpu-p4de-odcr.yaml

Cluster with p5.48xlarge nodes

A Terraform template for an EKS cluster with P5 nodes is located in the following GitHub repo.

You can customize the cluster via the variables.tf file and then create it via the Terraform CLI:

terraform init && terraform plan -out tfplan && terraform apply tfplan

You can verify the cluster availability by running a simple kubectl command:

kubectl get nodes

The cluster is healthy if the output of this command shows the expected number of nodes in Ready status.
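If you want to automate this check, the following sketch counts nodes in Ready status from the JSON output of kubectl get nodes -o json. It assumes kubectl is on your PATH and configured for the cluster; the function names and the expected node count are illustrative.

```python
import json
import subprocess


def count_ready_nodes(nodes_json):
    """Count nodes whose Ready condition has status True."""
    ready = 0
    for node in json.loads(nodes_json)["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        if any(c["type"] == "Ready" and c["status"] == "True" for c in conditions):
            ready += 1
    return ready


def cluster_is_healthy(nodes_json, expected_nodes):
    """Return True when at least the expected number of nodes are Ready."""
    return count_ready_nodes(nodes_json) >= expected_nodes


def fetch_nodes_json():
    """Run kubectl and return its JSON output as a string."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

For the p4de manifest shown earlier, which requests one sys node and two p4de nodes, you could call cluster_is_healthy(fetch_nodes_json(), expected_nodes=3).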

Deploy prerequisites

To run FSDP on Amazon EKS, we use the PyTorchJob custom resource. It requires etcd and Kubeflow Training Operator as prerequisites.

Deploy etcd with the following code:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/etcd/etcd-deployment.yaml

Deploy Kubeflow Training Operator with the following code:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

Build and push an FSDP container image to Amazon ECR

Use the following code to build an FSDP container image and push it to Amazon Elastic Container Registry (Amazon ECR):

# Download Dockerfile
curl -L -o ./Dockerfile.llama2-efa https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/distributed-training/pytorch/pytorchjob/fsdp/Dockerfile.llama2-efa

# Build Image
AWS_REGION=$(aws configure get region)
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGISTRY=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
IMAGE=fsdp
TAG=":llama2-efa"

docker build --progress=plain -t ${REGISTRY}${IMAGE}${TAG} -f ./Dockerfile.llama2-efa .

# Log in to ECR, create repository, push image
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
aws ecr create-repository --repository-name ${IMAGE}
docker image push ${REGISTRY}${IMAGE}${TAG}

Create the FSDP PyTorchJob manifest

Insert your Hugging Face token in the following snippet prior to running it:

HF_TOKEN="<insert_your_huggingface_token_here>"

Configure your PyTorchJob with a .env file or directly through environment variables, as follows:

JOB_NAME=fsdp
RDZV_HOST=etcd
RDZV_PORT=2379
NUM_WORKERS=2
INSTANCE_TYPE=p5.48xlarge
GPU_PER_WORKER=8
EFA_PER_WORKER=32
MODEL_NAME=meta-llama/Llama-2-7b-hf

CMD="huggingface-cli login --token ${HF_TOKEN} && torchrun --nproc_per_node=${GPU_PER_WORKER} --nnodes=${NUM_WORKERS} examples/finetuning.py --num_epochs=5 --batch_size_training=3 --enable_fsdp --model_name $MODEL_NAME --output_dir ."

Generate the PyTorchJob manifest using the fsdp template and generate.sh script or create it directly using the script below:

cat > ./fsdp.yaml <<EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: $JOB_NAME
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: $RDZV_HOST
    rdzvPort: $RDZV_PORT
    minReplicas: 1
    maxReplicas: 64
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 90
  pytorchReplicaSpecs:
    Worker:
      replicas: $NUM_WORKERS
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: $JOB_NAME
        spec:
          volumes:
            - name: shmem
              hostPath:
                path: /dev/shm
          nodeSelector:
            node.kubernetes.io/instance-type: '${INSTANCE_TYPE}'
          containers:
            - name: pytorch
              image: '${REGISTRY}${IMAGE}${TAG}'
              imagePullPolicy: Always
              resources:
                requests:
                  nvidia.com/gpu: $GPU_PER_WORKER
                  vpc.amazonaws.com/efa: $EFA_PER_WORKER
                limits:
                  nvidia.com/gpu: $GPU_PER_WORKER
                  vpc.amazonaws.com/efa: $EFA_PER_WORKER
              env:
                - name: LOGLEVEL
                  value: DEBUG
                - name: NCCL_DEBUG
                  value: INFO
                - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  value: '1'
              command:
                - bash
                - '-c'
                - '${CMD}'
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
EOF

Run the PyTorchJob

Run the PyTorchJob with the following code:

kubectl apply -f ./fsdp.yaml

You will see the specified number of FSDP worker pods created and, after pulling the image, they will enter the Running state.

To see the status of the PyTorchJob, use the following code:

kubectl describe -f ./fsdp.yaml

To stop the PyTorchJob, use the following code:

kubectl delete -f ./fsdp.yaml

After a job is complete, it needs to be deleted before initiating a new run. We’ve also observed that deleting the etcd pod and letting it restart prior to launching a new job helps avoid a RendezvousClosedError.

Scale the cluster

You can repeat the preceding steps of creating and running jobs while varying the number and instance type of worker nodes in the cluster. This enables you to produce scaling charts like the one shown earlier. In general, you should see a reduction in GPU memory footprint, reduction in epoch time, and increase in throughput when more nodes are added to the cluster. The previous chart was produced by conducting several experiments using a p5 node group varying from 1–16 nodes in size.
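
To compare runs of different sizes, it can help to express each measurement as a fraction of ideal linear scaling. The following is a minimal Python sketch of that arithmetic, assuming you have recorded one throughput number per cluster size. The function name scaling_efficiency and the sample values are hypothetical placeholders, not results from this post.

```python
from typing import Dict


def scaling_efficiency(throughput_by_nodes: Dict[int, float]) -> Dict[int, float]:
    """Return measured throughput divided by ideal linear throughput.

    The smallest cluster size in the input is used as the baseline.
    """
    base_nodes = min(throughput_by_nodes)
    base_per_node = throughput_by_nodes[base_nodes] / base_nodes
    return {
        nodes: throughput / (base_per_node * nodes)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Placeholder measurements (node count -> throughput), for illustration only.
measurements = {1: 100.0, 2: 190.0, 4: 360.0}
efficiency = scaling_efficiency(measurements)

for nodes, value in efficiency.items():
    print(f"{nodes} nodes: {value:.0%} of linear scaling")
```

A value close to 100% for every cluster size corresponds to the near linear scaling discussed earlier.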

Observe the FSDP training workload

Observability of generative artificial intelligence workloads is important to allow visibility into your running jobs as well as aid in maximizing the utilization of your compute resources. In this post, we use a few Kubernetes-native and open source observability tools for this purpose. These tools enable you to track errors, statistics, and model behavior, making AI observability a crucial part of any business use case. In this section, we show various approaches for observing FSDP training jobs.

Worker pod logs

At the most basic level, you need to be able to see the logs of your training pods. This can easily be done by using Kubernetes-native commands.

First, retrieve a list of pods and locate the name of the one that you want to see logs for:

kubectl get pods

Then view the logs for the selected pod:

kubectl logs -f <pod_name>

Only one worker (elected leader) pod log will list the overall job statistics. The name of the elected leader pod is available at the beginning of each worker pod log, identified by the key master_addr=.
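
If you save a worker pod log to a file or pipe it into a script, the leader lookup can be automated. The following Python sketch assumes only that the log contains the key master_addr= followed by a value without spaces. The helper name find_master_addr and the sample log lines are hypothetical, and the real log format may differ.

```python
import re
from typing import Optional

# Only the key name "master_addr=" comes from the text above.
# The surrounding log format is an assumption.
MASTER_ADDR_PATTERN = re.compile(r"master_addr=([^\s,]+)")


def find_master_addr(log_text: str) -> Optional[str]:
    """Return the first master_addr value found in a worker pod log, if any."""
    match = MASTER_ADDR_PATTERN.search(log_text)
    return match.group(1) if match else None


# Made-up sample log lines, for illustration only.
sample_log = """\
INFO starting elastic agent
  master_addr=fsdp-worker-0
  master_port=23456
"""

leader = find_master_addr(sample_log)
print(leader)
```

You could then pass the returned name to kubectl logs to follow the pod that reports the overall job statistics.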

CPU utilization

Distributed training workloads require both CPU and GPU resources. To optimize these workloads, it’s important to understand how these resources are utilized. Fortunately, some great open source utilities are available that help visualize CPU and GPU utilization. For viewing CPU utilization, you can use htop. If your worker pods contain this utility, you can use the following command to open a shell into a pod and then run htop.

kubectl exec -it <pod_name> -- bash

Alternatively, you can deploy an htop daemonset like the one provided in the following GitHub repo.

The daemonset will run a lightweight htop pod on each node. You can exec into any of these pods and run the htop command:

kubectl exec -it <htop_pod_name> -- htop

The following screenshot shows the CPU utilization on one of the nodes in the cluster. In this case, we are looking at a P5.48xlarge instance, which has 192 vCPUs. The processor cores are idle while the model weights are downloaded, and we see rising utilization while the model weights are being loaded to GPU memory.

GPU utilization

If the nvtop utility is available in your pod, you can exec into the pod using the following command and then run nvtop.

kubectl exec -it <pod_name> -- bash

Alternatively, you can deploy an nvtop daemonset like the one provided in the following GitHub repo.

This will run an nvtop pod on each node. You can exec into any of those pods and run nvtop:

kubectl exec -it <nvtop_pod_name> -- nvtop

The following screenshot shows the GPU utilization on one of the nodes in the training cluster. In this case, we are looking at a P5.48xlarge instance, which has 8 NVIDIA H100 GPUs. The GPUs are idle while the model weights are downloaded, then GPU memory utilization increases as the model weights are loaded onto the GPU, and GPU utilization spikes to 100% while the training iterations are underway.

Grafana dashboard

Now that you understand how your system works on the pod and node level, it’s also important to look at metrics at the cluster level. Aggregated utilization metrics can be collected by NVIDIA DCGM Exporter and Prometheus and visualized in Grafana.

An example Prometheus-Grafana deployment is available in the following GitHub repo.

An example DCGM exporter deployment is available in the following GitHub repo.

A simple Grafana dashboard is shown in the following screenshot. It was built by selecting the following DCGM metrics: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_MEM_COPY_UTIL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_GPU_TEMP, and DCGM_FI_DEV_POWER_USAGE. The dashboard can be imported into Grafana from GitHub.

The following dashboard shows one run of a Llama2 7b single epoch training job. The graphs show that as the streaming multiprocessor (SM) clock increases, the power draw and temperature of the GPUs increase as well, together with GPU and memory utilization. You can also see that there were no XID errors and the GPUs were healthy during this run.
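
Outside of a dashboard, you can also summarize the same metrics from a raw scrape. The following Python sketch assumes the exporter output is available as text in the Prometheus exposition format and averages each metric across GPUs. The function name average_by_metric and the sample values are hypothetical and are not measurements from the run described above.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches "metric_name{optional labels} value" lines of the text exposition format.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)'
)


def average_by_metric(exposition_text: str) -> Dict[str, float]:
    """Average the sample values of each metric across all label sets."""
    values: Dict[str, List[float]] = defaultdict(list)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            values[match.group("name")].append(float(match.group("value")))
    return {name: sum(samples) / len(samples) for name, samples in values.items()}


# Made-up scrape output, for illustration only.
sample_scrape = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 98
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 100
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 60
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 64
"""

averages = average_by_metric(sample_scrape)
print(averages)
```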

Since March 2024, GPU observability for Amazon EKS is supported natively in CloudWatch Container Insights. To enable this functionality, deploy the CloudWatch Observability Add-on in your EKS cluster. You can then browse pod-, node-, and cluster-level metrics through preconfigured and customizable dashboards in Container Insights.

Clean up

If you created your cluster using the examples provided in this blog, you can execute the following code to delete the cluster and any resources associated with it, including the VPC:

For eksctl:

eksctl delete cluster -f ./eks-gpu-p4de-odcr.yaml

For Terraform:

terraform destroy

Upcoming features

FSDP is expected to include a per-parameter sharding feature, aiming to further improve its memory footprint per GPU. Additionally, the ongoing development of FP8 support aims to improve FSDP performance on H100 GPUs. Finally, when FSDP is integrated with torch.compile, we hope to see additional performance improvements and enablement of features like selective activation checkpointing.

Conclusion

In this post, we discussed how FSDP reduces the memory footprint on each GPU, enabling the training of larger models more efficiently and achieving near linear scaling in throughput. We demonstrated this through a step-by-step implementation of training a Llama2 model using Amazon EKS on P4de and P5 instances and used observability tools like kubectl, htop, nvtop, and dcgm to monitor logs, as well as CPU and GPU utilization.

We encourage you to take advantage of PyTorch FSDP for your own LLM training jobs. Get started at aws-do-fsdp.


About the Authors

Kanwaljit Khurmi is a Principal AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their machine learning solutions on AWS. Kanwaljit specializes in helping customers with containerized, distributed computing and deep learning applications.

Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He’s a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges.

Ana Simoes is a Principal Machine Learning Specialist, ML Frameworks at AWS. She supports customers deploying AI, ML, and generative AI at a large scale on HPC infrastructure in the cloud. Ana focuses on supporting customers to achieve price-performance for new workloads and use cases for generative AI and machine learning.

Hamid Shojanazeri is a Partner Engineer at PyTorch working on open source, high-performance model optimization, distributed training (FSDP), and inference. He is the co-creator of llama-recipe and contributor to TorchServe. His main interest is to improve cost-efficiency, making AI more accessible to the broader community.

Less Wright is an AI/Partner Engineer in PyTorch. He works on Triton/CUDA kernels (Accelerating Dequant with SplitK work decomposition); paged, streaming, and quantized optimizers; and PyTorch Distributed (PyTorch FSDP).

Provide live agent assistance for your chatbot users with Amazon Lex and Talkdesk cloud contact center

Amazon Lex provides advanced conversational artificial intelligence (AI) capabilities to enable self-service support for your organization’s contact center. With Amazon Lex, you can implement an omnichannel strategy where customers engage via phone, websites, and messaging platforms. The bots can answer FAQs, provide self-service experiences, or triage customer requests before transferring to a human agent. Amazon Lex integrates with state-of-the-art contact centers including Amazon Connect, Genesys Cloud, and Amazon Chime SDK to facilitate a seamless omnichannel experience.

This is the second post of a two-part series. The integration of Amazon Lex with Talkdesk cloud contact center is inspired by WaFd Bank (WaFd)’s digital innovation journey to enhance customer experience. In our previous post, we described how Amazon Lex integrates with the Talkdesk cloud contact center for the voice channel. In this post, we are focusing on the chat channel to show how to use Amazon Lex and the Amazon Lex Web UI to enable live agents to interact with your customers in real time. For example, the following figure shows screenshots of a chatbot transitioning a customer to a live agent chat (courtesy of WaFd Bank).

Solution overview

The following diagram illustrates the solution architecture.

In the preceding architecture, the following sequence of steps takes place in a live customer/agent conversation:

  1. Using the Amazon Lex Web UI, a customer asks to be connected to an agent. The associated Amazon Lex chatbot is configured with an escalation intent to process the incoming agent assistance request.
  2. The Amazon Lex fulfillment AWS Lambda function retrieves the Talkdesk touchpoint ID and Talkdesk OAuth secrets from AWS Secrets Manager and initiates a request to Talkdesk Digital Connect using the Start a Conversation API. In the payload, the function includes information that may be useful to an agent, such as the customer sentiment or the history of previously traversed intents.
  3. If the request to the Talkdesk API is successful, a Talkdesk conversation ID is returned to Amazon Lex.
  4. The Amazon Lex fulfillment Lambda function stores the conversation ID in Amazon Lex session attributes, thus making the conversation ID accessible to the Amazon Lex Web UI.
  5. The Amazon Lex Web UI opens a communication session with agents on the Talkdesk contact center through a WebSocket API in Amazon API Gateway.
  6. The Lambda function associated with the WebSocket API first stores the Talkdesk conversation ID to WebSocket client ID mappings in Amazon DynamoDB. Then, through the Talkdesk Send a Message API, the Lambda function sends the customer’s message to the agent on the Talkdesk contact center.
  7. Your agent responds to the customer with a message sent through the callback REST API in API Gateway. The payload includes the conversation ID of the active conversation.
  8. The callback REST API is configured to support the agent’s incoming messages as well as the agent’s closing of the conversation. To send the agent’s message to the customer, the supporting Lambda function reads the WebSocket client ID associated with the conversation ID from the DynamoDB table. This makes sure the agent’s message is delivered to the appropriate WebSocket client ID.
  9. The agent’s response is displayed through the Amazon Lex Web UI and the customer responds or closes the chat as appropriate. Steps 6–9 are repeated as long as the conversation remains active. If the agent ends the conversation, the customer is notified and the WebSocket connection is closed.
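
The routing described in steps 6 through 9 depends on a mapping from conversation ID to WebSocket client ID. The following Python sketch illustrates that lookup with an in-memory dictionary standing in for the DynamoDB table. The class ConversationRouter, its method names, and the sample IDs are hypothetical. This is not the code deployed by the solution, and it makes no calls to AWS or Talkdesk.

```python
from typing import Dict, List, Optional, Tuple


class ConversationRouter:
    """In-memory stand-in for the conversation ID to client ID table."""

    def __init__(self) -> None:
        self._client_by_conversation: Dict[str, str] = {}
        # Records (client_id, message) pairs in place of real WebSocket sends.
        self.delivered: List[Tuple[str, str]] = []

    def register(self, conversation_id: str, client_id: str) -> None:
        # Step 6: remember which WebSocket client owns the conversation.
        self._client_by_conversation[conversation_id] = client_id

    def deliver_agent_message(self, conversation_id: str, message: str) -> Optional[str]:
        # Step 8: resolve the client from the conversation ID in the callback payload.
        client_id = self._client_by_conversation.get(conversation_id)
        if client_id is None:
            return None  # unknown or already closed conversation
        self.delivered.append((client_id, message))
        return client_id

    def close(self, conversation_id: str) -> None:
        # Step 9: drop the mapping when the agent ends the conversation.
        self._client_by_conversation.pop(conversation_id, None)


router = ConversationRouter()
router.register("conv-123", "ws-client-abc")
print(router.deliver_agent_message("conv-123", "Hello, how can I help?"))
router.close("conv-123")
print(router.deliver_agent_message("conv-123", "Are you still there?"))
```

Keying the lookup on the conversation ID lets the callback handler remain stateless, because everything it needs arrives in the payload.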

In the following sections, we walk you through the steps to build the solution architecture. Dependencies among each step are cross-referenced.

Prerequisites

To implement the solution presented in this post, you should first familiarize yourself with the following AWS services and features:

Additionally, you should be familiar with the following Talkdesk services:

Prepare your Talkdesk instance for the Amazon Lex Web UI chat with an agent

This section outlines the basic steps required to configure the Talkdesk chat with agent experience using the Talkdesk Digital Connect channel. Review Talkdesk APIs for further details for any additional tasks that may be required as part of your specific implementation.

Complete the following steps:

  1. Enable Talkdesk Digital Connect on your Talkdesk instance.
  2. Configure your agents’ accounts and assign them to the agents’ queues.
  3. Build a Talkdesk Studio flow.

This will be used to send chat users to an inbox for agents to assign. A sample is provided with this solution.

  4. To create an integration for your Amazon Lex Web UI instance, in the Talkdesk Builder navigation pane, select Integrations.
  5. On the Actions tab, configure three actions using the input and output schemas provided through the following links:

  6. Create a Talkdesk Digital Connect Touchpoint.
  7. Name the Touchpoint Lex Web UI Chat and record the Touchpoint ID.

This will be stored in Secrets Manager as dev/talkdesk/touchpoint/ids.

  8. In Talkdesk Builder, choose OAuth Clients in the navigation pane to set up OAuth credentials.
  9. Select Grant type for Client credentials and set Scope to digital-connect:write.
  10. Record the client ID and secret key from the Keys tab.

These will be stored in Secrets Manager as dev/talkdesk/client/keys and used to authenticate and communicate with the Talkdesk API.

  11. In your AWS account, store the two secrets in Secrets Manager.

The following screenshot shows the details of the Touchpoint ID as a Secrets Manager secret.

The following screenshot shows the details of the client ID as a Secrets Manager secret.

Deploy the Talkdesk Amazon Lex CloudFormation template

The following AWS CloudFormation template creates all the resources of the solution architecture. This includes all necessary IAM roles to invoke API operations, run associated Lambda functions, access secrets on Secrets Manager, and store and retrieve conversation ID and WebSocket client ID pairs from DynamoDB.

To facilitate monitoring and debugging, a CloudWatch log group is created for each of the resources.

The CloudFormation template provides additional details for each of the resources.

Complete the following steps to deploy the template:

  1. Sign in to the AWS Management Console.
  2. Choose Launch Stack for your AWS Region to begin the CloudFormation stack creation process.
    US East (N. Virginia)
    US West (Oregon)
    Asia Pacific (Singapore)
    Asia Pacific (Sydney)
    Asia Pacific (Tokyo)
    Europe (Frankfurt)
    Europe (Ireland)
    Europe (London)
  3. For Stack name, enter a name.
  4. For TDAUTHHOST, enter the URL of your Talkdesk instance.
  5. Leave the other parameters as default and choose Next.
  6. Select the acknowledgement check boxes and choose Create stack.
  7. After the CloudFormation stack creation is complete, record the values for the following keys on the Outputs tab to use in later steps:
    • APIGatewayApiKey
    • BotAliasId
    • BotId
    • CallbackRestAPI
    • WebSocketAPIEndpoint

Update the Talkdesk instance

Log in to your Talkdesk instance and complete the following steps to update your instance:

  1. In Talkdesk Builder, select Integrations in the navigation pane.
  2. On the Settings tab, locate Base path and enter the callback REST API URL you recorded earlier.
  3. Under Other settings, set x-api-key to the value of the API Gateway key.

Deploy the Amazon Lex Web UI

The solution outlined in this post uses the Amazon Lex Web UI, a full-featured web client to deploy your Amazon Lex chatbot on your website. With the Amazon Lex Web UI, you can quickly bring your chatbot-powered application to life while minimizing time-to-value.

  1. Choose Launch Stack for the Region in which you will use your chatbot:
    US East (N. Virginia)
    US West (Oregon)
    Asia Pacific (Singapore)
    Asia Pacific (Sydney)
    Asia Pacific (Tokyo)
    Europe (Frankfurt)
    Europe (Ireland)
    Europe (London)
  2. For LexV2BotId, enter the value for BotId.
  3. For LexV2BotAliasId, enter the value for BotAliasId.
  4. Launch the stack.
  5. When deployment is complete, locate the Amazon Simple Storage Service (Amazon S3) URL for WebAppBucket.
  6. Navigate to the S3 bucket on the Amazon S3 console and download the lex-web-ui-loader-config.json file.
  7. Open the file and modify or add the following parameters:
    1. In the connect configuration section, add the new parameter talkDeskWebsocketEndpoint and set its value to the WebSocket endpoint.
    2. In the UI configuration section, set enableLiveChat to true.

  8. Upload the modified lex-web-ui-loader-config.json file and overwrite the previous version of the file in the S3 bucket.
  9. Return to the CloudFormation stack Outputs tab and find the WebAppDomainName link.

This will redirect you to a full-page version of the Amazon Lex Web UI. From here, you can test the Talkdesk integration and confirm that the bot is able to connect to Talkdesk using the WebSocket connection.
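
The configuration change in step 7 can also be scripted instead of edited by hand. The following Python sketch applies the two settings to a parsed copy of lex-web-ui-loader-config.json. It assumes the connect and UI configuration sections are stored under the top-level keys connect and ui. The function name enable_talkdesk_live_chat and the endpoint value are hypothetical placeholders.

```python
import json
from typing import Any, Dict


def enable_talkdesk_live_chat(config: Dict[str, Any], websocket_endpoint: str) -> Dict[str, Any]:
    """Return a copy of the loader config with the two live chat settings applied."""
    updated = json.loads(json.dumps(config))  # deep copy through a JSON round trip
    # Assumed key names for the connect and UI configuration sections.
    updated.setdefault("connect", {})["talkDeskWebsocketEndpoint"] = websocket_endpoint
    updated.setdefault("ui", {})["enableLiveChat"] = True
    return updated


# Minimal stand-in for the downloaded file; the real file has many more settings.
original_config = {"connect": {}, "ui": {"enableLiveChat": False}}
placeholder_endpoint = "wss://example.invalid/dev"  # use your WebSocketAPIEndpoint value

updated_config = enable_talkdesk_live_chat(original_config, placeholder_endpoint)
print(json.dumps(updated_config, indent=2))
```

Returning a copy leaves the downloaded settings untouched, which makes it easy to compare the two versions before uploading the modified file.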

Test the solution

Now you’re ready to try the Amazon Lex and Talkdesk chat interaction:

  1. Start your Banking Bot chat window using the WebAppUrl provided as output in the CloudFormation stack.
  2. Log in to your Talkdesk Digital Connect channel and navigate to Conversations.
  3. In the Banking Bot chat window, request to talk to an agent.
  4. Watch the customer’s message being delivered to the Talkdesk Conversations Inbox.
  5. The Talkdesk agent self-assigns the conversation and starts engaging with the customer.

The following video demonstrates the chat experience.

Clean up

To clean up your resources, complete the following steps:

  1. On the AWS CloudFormation console, select Stacks in the navigation pane.
  2. Select the LexTalkdesk stack (or the stack name you provided), and select Delete.
  3. Delete the stack resources by selecting Delete stack.

Conclusion

Amazon Lex brings the power of conversational self-service to your customers’ preferred channels, such as phone, web chat, and messaging applications. In this post, we demonstrated a solution that provides live agent assistance on your website with Amazon Lex, Amazon Lex Web UI, and Talkdesk cloud contact center. We provided a CloudFormation stack that includes DynamoDB and Lambda resources, and a REST API and WebSocket API in API Gateway to maintain a communication session with agents in the Talkdesk contact center.

This solution is meant to be a reference architecture or a quick implementation guide that can be tailored to suit your organization’s requirements. If you need help setting up this solution, AWS Professional Services and Talkdesk are available to help you and your team through the process of selecting the right technologies for your cloud contact center.


About the authors

Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specialises in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family time.

Austin Johnson is a Solutions Architect, helping to maintain the Lex Web UI open source library.

Chris Brown is a Principal Natural Language AI consultant at AWS focused on digital customer experiences – including mobile apps, websites, marketing campaigns, and most recently conversational AI applications. Chris is an award-winning strategist and product manager – working with the Fortune 100 to deliver the best experiences for their customers. In his free time, Chris enjoys traveling, music, art, and experiencing new cultures.

Bruno Mateus is a Principal Engineer at Talkdesk. With over 20 years of experience in the software industry, he specialises in large-scale distributed systems. When not working, he enjoys spending time outside with his family, trekking, mountain bike riding, and motorcycle riding.

Jonathan Diedrich is a Principal Solutions Consultant at Talkdesk. He works on enterprise and strategic projects to ensure technical execution and adoption. Outside of work, he enjoys ice hockey and games with the family.

Crispim Tribuna is a Senior Software Engineer at Talkdesk currently focusing on the AI-based virtual agent project. He has over 17 years of experience in computer science, with a focus on telecommunications, IPTV, and fraud prevention. In his free time, he enjoys spending time with his family, running (he has completed three marathons), and riding motorcycles.

Advanced RAG patterns on Amazon SageMaker

Today, customers in all industries, including financial services, healthcare and life sciences, travel and hospitality, media and entertainment, telecommunications, software as a service (SaaS), and even proprietary model providers, are using large language models (LLMs) to build applications like question and answering (QnA) chatbots, search engines, and knowledge bases. These generative AI applications not only automate existing business processes, but can also transform the experience for the customers who use them. With the advancements being made in LLMs such as Mixtral-8x7B Instruct, which is built on a mixture of experts (MoE) architecture, customers are continuously looking for ways to improve the performance and accuracy of generative AI applications while effectively using a wider range of closed and open source models.

A number of techniques are typically used to improve the accuracy and performance of an LLM’s output, such as fine-tuning with parameter efficient fine-tuning (PEFT), reinforcement learning from human feedback (RLHF), and performing knowledge distillation. However, when building generative AI applications, you can use an alternative solution that allows for the dynamic incorporation of external knowledge and allows you to control the information used for generation without the need to fine-tune your existing foundational model. This is where Retrieval Augmented Generation (RAG) comes in, specifically for generative AI applications as opposed to the more expensive and robust fine-tuning alternatives we’ve discussed. If you’re implementing complex RAG applications into your daily tasks, you may encounter common challenges with your RAG systems such as inaccurate retrieval, increasing size and complexity of documents, and overflow of context, which can significantly impact the quality and reliability of generated answers.

This post discusses RAG patterns to improve response accuracy using LangChain and tools such as the parent document retriever in addition to techniques like contextual compression in order to enable developers to improve existing generative AI applications.

Solution overview

In this post, we demonstrate the use of Mixtral-8x7B Instruct text generation combined with the BGE Large En embedding model to efficiently construct a RAG QnA system on an Amazon SageMaker notebook using the parent document retriever tool and contextual compression technique. The following diagram illustrates the architecture of this solution.

You can deploy this solution with just a few clicks using Amazon SageMaker JumpStart, a fully managed platform that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly and with ease, accelerating the development and deployment of machine learning (ML) applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as the Mixtral-8x7B, for a variety of tasks.

Mixtral-8x7B uses an MoE architecture. This architecture allows different parts of a neural network to specialize in different tasks, effectively dividing the workload among multiple experts. This approach enables the efficient training and deployment of larger models compared to traditional architectures.

One of the main advantages of the MoE architecture is its scalability. By distributing the workload across multiple experts, MoE models can be trained on larger datasets and achieve better performance than traditional models of the same size. Additionally, MoE models can be more efficient during inference because only a subset of experts needs to be activated for a given input.
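
To make the idea of activating only a subset of experts concrete, the following is a toy Python sketch of top-k expert routing. It is a simplified illustration and not the Mixtral-8x7B implementation: the experts are plain scalar functions, the gate scores are given rather than learned, and the name moe_forward and the choice of top_k=2 are assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def moe_forward(
    x: float,
    gate_scores: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    top_k: int = 2,
) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    selected = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in selected])
    output = sum(weight * experts[i](x) for weight, i in zip(weights, selected))
    return output, selected


# Four toy experts; only two of them are evaluated for this input.
toy_experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
toy_gate_scores = [0.1, 2.0, -1.0, 2.0]

moe_output, active_experts = moe_forward(3.0, toy_gate_scores, toy_experts)
print(moe_output, active_experts)
```

The unselected experts are never called, which is where the inference savings described above come from.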

For more information on Mixtral-8x7B Instruct on AWS, refer to Mixtral-8x7B is now available in Amazon SageMaker JumpStart. The Mixtral-8x7B model is made available under the permissive Apache 2.0 license, for use without restrictions.

In this post, we discuss how you can use LangChain to create effective and more efficient RAG applications. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.

We walk through constructing a RAG pipeline on SageMaker with Mixtral-8x7B. We use the Mixtral-8x7B Instruct text generation model with the BGE Large En embedding model to create an efficient QnA system using RAG on a SageMaker notebook. We use an ml.t3.medium instance to demonstrate deploying LLMs via SageMaker JumpStart, which can be accessed through a SageMaker-generated API endpoint. This setup allows for the exploration, experimentation, and optimization of advanced RAG techniques with LangChain. We also illustrate the integration of the FAISS Embedding store into the RAG workflow, highlighting its role in storing and retrieving embeddings to enhance the system’s performance.

We perform a brief walkthrough of the SageMaker notebook. For more detailed and step-by-step instructions, refer to the Advanced RAG Patterns with Mixtral on SageMaker Jumpstart GitHub repo.

The need for advanced RAG patterns

Advanced RAG patterns are essential to improve upon the current capabilities of LLMs in processing, understanding, and generating human-like text. As the size and complexity of documents increase, representing multiple facets of the document in a single embedding can lead to a loss of specificity. Although it’s essential to capture the general essence of a document, it’s equally crucial to recognize and represent the varied sub-contexts within. This is a challenge you are often faced with when working with larger documents. Another challenge with RAG is that with retrieval, you aren’t aware of the specific queries that your document storage system will deal with upon ingestion. This could lead to information most relevant to a query being buried under text (context overflow). To mitigate failure and improve upon the existing RAG architecture, you can use advanced RAG patterns (parent document retriever and contextual compression) to reduce retrieval errors, enhance answer quality, and enable complex question handling.

With the techniques discussed in this post, you can address key challenges associated with external knowledge retrieval and integration, enabling your application to deliver more precise and contextually aware responses.

In the following sections, we explore how parent document retrievers and contextual compression can help you deal with some of the problems we’ve discussed.

Parent document retriever

In the previous section, we highlighted challenges that RAG applications encounter when dealing with extensive documents. To address these challenges, parent document retrievers categorize and designate incoming documents as parent documents. These documents are recognized for their comprehensive nature but aren’t directly utilized in their original form for embeddings. Rather than compressing an entire document into a single embedding, parent document retrievers dissect these parent documents into child documents. Each child document captures distinct aspects or topics from the broader parent document. Following the identification of these child segments, individual embeddings are assigned to each, capturing their specific thematic essence (see the following diagram). During retrieval, the parent document is invoked. This technique provides targeted yet broad-ranging search capabilities, furnishing the LLM with a wider perspective. Parent document retrievers provide LLMs with a twofold advantage: the specificity of child document embeddings for precise and relevant information retrieval, coupled with the invocation of parent documents for response generation, which enriches the LLM’s outputs with a layered and thorough context.
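
The following Python sketch shows the mechanics in miniature: child chunks are matched against the query, but the full parent document is what gets returned. It is a toy illustration that uses word overlap in place of real embeddings and does not use the LangChain API. The class ToyParentDocumentRetriever, the chunk size, and the sample documents are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase words with trailing punctuation removed."""
    return {word.strip(".,?").lower() for word in text.split()}


def split_into_children(parent_text: str, chunk_words: int) -> List[str]:
    """Split a parent document into fixed-size word chunks."""
    words = parent_text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


class ToyParentDocumentRetriever:
    def __init__(self, chunk_words: int = 8) -> None:
        self.chunk_words = chunk_words
        self.parents: Dict[str, str] = {}
        # Each entry is (parent_id, child tokens); tokens stand in for an embedding.
        self.children: List[Tuple[str, Set[str]]] = []

    def add(self, parent_id: str, text: str) -> None:
        self.parents[parent_id] = text
        for child in split_into_children(text, self.chunk_words):
            self.children.append((parent_id, tokenize(child)))

    def retrieve(self, query: str) -> str:
        """Match the query against child chunks, then return the whole parent."""
        query_tokens = tokenize(query)
        best_parent, _ = max(self.children, key=lambda child: len(query_tokens & child[1]))
        return self.parents[best_parent]


billing_doc = "Invoices are issued monthly. Refunds are processed within five business days after approval."
shipping_doc = "Orders ship from the regional warehouse. Delivery tracking numbers are sent by email."

retriever = ToyParentDocumentRetriever(chunk_words=8)
retriever.add("billing", billing_doc)
retriever.add("shipping", shipping_doc)

retrieved = retriever.retrieve("How long do refunds take?")
print(retrieved)
```

The match happens on a small chunk that mentions refunds, yet the caller receives the entire billing document, including the processing time that sits in a different chunk.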

Contextual compression

To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.

Prerequisites

If you’re new to SageMaker, refer to the Amazon SageMaker Development Guide.

Before you get started with the solution, create an AWS account. When you create an AWS account, you begin with a single sign-in identity that has complete access to all the AWS services and resources in the account. This identity is called the AWS account root user.

Signing in to the AWS Management Console using the email address and password that you used to create the account gives you complete access to all the AWS resources in your account. We strongly recommend that you do not use the root user for everyday tasks, even the administrative ones.

Instead, adhere to the security best practices in AWS Identity and Access Management (IAM), and create an administrative user and group. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.

The Mixtral-8x7B model requires an ml.g5.48xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. To launch an endpoint to host Mixtral-8x7B from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint usage. You can request service quota increases through the console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.

Set up a SageMaker notebook instance and install dependencies

To get started, create a SageMaker notebook instance and install the required dependencies. Refer to the GitHub repo to ensure a successful setup. After you set up the notebook instance, you can deploy the model.

You can also run the notebook locally in your preferred integrated development environment (IDE). Make sure that you have JupyterLab or Jupyter Notebook installed.

Deploy the model

Deploy the Mixtral-8x7B Instruct LLM on SageMaker JumpStart:

# Import the JumpStartModel class from the SageMaker JumpStart library
from sagemaker.jumpstart.model import JumpStartModel

# Specify the model ID for the HuggingFace Mixtral 8x7b Instruct LLM model
model_id = "huggingface-llm-mixtral-8x7b-instruct"
model = JumpStartModel(model_id=model_id)
llm_predictor = model.deploy()

Deploy the BGE Large En embedding model on SageMaker JumpStart:

# Specify the model ID for the HuggingFace BGE Large EN Embedding model
model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()

Set up LangChain

After importing all the necessary libraries and deploying the Mixtral-8x7B model and BGE Large En embeddings model, you can now set up LangChain. For step-by-step instructions, refer to the GitHub repo.

Data preparation

In this post, we use several years of Amazon’s Letters to Shareholders as a text corpus to perform QnA on. For more detailed steps to prepare the data, refer to the GitHub repo.

Question answering

Once the data is prepared, you can use the wrapper provided by LangChain, which wraps around the vector store and takes input for the LLM. This wrapper performs the following steps:

  1. Take the input question.
  2. Create a question embedding.
  3. Fetch relevant documents.
  4. Incorporate the documents and the question into a prompt.
  5. Invoke the model with the prompt and generate the answer in a readable manner.
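The wrapper's internals are not shown in this post, so the following is only a library-free sketch of the five steps above, not the LangChain implementation. The bag-of-words `embed` function stands in for a real embedding model, `llm` is any callable that takes a prompt string, and all function names are hypothetical.

```python
from collections import Counter
import math


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, documents, k=2):
    """Steps 1-4: embed the question, fetch the k closest documents, fill the prompt."""
    question_vector = embed(question)
    ranked = sorted(
        documents, key=lambda doc: cosine(question_vector, embed(doc)), reverse=True
    )
    context = "\n\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"


def answer(question, documents, llm, k=2):
    """Step 5: invoke the model with the assembled prompt."""
    return llm(build_prompt(question, documents, k))


# Usage with a stand-in "model" that simply echoes the prompt it receives
corpus = ["aws launched ec2 in 2006", "kindle is a reading device"]
print(answer("when was ec2 launched", corpus, llm=lambda prompt: prompt, k=1))
```

In a real deployment, the embedding and the model call would go to the two SageMaker endpoints deployed earlier, and the similarity search would be handled by the vector store.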

Now that the vector store is in place, you can start asking questions:

prompt_template = """<s>[INST]
{query}
[/INST]"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
query = "How has AWS evolved?"
answer = wrapper_store_faiss.query(question=PROMPT.format(query=query), llm=llm)
print(answer)
AWS, or Amazon Web Services, has evolved significantly since its initial launch in 2006. It started as a feature-poor service, offering only one instance size, in one data center, in one region of the world, with Linux operating system instances only. There was no monitoring, load balancing, auto-scaling, or persistent storage at the time. However, AWS had a successful launch and has since grown into a multi-billion-dollar service.

Over the years, AWS has added numerous features and services, with over 3,300 new ones launched in 2022 alone. They have expanded their offerings to include Windows, monitoring, load balancing, auto-scaling, and persistent storage. AWS has also made significant investments in long-term inventions that have changed what's possible in technology infrastructure.

One example of this is their investment in chip development. AWS has also seen a robust new customer pipeline and active migrations, with many companies opting to move to AWS for the agility, innovation, cost-efficiency, and security benefits it offers. AWS has transformed how customers, from start-ups to multinational companies to public sector organizations, manage their technology infrastructure.

Regular retriever chain

In the preceding scenario, we explored the quick and straightforward way to get a context-aware answer to your question. Now let’s look at a more customizable option with the help of RetrievalQA, where you can customize how the documents fetched should be added to the prompt using the chain_type parameter. Also, in order to control how many relevant documents should be retrieved, you can change the k parameter in the following code to see different outputs. In many scenarios, you might want to know which source documents the LLM used to generate the answer. You can get those documents in the output using return_source_documents, which returns the documents that are added to the context of the LLM prompt. RetrievalQA also allows you to provide a custom prompt template that can be specific to the model.

from langchain.chains import RetrievalQA

prompt_template = """<s>[INST]
Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

[/INST]"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS (Amazon Web Services) evolved from an initially unprofitable investment to an $85B annual revenue run rate business with strong profitability, offering a wide range of services and features, and becoming a significant part of Amazon's portfolio. Despite facing skepticism and short-term headwinds, AWS continued to innovate, attract new customers, and migrate active customers, offering benefits such as agility, innovation, cost-efficiency, and security. AWS also expanded its long-term investments, including chip development, to provide new capabilities and change what's possible for its customers.

Parent document retriever chain

Let’s look at a more advanced RAG option with the help of ParentDocumentRetriever. When working with document retrieval, you may encounter a trade-off between storing small chunks of a document for accurate embeddings and larger documents to preserve more context. The parent document retriever strikes that balance by indexing small chunks for search while returning the larger parent chunks they belong to.

We use a parent_splitter to divide the original documents into larger chunks called parent documents and a child_splitter to create smaller child documents from the original documents:

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore_faiss = FAISS.from_documents(
    child_splitter.split_documents(documents),
    sagemaker_embeddings,
)

The child documents are then indexed in a vector store using embeddings. This enables efficient retrieval of relevant child documents based on similarity. To retrieve relevant information, the parent document retriever first fetches the child documents from the vector store. It then looks up the parent IDs for those child documents and returns the corresponding larger parent documents.
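To make the child-to-parent lookup concrete, here is a minimal pure-Python sketch. It is not the LangChain implementation: the word-overlap `score` function stands in for embedding similarity, children are cut from each parent chunk (an assumption of this sketch), splitting is by raw character count, and `split`, `build_index`, and `retrieve_parents` are hypothetical helper names.

```python
def split(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def score(query, chunk):
    """Toy relevance score: number of distinct query words found in the chunk."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in chunk_words)


def build_index(documents, parent_size, child_size):
    """Return (parents, children); each child records the ID of its parent."""
    parents, children = {}, []
    for doc in documents:
        for parent_text in split(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent_text
            for child_text in split(parent_text, child_size):
                children.append((parent_id, child_text))
    return parents, children


def retrieve_parents(query, parents, children, k=2):
    """Rank children by score, then return the de-duplicated parents of the top k."""
    ranked = sorted(children, key=lambda child: score(query, child[1]), reverse=True)
    parent_ids = []
    for parent_id, _ in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


# Usage: search over small children, hand the larger parent to the LLM
documents = ["alpha beta gamma delta", "one two three four five"]
parents, children = build_index(documents, parent_size=100, child_size=11)
print(retrieve_parents("gamma", parents, children, k=1))
```

The search runs over the small chunks, which keeps matching precise, while the text handed to the LLM is the larger parent, which preserves surrounding context.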

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,  # the parent document retriever
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS (Amazon Web Services) started with a feature-poor initial launch of the Elastic Compute Cloud (EC2) service in 2006, providing only one instance size, in one data center, in one region of the world, with Linux operating system instances only, and without many key features like monitoring, load balancing, auto-scaling, or persistent storage. However, AWS's success allowed them to quickly iterate and add the missing capabilities, eventually expanding to offer various flavors, sizes, and optimizations of compute, storage, and networking, as well as developing their own chips (Graviton) to push price and performance further. AWS's iterative innovation process required significant investments in financial and people resources over 20 years, often well in advance of when it would pay out, to meet customer needs and improve long-term customer experiences, loyalty, and returns for shareholders.

Contextual compression chain

Let’s look at another advanced RAG option called contextual compression. One challenge with retrieval is that you usually don’t know the specific queries your document storage system will face when you ingest data into it. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

The contextual compression retriever addresses this challenge by compressing and filtering the retrieved documents based on the query context, so that only the most relevant information is returned.

To use the contextual compression retriever, you’ll need:

  • A base retriever – This is the initial retriever that fetches documents from the storage system based on the query
  • A document compressor – This component takes the initially retrieved documents and shortens them by reducing the contents of individual documents or dropping irrelevant documents altogether, using the query context to determine relevance
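The two compression behaviors described above, shortening documents and dropping them altogether, can be sketched without any library. In this illustration a word-overlap check stands in for the LLM's relevance judgment, and every function name is hypothetical.

```python
def relevant(query, text):
    """Toy relevance test: the text shares at least one word with the query."""
    return bool(set(query.lower().split()) & set(text.lower().split()))


def extract(query, document):
    """Extractor-style compression: keep only the sentences relevant to the query."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return ". ".join(s for s in sentences if relevant(query, s))


def compress(query, documents):
    """Shorten each retrieved document and drop those left empty."""
    compressed = (extract(query, doc) for doc in documents)
    return [doc for doc in compressed if doc]


def filter_documents(query, documents):
    """Filter-style compression: keep or drop whole documents, contents unchanged."""
    return [doc for doc in documents if relevant(query, doc)]


# Usage: `retrieved` plays the role of the base retriever's output
retrieved = [
    "AWS launched EC2 in 2006. The weather was sunny",
    "Kindle is a reading device",
]
print(compress("EC2 launch", retrieved))
print(filter_documents("EC2 launch", retrieved))
```

Either way, the prompt that reaches the LLM is shorter than the raw retrieval result, which is where the cost and quality benefits come from.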

Adding contextual compression with an LLM chain extractor

First, wrap your base retriever with a ContextualCompressionRetriever. You’ll add an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

text_splitter = RecursiveCharacterTextSplitter(
    # Chunk size and overlap are measured in characters.
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(
    docs,
    sagemaker_embeddings,
).as_retriever()

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "How was Amazon impacted by COVID-19?"
)

Initialize the chain using the ContextualCompressionRetriever with an LLMChainExtractor and pass the prompt in via the chain_type_kwargs argument.

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS evolved by starting as a small project inside Amazon, requiring significant capital investment and facing skepticism from both inside and outside the company. However, AWS had a head start on potential competitors and believed in the value it could bring to customers and Amazon. AWS made a long-term commitment to continue investing, resulting in over 3,300 new features and services launched in 2022. AWS has transformed how customers manage their technology infrastructure and has become an $85B annual revenue run rate business with strong profitability. AWS has also continuously improved its offerings, such as enhancing EC2 with additional features and services after its initial launch.

Filter documents with an LLM chain filter

The LLMChainFilter is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents:

from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    "How was Amazon impacted by COVID-19?"
)
print(compressed_docs)

Initialize the chain using the ContextualCompressionRetriever with an LLMChainFilter and pass the prompt in via the chain_type_kwargs argument.

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let’s ask a question:

query = "How did AWS evolve?"
result = qa({"query": query})
print(result['result'])
AWS (Amazon Web Services) evolved by initially launching feature-poor but iterating quickly based on customer feedback to add necessary capabilities. This approach allowed AWS to launch EC2 in 2006 with limited features and then continuously add new functionalities, such as additional instance sizes, data centers, regions, operating system options, monitoring tools, load balancing, auto-scaling, and persistent storage. Over time, AWS transformed from a feature-poor service to a multi-billion-dollar business by focusing on customer needs, agility, innovation, cost-efficiency, and security. AWS now has an $85B annual revenue run rate and offers over 3,300 new features and services each year, catering to a wide range of customers from start-ups to multinational companies and public sector organizations.

Compare results

The following comparison shows the output of each technique for two queries, along with our observations.

Query 1: How did AWS evolve?
Query 2: Why is Amazon successful?

Regular Retriever Chain Output

  • Query 1 – AWS (Amazon Web Services) evolved from an initially unprofitable investment to an $85B annual revenue run rate business with strong profitability, offering a wide range of services and features, and becoming a significant part of Amazon’s portfolio. Despite facing skepticism and short-term headwinds, AWS continued to innovate, attract new customers, and migrate active customers, offering benefits such as agility, innovation, cost-efficiency, and security. AWS also expanded its long-term investments, including chip development, to provide new capabilities and change what’s possible for its customers.
  • Query 2 – Amazon is successful due to its continuous innovation and expansion into new areas such as technology infrastructure services, digital reading devices, voice-driven personal assistants, and new business models like the third-party marketplace. Its ability to scale operations quickly, as seen in the rapid expansion of its fulfillment and transportation networks, also contributes to its success. Additionally, Amazon’s focus on optimization and efficiency gains in its processes has resulted in productivity improvements and cost reductions. The example of Amazon Business highlights the company’s capability to leverage its e-commerce and logistics strengths in different sectors.
  • Comparison – Based on the responses from the regular retriever chain, we notice that although it provides long answers, it suffers from context overflow and fails to mention any significant details from the corpus in regards to responding to the query provided. The regular retrieval chain is not able to capture the nuances with depth or contextual insight, potentially missing critical aspects of the document.

Parent Document Retriever Output

  • Query 1 – AWS (Amazon Web Services) started with a feature-poor initial launch of the Elastic Compute Cloud (EC2) service in 2006, providing only one instance size, in one data center, in one region of the world, with Linux operating system instances only, and without many key features like monitoring, load balancing, auto-scaling, or persistent storage. However, AWS’s success allowed them to quickly iterate and add the missing capabilities, eventually expanding to offer various flavors, sizes, and optimizations of compute, storage, and networking, as well as developing their own chips (Graviton) to push price and performance further. AWS’s iterative innovation process required significant investments in financial and people resources over 20 years, often well in advance of when it would pay out, to meet customer needs and improve long-term customer experiences, loyalty, and returns for shareholders.
  • Query 2 – Amazon is successful due to its ability to constantly innovate, adapt to changing market conditions, and meet customer needs in various market segments. This is evident in the success of Amazon Business, which has grown to drive roughly $35B in annualized gross sales by delivering selection, value, and convenience to business customers. Amazon’s investments in ecommerce and logistics capabilities have also enabled the creation of services like Buy with Prime, which helps merchants with direct-to-consumer websites drive conversion from views to purchases.
  • Comparison – The parent document retriever delves deeper into the specifics of AWS’s growth strategy, including the iterative process of adding new features based on customer feedback and the detailed journey from a feature-poor initial launch to a dominant market position, while providing a context-rich response. Responses cover a wide range of aspects, from technical innovations and market strategy to organizational efficiency and customer focus, providing a holistic view of the factors contributing to success along with examples. This can be attributed to the parent document retriever’s targeted yet broad-ranging search capabilities.

LLM Chain Extractor: Contextual Compression Output

  • Query 1 – AWS evolved by starting as a small project inside Amazon, requiring significant capital investment and facing skepticism from both inside and outside the company. However, AWS had a head start on potential competitors and believed in the value it could bring to customers and Amazon. AWS made a long-term commitment to continue investing, resulting in over 3,300 new features and services launched in 2022. AWS has transformed how customers manage their technology infrastructure and has become an $85B annual revenue run rate business with strong profitability. AWS has also continuously improved its offerings, such as enhancing EC2 with additional features and services after its initial launch.
  • Query 2 – Based on the provided context, Amazon’s success can be attributed to its strategic expansion from a book-selling platform to a global marketplace with a vibrant third-party seller ecosystem, early investment in AWS, innovation in introducing the Kindle and Alexa, and substantial growth in annual revenue from 2019 to 2022. This growth led to the expansion of the fulfillment center footprint, creation of a last-mile transportation network, and building a new sortation center network, which were optimized for productivity and cost reductions.
  • Comparison – The LLM chain extractor maintains a balance between covering key points comprehensively and avoiding unnecessary depth. It dynamically adjusts to the query’s context, so the output is directly relevant and comprehensive.

LLM Chain Filter: Contextual Compression Output

  • Query 1 – AWS (Amazon Web Services) evolved by initially launching feature-poor but iterating quickly based on customer feedback to add necessary capabilities. This approach allowed AWS to launch EC2 in 2006 with limited features and then continuously add new functionalities, such as additional instance sizes, data centers, regions, operating system options, monitoring tools, load balancing, auto-scaling, and persistent storage. Over time, AWS transformed from a feature-poor service to a multi-billion-dollar business by focusing on customer needs, agility, innovation, cost-efficiency, and security. AWS now has an $85B annual revenue run rate and offers over 3,300 new features and services each year, catering to a wide range of customers from start-ups to multinational companies and public sector organizations.
  • Query 2 – Amazon is successful due to its innovative business models, continuous technological advancements, and strategic organizational changes. The company has consistently disrupted traditional industries by introducing new ideas, such as an ecommerce platform for various products and services, a third-party marketplace, cloud infrastructure services (AWS), the Kindle e-reader, and the Alexa voice-driven personal assistant. Additionally, Amazon has made structural changes to improve its efficiency, such as reorganizing its US fulfillment network to decrease costs and delivery times, further contributing to its success.
  • Comparison – Similar to the LLM chain extractor, the LLM chain filter makes sure that although the key points are covered, the output is efficient for customers looking for concise and contextual answers.

Upon comparing these different techniques, we can see that in contexts like detailing AWS’s transition from a simple service to a complex, multi-billion-dollar entity, or explaining Amazon’s strategic successes, the regular retriever chain lacks the precision the more sophisticated techniques offer, leading to less targeted information. Although very few differences are visible between the advanced techniques discussed, they are far more informative than the regular retriever chain.

For customers in industries such as healthcare, telecommunications, and financial services who are looking to implement RAG in their applications, the limitations of the regular retriever chain in providing precision, avoiding redundancy, and effectively compressing information make it less suited to fulfilling these needs compared to the more advanced parent document retriever and contextual compression techniques. These techniques are able to distill vast amounts of information into the concentrated, impactful insights that you need, while helping improve price-performance.

Clean up

When you’re done running the notebook, delete the resources you created in order to avoid accrual of charges for the resources in use:

# Delete resources
llm_predictor.delete_model()
llm_predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

Conclusion

In this post, we presented a solution that allows you to implement the parent document retriever and contextual compression chain techniques to enhance the ability of LLMs to process and generate information. We tested out these advanced RAG techniques with the Mixtral-8x7B Instruct and BGE Large En models available with SageMaker JumpStart. We also explored using persistent storage for embeddings and document chunks and integration with enterprise data stores.

The techniques we demonstrated not only refine the way LLMs access and incorporate external knowledge, but also significantly improve the quality, relevance, and efficiency of their outputs. By combining retrieval from large text corpora with language generation capabilities, these advanced RAG techniques enable LLMs to produce more factual, coherent, and context-appropriate responses, enhancing their performance across various natural language processing tasks.

SageMaker JumpStart is at the center of this solution. With SageMaker JumpStart, you gain access to an extensive assortment of open and closed source models, streamlining the process of getting started with ML and enabling rapid experimentation and deployment. To get started deploying this solution, navigate to the notebook in the GitHub repo.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and Data Analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions and conducting research to help customers hyper-scale on AWS. Marco is a digital native cloud advisor with experience in the FinTech, Healthcare & Life Sciences, Software-as-a-service, and most recently, in Telecommunications industries. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA and enjoys writing, reading, exercising, and building applications in his free time.

AJ Dhimine is a Solutions Architect at AWS. He specializes in generative AI, serverless computing, and data analytics. He is an active member and mentor in the Machine Learning Technical Field Community and has published several scientific papers on various AI/ML topics. He works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. He is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges. Outside of work, AJ enjoys traveling and has visited 53 countries so far, with a goal of visiting every country in the world.

Efficient continual pre-training LLMs for financial domains

Large language models (LLMs) are generally trained on large publicly available datasets that are domain agnostic. For example, Meta’s Llama models are trained on datasets such as CommonCrawl, C4, Wikipedia, and ArXiv. These datasets encompass a broad range of topics and domains. Although the resulting models yield amazingly good results for general tasks, such as text generation and entity recognition, there is evidence that models trained with domain-specific datasets can further improve LLM performance. For example, the training data used for BloombergGPT is 51% domain-specific documents, including financial news, filings, and other financial materials. The resulting LLM outperforms LLMs trained on non-domain-specific datasets when tested on finance-specific tasks. The authors of BloombergGPT concluded that their model outperforms all other models tested for four of the five financial tasks. The model provided even better performance when tested for Bloomberg’s internal financial tasks by a wide margin—as much as 60 points better (out of 100). Although you can learn more about the comprehensive evaluation results in the paper, the following sample captured from the BloombergGPT paper can give you a glimpse of the benefit of training LLMs using financial domain-specific data. As shown in the example, the BloombergGPT model provided correct answers while other non-domain-specific models struggled:

Input: CEO of Assicurazioni Generali?
BloombergGPT: Philippe Donnet
GPT-NeoX: Antonio De Lorenzo, Simone Gambarini, Enrico Zanetti
FLAN-T5-XXL: John M Forsyth, Christopher K Peters, {empty string}

Input: CEO of Silicon Valley Bank?
BloombergGPT: Greg Becker
GPT-NeoX: John Hodor, Mike Fridgen, Peter R Briger
FLAN-T5-XXL: Christian Klein, John M Forsyth, John M Forsyth

Input: CEO of IHS Markit Ltd?
BloombergGPT: Lance Uggla
GPT-NeoX: John F Lechleiter, David Steward, Niall O Broin
FLAN-T5-XXL: William M Hawkins, John M Forsyth, John M Forsyth

Input: CEO of Walgreens Boots Alliance?
BloombergGPT: Rosalind Brewer
GPT-NeoX: Mark V Dei, Stefano Pessina, Stefano Pessina
FLAN-T5-XXL: Christian Klein, John M Forsyth, John M Forsyth

Input: CEO of Citigroup Inc?
BloombergGPT: Jane Fraser
GPT-NeoX: Michael L Corbat, Michael L Corbat, Michael L Corbat
FLAN-T5-XXL: Christian Sewing, John M Forsyth, John M Forsyth

This post provides a guide to training LLMs specifically for the financial domain. We cover the following key areas:

  • Data collection and preparation – Guidance on sourcing and curating relevant financial data for effective model training
  • Continual pre-training vs. fine-tuning – When to use each technique to optimize your LLM’s performance
  • Efficient continual pre-training – Strategies to streamline the continual pre-training process, saving time and resources

This post brings together the expertise of the applied science research team within Amazon Finance Technology and the AWS Worldwide Specialist team for the Global Financial Industry. Some of the content is based on the paper Efficient Continual Pre-training for Building Domain Specific Large Language Models.

Collecting and preparing finance data

Domain continual pre-training necessitates a large-scale, high-quality, domain-specific dataset. The following are the main steps for domain dataset curation:

  • Identify data sources – Potential data sources for domain corpus include open web, Wikipedia, books, social media, and internal documents.
  • Domain data filters – Because the ultimate goal is to curate a domain corpus, you might need to apply additional steps to filter out samples that are irrelevant to the target domain. This removes unhelpful text from the continual pre-training corpus and reduces training cost.
  • Preprocessing – You might consider a series of preprocessing steps to improve data quality and training efficiency. For example, certain data sources can contain a fair number of noisy tokens; deduplication is considered a useful step to improve data quality and reduce training cost.

To develop financial LLMs, you can use two important data sources: News CommonCrawl and SEC filings. An SEC filing is a financial statement or other formal document submitted to the US Securities and Exchange Commission (SEC). Publicly listed companies are required to file various documents regularly. This creates a large number of documents over the years. News CommonCrawl is a dataset released by CommonCrawl in 2016. It contains news articles from news sites all over the world.

News CommonCrawl is available on Amazon Simple Storage Service (Amazon S3) in the commoncrawl bucket at crawl-data/CC-NEWS/. You can get the listings of files using the AWS Command Line Interface (AWS CLI) and the following command:

aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

In Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors use a URL and keyword-based approach to filter financial news articles from generic news. Specifically, the authors maintain a list of important financial news outlets and a set of keywords related to financial news. We identify an article as financial news if either it comes from financial news outlets or any keywords show up in the URL. This simple yet effective approach enables you to identify financial news from not only financial news outlets but also finance sections of generic news outlets.
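The URL and keyword-based filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the outlet domains and keywords below are hypothetical placeholders, and the function name is our own.

```python
from urllib.parse import urlparse

# Hypothetical lists for illustration only; the lists used in the paper are not reproduced here.
FINANCIAL_OUTLETS = {"example-finance-news.com", "markets.example.org"}
FINANCIAL_KEYWORDS = {"finance", "markets", "stocks", "earnings", "economy"}


def is_financial_news(url: str) -> bool:
    """Return True if the URL belongs to a listed outlet or contains a finance keyword."""
    parsed = urlparse(url.lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    # Rule 1: the article comes from a known financial news outlet.
    if host in FINANCIAL_OUTLETS:
        return True
    # Rule 2: a keyword shows up anywhere in the URL, for example in a
    # /business/markets/ section of a generic news site.
    location = host + parsed.path
    return any(keyword in location for keyword in FINANCIAL_KEYWORDS)


print(is_financial_news("https://generic-news.example.com/business/markets/story"))
```

Plain substring matching keeps the filter cheap enough to run over every URL in the crawl index, at the cost of occasional false positives that later preprocessing steps can tolerate.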

SEC filings are available online through the SEC’s EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database, which provides open data access. You can scrape the filings from EDGAR directly, or use APIs in Amazon SageMaker with a few lines of code, for any period of time and for a large number of tickers (i.e., the SEC assigned identifier). To learn more, refer to SEC Filing Retrieval.

The following table summarizes the key details of both data sources.

Attribute   News CommonCrawl     SEC Filing
Coverage    2016-2022            1993-2022
Size        25.8 billion words   5.1 billion words

The authors go through a few extra preprocessing steps before the data is fed into a training algorithm. First, we observe that SEC filings contain noisy text due to the removal of tables and figures, so the authors remove short sentences that are deemed to be table or figure labels. Second, we apply a locality sensitive hashing algorithm to deduplicate the news articles and filings. For SEC filings, we deduplicate at the section level instead of the document level. Lastly, we concatenate documents into a long string, tokenize it, and chunk the tokenized sequence into pieces of the maximum input length supported by the model to be trained. This improves the throughput of continual pre-training and reduces the training cost.
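The final concatenate-and-chunk step can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace splitting stands in for the model's real tokenizer, and the `<eos>` separator between documents is a common convention that we assume here rather than a detail confirmed by the paper.

```python
from typing import Iterable, List


def pack_and_chunk(documents: Iterable[str], max_len: int,
                   sep_token: str = "<eos>") -> List[List[str]]:
    """Concatenate documents into one token stream and split it into fixed-size chunks."""
    stream: List[str] = []
    for doc in documents:
        stream.extend(doc.split())  # stand-in for a real tokenizer
        stream.append(sep_token)    # assumed marker for the document boundary
    # Every chunk except possibly the last has exactly max_len tokens.
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


chunks = pack_and_chunk(["a b c", "d e"], max_len=4)
print(chunks)
```

Because nearly every chunk is filled to the maximum input length, almost no compute is spent on padding tokens, which is where the throughput gain comes from.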

Continual pre-training vs. fine-tuning

Most available LLMs are general purpose and lack domain-specific abilities. Domain LLMs have shown considerable performance in medical, finance, or scientific domains. For an LLM to acquire domain-specific knowledge, there are four methods: training from scratch, continual pre-training, instruction fine-tuning on domain tasks, and Retrieval Augmented Generation (RAG).

In traditional models, fine-tuning is usually used to create task-specific models for a domain. This means maintaining multiple models for multiple tasks like entity extraction, intent classification, sentiment analysis, or question answering. With the advent of LLMs, the need to maintain separate models has become obsolete by using techniques like in-context learning or prompting. This saves the effort required to maintain a stack of models for related but distinct tasks.

Intuitively, you can train LLMs from scratch with domain-specific data. Although most of the work to create domain LLMs has focused on training from scratch, it is prohibitively expensive. For example, the GPT-4 model costs over $100 million to train. These models are trained on a mix of open domain data and domain data. Continual pre-training can help models acquire domain-specific knowledge without incurring the cost of pre-training from scratch because you pre-train an existing open domain LLM on only the domain data.

With instruction fine-tuning on a task, you can’t make the model acquire domain knowledge because the LLM only acquires domain information contained in the instruction fine-tuning dataset. Unless a very large dataset for instruction fine-tuning is used, it is not enough to acquire domain knowledge. Sourcing high-quality instruction datasets is usually challenging and is the reason to use LLMs in the first place. Also, instruction fine-tuning on one task can affect performance on other tasks (as seen in this paper). However, instruction fine-tuning is more cost-effective than either of the pre-training alternatives.

The following figure compares traditional task-specific fine-tuning vs. the in-context learning paradigm with LLMs.

RAG is the most effective way of guiding an LLM to generate responses grounded in a domain. Although it can guide a model to generate responses by providing facts from the domain as auxiliary information, it doesn’t acquire the domain-specific language because the LLM is still relying on non-domain language style to generate the responses.

Continual pre-training is a middle ground between pre-training and instruction fine-tuning in terms of cost while being a strong alternative for gaining domain-specific knowledge and style. It can provide a general model over which further instruction fine-tuning on limited instruction data can be performed. Continual pre-training can be a cost-effective strategy for specialized domains where the set of downstream tasks is large or unknown and labeled instruction tuning data is limited. In other scenarios, instruction fine-tuning or RAG might be more suitable.

To learn more about fine-tuning, RAG, and model training, refer to Fine-tune a foundation model, Retrieval Augmented Generation (RAG), and Train a Model with Amazon SageMaker, respectively. For this post, we focus on efficient continual pre-training.

Methodology of efficient continual pre-training

Continual pre-training consists of the following methodology:

  • Domain-Adaptive Continual Pre-training (DACP) – In the paper Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors continually pre-train the Pythia language model suite on the financial corpus to adapt it to the finance domain. The objective is to create financial LLMs by feeding data from the whole financial domain into an open-sourced model. Because the training corpus contains all the curated datasets in the domain, the resultant model should acquire finance-specific knowledge, thereby becoming a versatile model for various financial tasks. This results in FinPythia models.
  • Task-Adaptive Continual Pre-training (TACP) – The authors pre-train the models further on labeled and unlabeled task data to tailor them for specific tasks. In certain circumstances, developers may prefer models delivering better performance on a group of in-domain tasks rather than a domain-generic model. TACP is designed as continual pre-training aiming to enhance performance on targeted tasks, without requirements for labeled data. Specifically, the authors continually pre-train the open sourced models on the task tokens (without labels).

The primary limitation of TACP lies in constructing task-specific LLMs instead of foundation LLMs, owing to the sole use of unlabeled task data for training. Although DACP uses a much larger corpus, it is prohibitively expensive. To balance these limitations, the authors propose two approaches that aim to build domain-specific foundation LLMs while preserving superior performance on target tasks:

  • Efficient Task-Similar DACP (ETS-DACP) – The authors propose selecting a subset of the financial corpus that is highly similar to the task data using embedding similarity. This subset is used for continual pre-training to make it more efficient. Specifically, the authors continually pre-train the open sourced LLM on a small corpus extracted from the financial corpus that is close to the target tasks in distribution. This can help improve task performance because we adapt the model to the distribution of task tokens, even though no labeled data is required.
  • Efficient Task-Agnostic DACP (ETA-DACP) – The authors propose using metrics like perplexity and token type entropy that don’t require task data to select samples from financial corpus for efficient continual pre-training. This approach is designed to deal with scenarios where task data is unavailable or more versatile domain models for the broader domain are preferred. The authors adopt two dimensions to select data samples that are important for obtaining domain information from a subset of pre-training domain data: novelty and diversity. Novelty, measured by the perplexity recorded by the target model, refers to the information that was unseen by the LLM before. Data with high novelty indicates novel knowledge for the LLM, and such data is viewed as more difficult to learn. This updates generic LLMs with intensive domain knowledge during continual pre-training. Diversity, on the other hand, captures the diversity of distributions of token types in the domain corpus, which has been documented as a useful feature in the research of curriculum learning on language modeling.
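The two task-agnostic metrics can be written down compactly. The sketch below is illustrative rather than the authors' code: it assumes you already have per-token natural-log probabilities from the target model for the perplexity, and a sequence of token type labels (part-of-speech tags are our assumed example) for the entropy.

```python
import math
from collections import Counter
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Novelty: perplexity computed from natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def type_entropy(token_types: Sequence[str]) -> float:
    """Diversity: Shannon entropy (in nats) of the token type distribution."""
    counts = Counter(token_types)
    total = len(token_types)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# A sample whose tokens each had probability 0.5 under the model.
print(perplexity([math.log(0.5)] * 4))
# A sample with an even mix of two hypothetical token types.
print(type_entropy(["NOUN", "VERB", "NOUN", "VERB"]))
```

Higher perplexity marks text the model finds surprising, and higher entropy marks text with a more varied mix of token types. Either score can then be used to rank or weight samples for selection.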

The following figure compares an example of ETS-DACP (left) vs. ETA-DACP (right).

We adopt two sampling schemes to actively select data points from the curated financial corpus: hard sampling and soft sampling. The former is done by first ranking the financial corpus by the corresponding metrics and then selecting the top-k samples, where k is predetermined according to the training budget. For the latter, the authors assign a sampling weight to each data point according to the metric values, and then randomly sample k data points to meet the training budget.
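The difference between the two schemes is easiest to see in code. The following is a minimal sketch, not the paper's implementation: it assumes positive metric values that can be used directly as weights, and it assumes soft sampling draws without replacement (implemented here with Efraimidis-Spirakis random keys), which this post does not specify.

```python
import random
from typing import List, Sequence


def hard_sample(scores: Sequence[float], k: int) -> List[int]:
    """Hard sampling: indices of the k highest-scoring data points."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def soft_sample(scores: Sequence[float], k: int, seed: int = 0) -> List[int]:
    """Soft sampling: k distinct indices drawn at random, favoring higher scores.

    Each point gets the key u ** (1 / weight) with u uniform in [0, 1);
    keeping the k largest keys is a weighted draw without replacement.
    """
    rng = random.Random(seed)
    keys = [rng.random() ** (1.0 / s) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: keys[i], reverse=True)
    return ranked[:k]


scores = [0.1, 0.9, 0.5, 0.7]  # hypothetical metric values
print(hard_sample(scores, k=2))
print(soft_sample(scores, k=2))
```

Hard sampling is deterministic and always keeps the extremes, whereas soft sampling still gives lower-scoring data points a chance of being selected.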

Result and analysis

The authors evaluate the resulting financial LLMs on an array of financial tasks to investigate the efficacy of continual pre-training:

  • Financial Phrase Bank – A sentiment classification task on financial news.
  • FiQA SA – An aspect-based sentiment classification task based on financial news and headlines.
  • Headline – A binary classification task on whether a headline on a financial entity contains certain information.
  • NER – A financial named entity extraction task based on credit risk assessment section of SEC reports. Words in this task are annotated with PER, LOC, ORG, and MISC.

Because financial LLMs are instruction fine-tuned, the authors evaluate models in a 5-shot setting for each task for the sake of robustness. On average, the FinPythia 6.9B outperforms Pythia 6.9B by 10% across four tasks, which demonstrates the efficacy of domain-specific continual pre-training. For the 1B model, the improvement is less pronounced, but performance still improves 2% on average.

The following figure illustrates the performance difference before and after DACP on both models.

The following figure showcases two qualitative examples generated by Pythia 6.9B and FinPythia 6.9B. For two finance-related questions regarding an investor manager and a financial term, Pythia 6.9B doesn’t understand the term or recognize the name, whereas FinPythia 6.9B generates detailed answers correctly. The qualitative examples demonstrate that continual pre-training enables the LLMs to acquire domain knowledge during the process.

The following table compares various efficient continual pre-training approaches. ETA-DACP-ppl is ETA-DACP based on perplexity (novelty), and ETA-DACP-ent is based on entropy (diversity). ETS-DACP-com is similar to DACP with data selection by averaging all three metrics. The following are a few takeaways from the results:

  • Data selection methods are efficient – They surpass standard continual pre-training with just 10% of training data. Efficient continual pre-training including Task-Similar DACP (ETS-DACP), Task-Agnostic DACP based on entropy (ETA-DACP-ent), and Task-Similar DACP based on all three metrics (ETS-DACP-com) outperforms standard DACP on average despite the fact that they are trained on only 10% of the financial corpus.
  • Task-aware data selection works the best in line with small language models research – ETS-DACP records the best average performance among all the methods and, based on all three metrics, records the second-best task performance. This suggests that using unlabeled task data is still an effective approach to boost task performance in the case of LLMs.
  • Task-agnostic data selection is a close second – ETA-DACP-ent follows the performance of the task-aware data selection approach, implying that we could still boost task performance by actively selecting high-quality samples not tied to specific tasks. This paves the way to build financial LLMs for the whole domain while achieving superior task performance.

One critical question regarding continual pre-training is whether it negatively affects the performance on non-domain tasks. The authors also evaluate the continually pre-trained model on four widely used generic tasks: ARC, MMLU, TruthfulQA, and HellaSwag, which measure the ability of question answering, reasoning, and completion. The authors find that continual pre-training does not adversely affect non-domain performance. For more details, refer to Efficient Continual Pre-training for Building Domain Specific Large Language Models.

Conclusion

This post offered insights into data collection and continual pre-training strategies for training LLMs for the financial domain. You can start training your own LLMs for financial tasks using Amazon SageMaker Training or Amazon Bedrock today.


About the Authors

Yong Xie is an applied scientist in Amazon FinTech. He focuses on developing large language models and Generative AI applications for finance.

Karan Aggarwal is a Senior Applied Scientist with Amazon FinTech with a focus on Generative AI for finance use-cases. Karan has extensive experience in time-series analysis and NLP, with particular interest in learning from limited labeled data.

Aitzaz Ahmad is an Applied Science Manager at Amazon where he leads a team of scientists building various applications of Machine Learning and Generative AI in Finance. His research interests are in NLP, Generative AI, and LLM Agents. He received his PhD in Electrical Engineering from Texas A&M University.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in financial service build machine learning solutions on AWS.

Raghvender Arni leads the Customer Acceleration Team (CAT) within AWS Industries. The CAT is a global cross-functional team of customer facing cloud architects, software engineers, data scientists, and AI/ML experts and designers that drives innovation via advanced prototyping, and drives cloud operational excellence via specialized technical expertise.

Achieve DevOps maturity with BMC AMI zAdviser Enterprise and Amazon Bedrock

In software engineering, there is a direct correlation between team performance and building robust, stable applications. The data community aims to adopt the rigorous engineering principles commonly used in software development into their own practices, which includes systematic approaches to design, development, testing, and maintenance. This requires carefully combining applications and metrics to provide complete awareness, accuracy, and control. It means evaluating all aspects of a team’s performance, with a focus on continuous improvement, and it applies just as much to mainframe as it does to distributed and cloud environments—maybe more.

This is achieved through practices like infrastructure as code (IaC) for deployments, automated testing, application observability, and complete application lifecycle ownership. Through years of research, the DevOps Research and Assessment (DORA) team has identified four key metrics that indicate the performance of a software development team:

  • Deployment frequency – How often an organization successfully releases to production
  • Lead time for changes – The amount of time it takes a commit to get into production
  • Change failure rate – The percentage of deployments causing a failure in production
  • Time to restore service – How long it takes an organization to recover from a failure in production
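To make the four definitions concrete, the following sketch computes them from a list of deployment records. The record layout and field names are hypothetical and chosen for illustration. They are not zAdviser's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # whether it caused a production failure
    restored_at: Optional[datetime] = None  # when service was restored after a failure


def _mean_hours(deltas: List[timedelta]) -> float:
    if not deltas:
        return 0.0
    return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)


def dora_metrics(deployments: List[Deployment], period_days: int) -> Dict[str, float]:
    """Compute the four DORA metrics over a reporting period."""
    failures = [d for d in deployments if d.failed]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_for_changes_hours": _mean_hours(
            [d.deployed_at - d.committed_at for d in deployments]),
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "time_to_restore_hours": _mean_hours(
            [d.restored_at - d.deployed_at for d in failures if d.restored_at]),
    }


t0 = datetime(2024, 1, 1)
metrics = dora_metrics(
    [
        Deployment(t0, t0 + timedelta(hours=10)),
        Deployment(t0, t0 + timedelta(hours=20), failed=True,
                   restored_at=t0 + timedelta(hours=22)),
    ],
    period_days=10,
)
print(metrics)
```

Means are used here for simplicity. Medians or percentiles may be more robust when a few slow changes dominate the data.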

These metrics provide a quantitative way to measure the effectiveness and efficiency of DevOps practices. Although much of the focus around analysis of DevOps is on distributed and cloud technologies, the mainframe still maintains a unique and powerful position, and it can use the DORA 4 metrics to further its reputation as the engine of commerce.

This blog post discusses how BMC Software added AWS Generative AI capabilities to its product BMC AMI zAdviser Enterprise. The zAdviser uses Amazon Bedrock to provide summarization, analysis, and recommendations for improvement based on the DORA metrics data.

Challenges of tracking DORA 4 metrics

Tracking DORA 4 metrics means putting the numbers together and placing them on a dashboard. However, measuring productivity is essentially measuring the performance of individuals, which can make them feel scrutinized. This situation might necessitate a shift in organizational culture to focus on collective achievements and emphasize that automation tools enhance the developer experience.

It’s also vital to avoid focusing on irrelevant metrics or excessively tracking data. The essence of DORA metrics is to distill information into a core set of key performance indicators (KPIs) for evaluation. Mean time to restore (MTTR) is often the simplest KPI to track—most organizations use tools like BMC Helix ITSM or others that record events and issue tracking.

Capturing lead time for changes and change failure rate can be more challenging, especially on mainframes. Lead time for changes and change failure rate KPIs aggregate data from code commits, log files, and automated test results. Using a Git-based SCM pulls these insights together seamlessly. Mainframe teams using BMC’s Git-based DevOps platform, AMI DevX, can collect this data as easily as distributed teams can.

Solution overview

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

BMC AMI zAdviser Enterprise provides a wide range of DevOps KPIs to optimize mainframe development and enable teams to proactively identify and resolve issues. Using machine learning, AMI zAdviser monitors mainframe build, test, and deploy functions across DevOps tool chains and then offers AI-led recommendations for continuous improvement. In addition to capturing and reporting on development KPIs, zAdviser captures data on how the BMC DevX products are adopted and used. This includes the number of programs that were debugged, the outcome of testing efforts using the DevX testing tools, and many other data points. These additional data points can provide deeper insight into the development KPIs, including the DORA metrics, and may be used in future generative AI efforts with Amazon Bedrock.

The following architecture diagram shows the final implementation of zAdviser Enterprise utilizing generative AI to provide summarization, analysis, and recommendations for improvement based on the DORA metrics KPI data.

Architecture Diagram

The solution workflow includes the following steps:

  1. Create the aggregation query to retrieve the metrics from Elasticsearch.
  2. Extract the stored mainframe metrics data from zAdviser, which is hosted in Amazon Elastic Compute Cloud (Amazon EC2) and deployed in AWS.
  3. Aggregate the data retrieved from Elasticsearch and form the prompt for the generative AI Amazon Bedrock API call.
  4. Pass the generative AI prompt to Amazon Bedrock (using Anthropic’s Claude 2 model on Amazon Bedrock).
  5. Store the response from Amazon Bedrock (an HTML-formatted document) in Amazon Simple Storage Service (Amazon S3).
  6. Trigger the KPI email process via AWS Lambda:
    1. The HTML-formatted email is extracted from Amazon S3 and added to the body of the email.
    2. The PDF for customer KPIs is extracted from zAdviser and attached to the email.
    3. The email is sent to subscribers.
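Steps 3 and 4 of the workflow can be sketched with the AWS SDK for Python (Boto3). This is an illustrative outline only: the prompt wording and function names are hypothetical (zAdviser's actual prompt is not shown in this post), and the request body assumes the text-completions format used by Anthropic's Claude 2 on Amazon Bedrock. Running the second function requires Boto3, AWS credentials, and model access.

```python
import json
from typing import Dict


def build_claude_request(kpis: Dict[str, float], max_tokens: int = 1000) -> str:
    """Form the prompt from aggregated KPI values and wrap it in a JSON request body."""
    kpi_lines = "\n".join(f"- {name}: {value}" for name, value in kpis.items())
    prompt = (
        "\n\nHuman: Here are DORA metric KPIs for a development team:\n"
        f"{kpi_lines}\n"
        "Summarize them, analyze them, and recommend improvements. "
        "Respond as an HTML document."
        "\n\nAssistant:"
    )
    return json.dumps({"prompt": prompt, "max_tokens_to_sample": max_tokens})


def summarize_with_bedrock(body: str, model_id: str = "anthropic.claude-v2") -> str:
    """Send the request to Amazon Bedrock and return the generated text."""
    import boto3  # imported lazily so the prompt builder works without AWS access

    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=model_id,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]


# Hypothetical aggregated values; only numeric KPIs are placed in the prompt.
request_body = build_claude_request({"deployment_frequency": 1.5, "change_failure_rate": 0.08})
print(request_body)
```

Keeping the prompt builder separate from the API call makes it straightforward to check that only numerical KPI values and instructions leave your environment.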

The following screenshot shows the LLM summarization of DORA metrics generated using Amazon Bedrock and sent as an email to the customer, with a PDF attachment that contains the DORA metrics KPI dashboard report by zAdviser.

Result Summarization

Key takeaways

In this solution, you don’t need to worry about your data being exposed on the internet when sent to an AI client. The API call to Amazon Bedrock doesn’t contain any personally identifiable information (PII) or any data that could identify a customer. The only data transmitted consists of numerical values in the form of the DORA metric KPIs and instructions for the generative AI’s operations. Importantly, the generative AI client does not retain, learn from, or cache this data.

The zAdviser engineering team implemented this feature within a short time span. The rapid progress was facilitated by zAdviser’s substantial investment in AWS services and, importantly, the ease of using Amazon Bedrock via API calls. This underscores the transformative power of generative AI technology embodied in the Amazon Bedrock API. This API, equipped with the industry-specific knowledge repository zAdviser Enterprise and customized with continuously collected organization-specific DevOps metrics, demonstrates the potential of AI in this field.

Generative AI has the potential to lower the barrier to entry to build AI-driven organizations. Large language models (LLMs) in particular can bring tremendous value to enterprises seeking to explore and use unstructured data. Beyond chatbots, LLMs can be used in a variety of tasks, such as classification, editing, and summarization.

Conclusion

This post discussed the transformational impact of generative AI technology in the form of Amazon Bedrock APIs equipped with the industry-specific knowledge that BMC zAdviser possesses, tailored with organization-specific DevOps metrics collected on an ongoing basis.

Check out the BMC website to learn more and set up a demo.


About the Authors

Sunil Bemarkar is a Sr. Partner Solutions Architect at Amazon Web Services. He works with various Independent Software Vendors (ISVs) and Strategic customers across industries to accelerate their digital transformation journey and cloud adoption.

Vij Balakrishna is a Senior Partner Development manager at Amazon Web Services. She helps independent software vendors (ISVs) across industries to accelerate their digital transformation journey.

Spencer Hallman is the Lead Product Manager for the BMC AMI zAdviser Enterprise. Previously, he was the Product Manager for BMC AMI Strobe and BMC AMI Ops Automation for Batch Thruput. Prior to Product Management, Spencer was the Subject Matter Expert for Mainframe Performance. His diverse experience over the years has also included programming on multiple platforms and languages as well as working in the Operations Research field. He has a Master of Business Administration with a concentration in Operations Research from Temple University and a Bachelor of Science in Computer Science from the University of Vermont. He lives in Devon, PA and when he’s not attending virtual meetings, enjoys walking his dogs, riding his bike and spending time with his family.

Fine-tune your Amazon Titan Image Generator G1 model using Amazon Bedrock model customization

Amazon Titan Image Generator G1 is a cutting-edge text-to-image model, available via Amazon Bedrock, that is able to understand prompts describing multiple objects in various contexts and capture these relevant details in the images it generates. It is available in the US East (N. Virginia) and US West (Oregon) AWS Regions and can perform advanced image editing tasks such as smart cropping, in-painting, and background changes. However, users would like to adapt the model to unique characteristics in custom datasets that the model is not already trained on. Custom datasets can include highly proprietary data that is consistent with your brand guidelines or specific styles such as a previous campaign. To address these use cases and generate fully personalized images, you can fine-tune Amazon Titan Image Generator with your own data using custom models for Amazon Bedrock.

From generating images to editing them, text-to-image models have broad applications across industries. They can enhance employee creativity and provide the ability to imagine new possibilities simply with textual descriptions. For example, they can aid design and floor planning for architects and allow faster innovation by providing the ability to visualize various designs without the manual process of creating them. Similarly, they can aid in design across various industries such as manufacturing, fashion design in retail, and game design by streamlining the generation of graphics and illustrations. Text-to-image models also enhance your customer experience by allowing for personalized advertising as well as interactive and immersive visual chatbots in media and entertainment use cases.

In this post, we guide you through the process of fine-tuning the Amazon Titan Image Generator model to learn two new categories: Ron the dog and Smila the cat, our favorite pets. We discuss how to prepare your data for the model fine-tuning task and how to create a model customization job in Amazon Bedrock. Finally, we show you how to test and deploy your fine-tuned model with Provisioned Throughput.

Ron the dog Smila the cat

Evaluating model capabilities before fine-tuning a job

Foundation models are trained on large amounts of data, so it is possible that your model will work well enough out of the box. That’s why it’s good practice to check if you actually need to fine-tune your model for your use case or if prompt engineering is sufficient. Let’s try to generate some images of Ron the dog and Smila the cat with the base Amazon Titan Image Generator model, as shown in the following screenshots.

As expected, the out-of-the-box model does not know Ron and Smila yet, and the generated outputs show different dogs and cats. With some prompt engineering, we can provide more details to get closer to the look of our favorite pets.

Although the generated images are more similar to Ron and Smila, we see that the model is not able to reproduce the full likeness of them. Let’s now start a fine-tuning job with the photos from Ron and Smila to get consistent, personalized outputs.

Fine-tuning Amazon Titan Image Generator

Amazon Bedrock provides you with a serverless experience for fine-tuning your Amazon Titan Image Generator model. You only need to prepare your data and select your hyperparameters, and AWS will handle the heavy lifting for you.

When you use the Amazon Titan Image Generator model to fine-tune, a copy of this model is created in the AWS model development account, owned and managed by AWS, and a model customization job is created. This job then accesses the fine-tuning data from a VPC, and the Amazon Titan model has its weights updated. The new model is then saved to an Amazon Simple Storage Service (Amazon S3) bucket located in the same model development account as the pre-trained model. It can now be used for inference only by your account and is not shared with any other AWS account. When running inference, you access this model via a provisioned capacity compute or directly, using batch inference for Amazon Bedrock. Independently from the inference modality chosen, your data remains in your account and is not copied to any AWS owned account or used to improve the Amazon Titan Image Generator model.

The following diagram illustrates this workflow.

Data privacy and network security

Your fine-tuning data, including prompts, and your custom models remain private in your AWS account. They are not shared or used for model training or service improvements, and aren’t shared with third-party model providers. All the data used for fine-tuning is encrypted in transit and at rest. The data remains in the same Region where the API call is processed. You can also use AWS PrivateLink to create a private connection between the AWS account where your data resides and the VPC.

Data preparation

Before you can create a model customization job, you need to prepare your training dataset. The format of your training dataset depends on the type of customization job you are creating (fine-tuning or continued pre-training) and the modality of your data (text-to-text, text-to-image, or image-to-embedding). For the Amazon Titan Image Generator model, you need to provide the images that you want to use for the fine-tuning and a caption for each image. Amazon Bedrock expects your images to be stored on Amazon S3 and the pairs of images and captions to be provided in a JSONL format with multiple JSON lines.

Each JSON line is a sample containing an image-ref, the S3 URI for an image, and a caption that includes a textual prompt for the image. Your images must be in JPEG or PNG format. The following code shows an example of the format:

{"image-ref": "s3://bucket/path/to/image001.png", "caption": "<prompt text>"}
{"image-ref": "s3://bucket/path/to/image002.png", "caption": "<prompt text>"}
{"image-ref": "s3://bucket/path/to/image003.png", "caption": "<prompt text>"}

Because “Ron” and “Smila” are names that could also be used in other contexts, such as a person’s name, we add the identifiers “Ron the dog” and “Smila the cat” when creating the prompt to fine-tune our model. Although it’s not a requirement for the fine-tuning workflow, this additional information provides more contextual clarity for the model when it is being customized for the new classes and avoids confusing “Ron the dog” with a person called Ron and “Smila the cat” with the city Smila in Ukraine. Using this logic, the following images show a sample of our training dataset.

Ron the dog laying on a white dog bed | Ron the dog sitting on a tile floor | Ron the dog laying on a car seat
Smila the cat lying on a couch | Smila the cat staring at the camera laying on a couch | Smila the cat laying in a pet carrier

When transforming our data to the format expected by the customization job, we get the following sample structure:

{"image-ref": "<S3_BUCKET_URL>/ron_01.jpg", "caption": "Ron the dog laying on a white dog bed"}
{"image-ref": "<S3_BUCKET_URL>/ron_02.jpg", "caption": "Ron the dog sitting on a tile floor"}
{"image-ref": "<S3_BUCKET_URL>/ron_03.jpg", "caption": "Ron the dog laying on a car seat"}
{"image-ref": "<S3_BUCKET_URL>/smila_01.jpg", "caption": "Smila the cat lying on a couch"}
{"image-ref": "<S3_BUCKET_URL>/smila_02.jpg", "caption": "Smila the cat sitting next to the window next to a statue cat"}
{"image-ref": "<S3_BUCKET_URL>/smila_03.jpg", "caption": "Smila the cat lying on a pet carrier"}
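As a minimal sketch of how a file in this shape could be produced, the following Python snippet serializes (file name, caption) pairs into JSON lines. The bucket name `my-example-bucket`, the `train` prefix, and the helper name `build_manifest_lines` are hypothetical and used only for illustration.

```python
import json


def build_manifest_lines(bucket_prefix, captions):
    """Return one JSON line per (file name, caption) pair."""
    prefix = bucket_prefix.rstrip("/")
    lines = []
    for file_name, caption in captions:
        # Key order matters only for readability; it mirrors the format above.
        record = {"image-ref": f"{prefix}/{file_name}", "caption": caption}
        lines.append(json.dumps(record))
    return lines


# Hypothetical bucket and prefix; replace with your own S3 location.
samples = [
    ("ron_01.jpg", "Ron the dog laying on a white dog bed"),
    ("smila_01.jpg", "Smila the cat lying on a couch"),
]
manifest_lines = build_manifest_lines("s3://my-example-bucket/train", samples)

# JSONL is simply these lines joined by newlines.
manifest_text = "\n".join(manifest_lines) + "\n"
```

The resulting `manifest_text` can then be written to a `.jsonl` file and uploaded to Amazon S3 alongside the images.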

After we have created our JSONL file, we need to store it in an S3 bucket to start our customization job. Amazon Titan Image Generator G1 fine-tuning jobs work with 5–10,000 images. For the example discussed in this post, we use 60 images: 30 of Ron the dog and 30 of Smila the cat. In general, providing more varieties of the style or class you are trying to learn will improve the accuracy of your fine-tuned model. However, the more images you use for fine-tuning, the more time the fine-tuning job takes to complete. The number of images used also influences the price of your fine-tuning job. Refer to Amazon Bedrock Pricing for more information.
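Before uploading, it can be useful to check the file locally against the constraints described above. The following is a rough sketch, not an official validator: it checks the 5–10,000 sample range, the two expected fields, and the image type. Inferring the image type from the file extension is our own simplifying assumption, and the function name `validate_manifest` is hypothetical.

```python
import json

# Assumption: JPEG/PNG images are recognized by their file extension.
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png")
MIN_IMAGES, MAX_IMAGES = 5, 10_000


def validate_manifest(jsonl_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    lines = [line for line in jsonl_text.splitlines() if line.strip()]
    if not MIN_IMAGES <= len(lines) <= MAX_IMAGES:
        problems.append(
            f"expected {MIN_IMAGES} to {MAX_IMAGES} samples, found {len(lines)}"
        )
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        image_ref = str(record.get("image-ref", ""))
        caption = str(record.get("caption", ""))
        if not image_ref.startswith("s3://"):
            problems.append(f"line {number}: image-ref is not an S3 URI")
        if not image_ref.lower().endswith(ALLOWED_EXTENSIONS):
            problems.append(f"line {number}: image is not a JPEG or PNG file")
        if not caption.strip():
            problems.append(f"line {number}: caption is empty")
    return problems
```

An empty result only means that none of these basic checks failed; the customization job performs its own validation when it starts.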

Fine-tuning Amazon Titan Image Generator

Now that we have our training data ready, we can begin a new customization job. You can do this through either the Amazon Bedrock console or the APIs. To use the Amazon Bedrock console, complete the following steps:

  1. On the Amazon Bedrock console, choose Custom models in the navigation pane.
  2. On the Customize model menu, choose Create fine-tuning job.
  3. For Fine-tuned model name, enter a name for your new model.
  4. For Job configuration, enter a name for the training job.
  5. For Input data, enter the S3 path of the input data.
  6. In the Hyperparameters section, provide values for the following:
    1. Number of steps – The number of times the model is exposed to each batch.
    2. Batch size – The number of samples processed before updating the model parameters.
    3. Learning rate – The rate at which the model parameters are updated after each batch.

The choice of these parameters depends on a given dataset. As a general guideline, we recommend you start by fixing the batch size at 8 and the learning rate at 1e-5, and setting the number of steps according to the number of images used, as detailed in the following table.

Number of images provided   | 8     | 32    | 64    | 1,000  | 10,000
Number of steps recommended | 1,000 | 4,000 | 8,000 | 10,000 | 12,000

If the results of your fine-tuning job are not satisfactory, consider increasing the number of steps if you don’t observe any signs of the style in generated images, and decreasing the number of steps if you observe the style in the generated images but with artifacts or blurriness. If the fine-tuned model fails to learn the unique style in your dataset even after 40,000 steps, consider increasing the batch size or the learning rate.

  7. In the Output data section, enter the S3 output path where the validation outputs, including the periodically recorded validation loss and accuracy metrics, are stored.
  8. In the Service access section, generate a new AWS Identity and Access Management (IAM) role or choose an existing IAM role with the necessary permissions to access your S3 buckets.

This authorization enables Amazon Bedrock to retrieve input and validation datasets from your designated bucket and store validation outputs seamlessly in your S3 bucket.

  9. Choose Fine-tune model.

With the correct configurations set, Amazon Bedrock will now train your custom model.
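The same job can also be started through the APIs. The following is a hedged sketch using the boto3 `create_model_customization_job` call. Several details are assumptions you should verify against the Amazon Bedrock documentation before use: the base model identifier, the hyperparameter key names (`stepCount`, `batchSize`, `learningRate`), and the rule of rounding up to the next row of the steps table (the post gives no rule for values between rows). The function names and every argument value are hypothetical.

```python
# (number of images, recommended steps), taken from the guideline table.
RECOMMENDED_STEPS = [
    (8, 1_000), (32, 4_000), (64, 8_000), (1_000, 10_000), (10_000, 12_000),
]


def recommended_steps(num_images):
    """Starting point only. Assumption: round up to the next table row."""
    for images, steps in RECOMMENDED_STEPS:
        if num_images <= images:
            return steps
    return RECOMMENDED_STEPS[-1][1]


def build_job_request(job_name, model_name, role_arn,
                      train_s3_uri, output_s3_uri, num_images):
    """Assemble keyword arguments for create_model_customization_job."""
    return {
        "jobName": job_name,
        "customModelName": model_name,
        "roleArn": role_arn,
        # Assumed model identifier; confirm it in the Amazon Bedrock console.
        "baseModelIdentifier": "amazon.titan-image-generator-v1:0",
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {"s3Uri": train_s3_uri},
        "outputDataConfig": {"s3Uri": output_s3_uri},
        # Hyperparameters are passed as strings; key names are assumed.
        "hyperParameters": {
            "stepCount": str(recommended_steps(num_images)),
            "batchSize": "8",
            "learningRate": "0.00001",
        },
    }


def start_job(request):
    """Submit the job; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    bedrock = boto3.client("bedrock")
    return bedrock.create_model_customization_job(**request)["jobArn"]
```

For the 60 images used in this post, `recommended_steps(60)` returns 8,000 as a starting value, which you would then adjust based on the quality of the generated images.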

Deploy the fine-tuned Amazon Titan Image Generator with Provisioned Throughput

After you create a custom model, Provisioned Throughput lets you allocate a predetermined, fixed rate of processing capacity to it. This allocation provides a consistent level of performance and capacity for production workloads. The second advantage of Provisioned Throughput is cost control, because standard token-based pricing with on-demand inference mode can be difficult to predict at large scales.

When the fine-tuning of your model is complete, the model appears on the Custom models page on the Amazon Bedrock console.

To purchase Provisioned Throughput, select the custom model that you just fine-tuned and choose Purchase Provisioned Throughput.

This prepopulates the selected model for which you want to purchase Provisioned Throughput. For testing your fine-tuned model before deployment, set model units to a value of 1 and set the commitment term to No commitment. This quickly lets you start testing your models with your custom prompts and check if the training is adequate. Moreover, when new fine-tuned models and new versions are available, you can update the Provisioned Throughput as long as you update it with other versions of the same model.
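The testing setup described here (one model unit, no commitment) could be requested programmatically along the following lines. This is a sketch under assumptions: it uses the boto3 `create_provisioned_model_throughput` call and leaves out `commitmentDuration` to request the no-commitment term. The function names and argument values are hypothetical.

```python
def build_throughput_request(custom_model_arn, provisioned_name, model_units=1):
    """Assemble keyword arguments for create_provisioned_model_throughput."""
    # commitmentDuration is intentionally omitted to request no commitment.
    return {
        "provisionedModelName": provisioned_name,
        "modelId": custom_model_arn,
        "modelUnits": model_units,
    }


def purchase_throughput(request):
    """Submit the purchase; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    bedrock = boto3.client("bedrock")
    response = bedrock.create_provisioned_model_throughput(**request)
    return response["provisionedModelArn"]
```

The returned ARN identifies the provisioned model and is what you pass as the model ID when sending inference requests.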

Fine-tuning results

For our task of customizing the model on Ron the dog and Smila the cat, experiments showed that the best hyperparameters were 5,000 steps with a batch size of 8 and a learning rate of 1e-5.

The following are some examples of the images generated by the customized model.

Ron the dog wearing a superhero cape | Ron the dog on the moon | Ron the dog in a swimming pool with sunglasses
Smila the cat on the snow | Smila the cat in black and white staring at the camera | Smila the cat wearing a Christmas hat
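Images like these can be requested from the provisioned model with the Amazon Bedrock runtime API. The sketch below builds a text-to-image request body and decodes the base64-encoded images in the response. The image size, `cfgScale`, and `seed` values are illustrative choices of ours, not settings reported in this post, and the function names are hypothetical.

```python
import base64
import json


def build_image_request(prompt, seed=0):
    """Return the JSON body for a Titan Image Generator text-to-image call."""
    return json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": prompt},
        # Illustrative generation settings; tune them for your use case.
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "height": 1024,
            "width": 1024,
            "cfgScale": 8.0,
            "seed": seed,
        },
    })


def decode_images(response_body):
    """Turn the base64 strings in the response into raw image bytes."""
    payload = json.loads(response_body)
    return [base64.b64decode(image) for image in payload.get("images", [])]


def generate(provisioned_model_arn, prompt):
    """Invoke the provisioned model; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    runtime = boto3.client("bedrock-runtime")
    response = runtime.invoke_model(
        modelId=provisioned_model_arn,
        body=build_image_request(prompt),
        contentType="application/json",
        accept="application/json",
    )
    return decode_images(response["body"].read())
```

A prompt such as "Ron the dog wearing a superhero cape" uses the same identifier that appeared in the training captions, which is what ties the request to the fine-tuned class.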

Conclusion

In this post, we discussed when to use fine-tuning instead of engineering your prompts for better-quality image generation. We showed how to fine-tune the Amazon Titan Image Generator model and deploy the custom model on Amazon Bedrock. We also provided general guidelines on how to prepare your data for fine-tuning and set optimal hyperparameters for more accurate model customization.

As a next step, you can adapt the following example to your use case to generate hyper-personalized images using Amazon Titan Image Generator.


About the Authors

Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat Smila, and spending time with her family someplace warm.

Dani Mitchell is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is focused on computer vision use cases and helping customers across EMEA accelerate their ML journey.

Bharathi Srinivasan is a Data Scientist at AWS Professional Services, where she loves to build cool things on Amazon Bedrock. She is passionate about driving business value from machine learning applications, with a focus on responsible AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.

Achin Jain is an Applied Scientist with the Amazon Artificial General Intelligence (AGI) team. He has expertise in text-to-image models and is focused on building the Amazon Titan Image Generator.
