Use everyday language to search and retrieve data with Mixtral 8x7B on Amazon SageMaker JumpStart

With the widespread adoption of generative artificial intelligence (AI) solutions, organizations are trying to use these technologies to make their teams more productive. One exciting use case is enabling natural language interactions with relational databases. Rather than writing complex SQL queries, you can describe in plain language what data you want to retrieve or manipulate. The large language model (LLM) can understand the intent behind your natural language input, combine it with the structure of your data, and automatically generate the appropriate SQL code. This allows analysts to be more productive by not having to context switch into rigid query syntax, while also opening up relational databases to less technical users.

In this post, we show you how to set up and deploy a solution to chat with your databases using natural language, allowing users to gain insights into their data without writing any code or SQL queries.

Benefits of text-to-SQL generative AI and the Mixtral 8x7B model

Consider Michelle, a business analyst responsible for preparing weekly sales reports by running complex SQL queries on the company's data warehouse to aggregate numbers by product, region, and time period. In the past, this manual process took 2–3 hours per week of working with the analyst team to write these queries by hand. Now with text-to-SQL generative AI, Michelle simply describes the report she needs in plain English, such as “Show total revenue last week for shoes in the Western region grouped by sub-category.” The AI assistant automatically generates the required SQL query, runs it on the data warehouse, and returns a formatted report in seconds.

By eliminating the SQL bottleneck, Michelle saves hours per week, now spent on more impactful analysis instead of query writing. She can iterate faster and answer questions on demand. Other business users like Michelle gain similar productivity benefits from this conversational access to relational data. The generative AI tool essentially turns self-service analytics aspirations into reality by allowing business teams to leave the SQL to the machines.

For this implementation, Mixtral 8x7B MoE was used. Mixtral 8x7B is a state-of-the-art Sparse Mixture of Experts (MoE) foundation model released by Mistral AI. It supports multiple use cases such as text summarization, classification, text generation, and code generation. It is an 8x model, which means it contains eight distinct groups of parameters. The model has about 45 billion total parameters and supports a context length of 32,000 tokens. MoE is a type of neural network architecture that consists of multiple “experts,” where each expert is a neural network. In the context of transformer models, MoE replaces some feed-forward layers with sparse MoE layers. These layers have a certain number of experts, and a router network selects which experts process each token at each layer. MoE models enable more compute-efficient and faster inference compared to dense models. Compared to traditional LLMs, Mixtral 8x7B offers the advantage of faster decoding at the speed of a smaller parameter-dense model despite containing more parameters. It also outperforms other open-access models on certain benchmarks and supports a longer context length.
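
To make the MoE idea concrete, the following is a small conceptual sketch in PyTorch of a sparse MoE layer with top-2 routing. It illustrates the general technique only, not Mixtral's actual implementation; the dimensions and routing details are simplified assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Conceptual sparse MoE layer: a router picks the top-k experts per token."""
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

Because only the top-k experts run for each token, the compute per token scales with the active experts rather than the total parameter count, which is why Mixtral decodes at roughly the speed of a much smaller dense model.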

You can currently deploy Mixtral 8x7B on Amazon SageMaker JumpStart with one click. Amazon SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. Instead of having to manually integrate, optimize, and configure each foundation model yourself, SageMaker JumpStart handles those complex tasks for you. With just a few clicks, you can deploy state-of-the-art models from Hugging Face, Cohere, AI21 Labs, Stability AI, and more using optimized containers and SageMaker endpoints. SageMaker JumpStart eliminates the heavy lifting involved in foundation model deployment. You get access to a huge catalog of prebuilt models that you can quickly put to use for inference. It’s a scalable, cost-effective way to implement powerful AI solutions without machine learning (ML) expertise.

Solution overview

The following diagram illustrates the solution architecture.

At a high level, the overall solution consists of three core components: the Mixtral 8x7B Instruct model hosted on a SageMaker endpoint, the Amazon Redshift database that stores the relational data, and the orchestration code that ties them together.

The end-to-end flow is as follows:

  1. The user asks a natural language question, which is passed to the Mixtral 8x7B Instruct model, hosted in SageMaker.
  2. The LLM analyzes the question and uses the schema fetched from the connected Amazon Redshift database to generate a SQL query.
  3. The SQL query is run against the database. In case of an error, a retry workflow is run.
  4. Tabular results received are passed back to the LLM to interpret and convert them into a natural language response to the user’s original question.

Prerequisites

To launch an endpoint to host Mixtral 8x7B from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint usage. You can request service quota increases through the AWS Management Console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.

To follow along with this example, you also need access to a relational data source. Amazon Redshift is used as the primary data source in this post with the TICKIT database. This database helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts. In particular, analysts can identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. You can also experiment with other AWS data sources like Amazon RDS, Athena, or your own relational databases. Make sure to have the connection details for your data source available, such as database URL, user name, and password.

To follow the demo using Amazon Redshift, you first need to set up a Redshift cluster if you don’t already have one. Use the Amazon Redshift console or AWS CLI to launch a cluster with your desired node type and number of nodes. When the cluster is available, create a new database and tables in it to hold your sample relational data. You can load data from Amazon Simple Storage Service (Amazon S3) or directly insert rows. When storing data in Amazon S3, make sure that all public access is blocked and the data is encrypted at rest and in transit. For more information, refer to Security best practices for Amazon S3. Finally, make sure to note the cluster endpoint, database name, and credentials to connect. With a Redshift cluster provisioned and loaded with data, you have a relational backend ready to pair with the natural language interface.

To test that you successfully added data to your Redshift cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose the cluster you want to query.
  3. Navigate to the Query Editor tab to open the query editor.
  4. Run the following sample queries or write your own SQL queries:
    • Find total sales on a given date:
      SELECT sum(qtysold)
      FROM sales, date
      WHERE sales.dateid = date.dateid AND caldate = '2008-01-05';

    • Find top 10 buyers:
      SELECT firstname, lastname, total_quantity
      FROM (SELECT buyerid, sum(qtysold) total_quantity 
      FROM sales GROUP BY buyerid ORDER BY total_quantity desc limit 10) Q, users
      WHERE Q.buyerid = userid ORDER BY Q.total_quantity desc;

The query editor allows saving, scheduling, and sharing queries. You can also view query plans, inspect run details, and monitor query performance.

Implement the solution

The code consists of a number of functions that are invoked by the logic shown in the solution diagram. We show you the relevant code blocks in this breakdown that match with the diagram. You can see the complete code for the solution in the GitHub repository.

To implement this solution, complete the following steps:

  1. Set up a Redshift cluster. For this post, we use an RA3 type cluster.
  2. Load the TICKIT sales dataset into the Redshift cluster. For instructions, see Load data from Amazon S3 to Amazon Redshift.
  3. To confirm that Amazon Redshift access is private and restricted only to your VPC, refer to the steps in Enable private access to Amazon Redshift from your client applications in another VPC.
  4. Set up a SageMaker domain, making sure it has the appropriate permissions to interact with Amazon Redshift.
  5. Clone the following GitHub repository into SageMaker Studio Classic.
  6. The first step is to deploy the Mixtral 8x7B Instruct SageMaker endpoint. We use the default ml.g5.48xlarge instance. Make sure that your service quota for ml.g5.48xlarge for endpoint usage is at least 1.
    # Note this requires an ml.g5.48xlarge instance.
    from sagemaker.jumpstart.model import JumpStartModel

    model_id = "huggingface-llm-mixtral-8x7b-instruct"
    model = JumpStartModel(model_id=model_id)
    # MIXTRAL_ENDPOINT is the endpoint name you choose for the deployment
    predictor = model.deploy(endpoint_name=MIXTRAL_ENDPOINT)
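
After the endpoint is in service, you can send a quick test prompt to confirm it responds. The following is a minimal sketch that assumes the standard JumpStart text generation payload format; parameter names such as max_new_tokens may vary by model version.

    # Quick smoke test of the deployed endpoint (payload format assumed from the
    # standard JumpStart text generation interface)
    response = predictor.predict({
        "inputs": "Write a SQL query that returns the current date.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.1},
    })
    print(response)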

  7. Set up the connectivity to the Redshift cluster. Make sure to replace these placeholders with your Redshift identifiers. For security purposes, you should have the credentials secured using AWS Secrets Manager. For instructions, see Enhance your security posture by storing Amazon Redshift admin credentials without human intervention using AWS Secrets Manager integration.
    redshift_client = boto3.client('redshift-data')
    CLUSTER_IDENTIFIER = 'redshift-cluster-1'
    DATABASE = 'dev'
    DB_USER = 'awsuser'
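
For example, rather than hard-coding the database user, you could resolve the credentials at runtime from Secrets Manager. The following is a minimal sketch; the secret name and JSON key are illustrative placeholders, not values from the solution code.

    import json
    import boto3

    secrets_client = boto3.client('secretsmanager')
    # 'redshift/tickit-credentials' and the 'username' key are illustrative names
    secret = secrets_client.get_secret_value(SecretId='redshift/tickit-credentials')
    DB_USER = json.loads(secret['SecretString'])['username']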

  8. Set up the natural language question and the prompt parameters for the model:
    prompt = "What are the top five seller names in San Diego, based on the number of tickets sold in 2008?"
    
    params={'sql-len':700,'text-token':500,'tables':tables,'db':schm,'temp':0.01,
    'model_id':'mixtral','prompt':prompt}
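
The tables and schm values referenced in params identify the schema and the tables the model is allowed to query. For the TICKIT dataset they could look like the following sketch, where the schema name is an assumption based on a default setup.

    # TICKIT tables loaded into the cluster; 'public' is an assumed schema name
    schm = 'public'
    tables = ['users', 'venue', 'category', 'date', 'event', 'listing', 'sales']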

The Redshift cluster is queried to generate the relevant database schema and example records, as shown in Step 2:

%%time
ress=redshift_qna(params)

# The redshift_qna helper invoked above (excerpt):
def redshift_qna(params):
    """
    Execute a Q&A process for generating SQL queries based on user questions.
    Args:
        params (dict): A dictionary containing parameters including table name, database name, prompt, etc.
    Returns:
        tuple: A tuple containing the response, generated SQL statement, and query output.
    """
    sql1=f"SELECT table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type FROM information_schema.columns WHERE table_schema='{params['db']}'"
    sql2=[]
    for table in params['tables']:
        sql2.append(f"SELECT * from dev.{params['db']}.{table} LIMIT 3")
    sqls=[sql1]+sql2

    question=params['prompt']
    results=execute_query_with_pagination(sqls, CLUSTER_IDENTIFIER, DATABASE, DB_USER)

    # Build the schema description and sample rows that are passed to the LLM prompt
    col_names=results[0].split('\n')[0]
    observations="\n".join(sorted(results[0].split('\n')[1:])).strip()
    params['schema']=f"{col_names}\n{observations}"
    params['sample']=''
    for examples in results[1:]:
        params['sample']+=f"{examples}\n\n"
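
The execute_query_with_pagination helper is part of the solution code in the GitHub repository. As a rough sketch of how such a helper could run a single statement through the Redshift Data API (error handling and pagination omitted, function name illustrative):

import time
import boto3

redshift_client = boto3.client('redshift-data')

def run_single_statement(sql, cluster_identifier, database, db_user):
    """Sketch: run one statement via the Redshift Data API and return CSV-style text."""
    resp = redshift_client.execute_statement(
        ClusterIdentifier=cluster_identifier, Database=database, DbUser=db_user, Sql=sql
    )
    statement_id = resp['Id']
    # Poll until the statement completes (a real helper would also surface failures)
    while redshift_client.describe_statement(Id=statement_id)['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
        time.sleep(1)
    result = redshift_client.get_statement_result(Id=statement_id)
    header = ','.join(col['name'] for col in result['ColumnMetadata'])
    rows = [','.join(str(list(field.values())[0]) for field in record) for record in result['Records']]
    return '\n'.join([header] + rows)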

The generated SQL query is run on the Redshift cluster (Steps 6–8):

q_s=query_llm(prompts,200)
sql_pattern = re.compile(r'<sql>(.*?)(?:</sql>|$)', re.DOTALL)
sql_match = re.search(sql_pattern, q_s)
q_s = sql_match.group(1)
print(f" FIRST ATTEMPT SQL:\n{q_s}")
output, q_s=single_execute_query(q_s, CLUSTER_IDENTIFIER, DATABASE, DB_USER,question)

# The single_execute_query helper invoked above (excerpt):
def single_execute_query(sql_query, cluster_identifier, database, db_user, question):
    """
    Execute a single SQL query on an Amazon Redshift cluster and process the result.

    Args:
        sql_query (str): The SQL query to execute.
        cluster_identifier (str): The identifier of the Redshift cluster.
        database (str): The name of the database.
        db_user (str): The username used to authenticate with the Redshift cluster.
        question (str): A descriptive label or question associated with the query.

    Returns:
        pandas.DataFrame: DataFrame containing the processed result of the SQL query.

    """
    result_sets = []
    response = execute_query_redshift(sql_query, cluster_identifier, database, db_user)

The query might fail because of errors in the LLM-generated SQL. This is why we have a debugging step, which can iterate a certain number of times, asking the LLM to look at the Amazon Redshift error message and the previous context (user question, DB schema, table samples, and the previously generated SQL query) and generate a new query that addresses it. Guidance is provided to the model using prompt engineering and instructions to come up with a different query. The new query is then run on the cluster again. This process is configured in the sample code to repeat up to five times, or until the query runs successfully. If the query doesn’t run successfully within the specified number of retries, a failure message is returned to the user. This step is highlighted in red in the diagram. A condensed sketch of this retry loop appears after the llm_debugger function that follows.

def llm_debugger(question, statement, error, params): 
    """
    Generate debugging guidance and expected SQL correction for a PostgreSQL error.
    Args:
        question (str): The user's question or intent.
        statement (str): The SQL statement that caused the error.
        error (str): The error message encountered.
        params (dict): Additional parameters including schema, sample data, and length.
    Returns:
        str: Formatted debugging guidance and expected SQL correction.
    """
    prompts=f'''<s><<SYS>>[INST]
You are a PostgreSQL developer who is an expert at debugging errors.  

Here are the schema definition of table(s):
{params['schema']}
#############################
Here are example records for each table:
{params['sample']}
#############################
Here is the sql statement that threw the error below:
{statement}
#############################
Here is the error to debug:
{error}
#############################
Here is the intent of the user:
{params['prompt']}
<</SYS>>
First understand the error and think about how you can fix the error.
Use the provided schema and sample row to guide your thought process for a solution.
Do all this thinking inside <thinking></thinking> XML tags. This is a space for you to write down relevant content and will not be shown to the user.

Once you are done debugging, provide the correct SQL statement without any additional text.
When generating the correct SQL statement:
1. Pay attention to the schema and table name and use them correctly in your generated sql. 
2. Never query for all columns from a table unless the question says so. You must query only the columns that are needed to answer the question.
3. Wrap each column name in double quotes (") to denote them as delimited identifiers. Do not use a backslash (\) to escape underscores (_) in column names. 

Format your response as:
<sql> Correct SQL Statement </sql>[/INST]'''
    answer=query_llm(prompts,round(params['sql-len']))
    return answer
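
A condensed sketch of the retry workflow described above, using the helpers shown in this post, could look like the following; the variable names and failure handling are illustrative, and the complete logic is in the GitHub repository.

# Sketch of the debugging/retry loop (illustrative; see the repository for the full logic)
max_retries = 5
attempt = 0
while output is None and attempt < max_retries:
    attempt += 1
    # Ask the LLM to inspect the Redshift error and propose a corrected query
    corrected = llm_debugger(question, q_s, error_message, params)
    sql_match = re.search(sql_pattern, corrected)
    q_s = sql_match.group(1)
    output, q_s = single_execute_query(q_s, CLUSTER_IDENTIFIER, DATABASE, DB_USER, question)
if output is None:
    response = "Sorry, I was unable to generate a working SQL query for this question."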

If the query successfully runs, we pass the tabular results from Amazon Redshift to the LLM to interpret them and, based on the initial question, provide an answer in natural language to be returned to the user (Steps 10–13):

    if len(input_token)>28000:
        csv_rows=output.split('\n')
        chunk_rows=chunk_csv_rows(csv_rows, 20000)
        initial_summary=[]
        for chunk in chunk_rows:
            prompts=f'''<s><<SYS>>[INST]You are a helpful and truthful assistant. Your job is to provide answers based on samples of the tabular data provided.

Here is the tabular data:
#######
{chunk}
#######
<</SYS>>
Question: {question}

When providing your response:
- First, review the result to understand the information within. Then provide a complete answer to my question, based on the result.
- If you can't answer the question, please say so[/INST]'''
            initial_summary.append(qna_llm(prompts,params))
        prompts = f'''<s><<SYS>>[INST]You are a helpful and truthful assistant.

Here are multiple answers to a question, each based on a different subset of a tabular dataset:
#######
{initial_summary}
#######
<</SYS>>
Question: {question}
Based on the given question above, merge all the answers provided into a single coherent answer[/INST]'''
        response=qna_llm(prompts,params)
        
    else:        
        prompts=f'''<s><<SYS>>[INST]You are a helpful and truthful assistant. Your job is to examine a sql statement and its generated result, then provide a response to my question.

Here is the sql query:
{q_s}

Here is the corresponding sql query result:
{output}
<</SYS>>
question: {question}

When providing your response:
- First, review the sql query and the corresponding result. Then provide a complete answer to my question, based on the result.
- If you can't answer the question, please say so[/INST]'''
        response=qna_llm(prompts, params) 
    return response, q_s,output

Test the solution

Let’s explore an example run of the solution. We ask the question, “What are the top five seller names in San Diego, based on the number of tickets sold in 2008?” The following is the SQL query generated:

SELECT MAX("u"."firstname") AS firstname, MAX("u"."lastname") AS lastname, SUM("l"."numtickets") AS total_tickets_sold
FROM "users" AS u
JOIN "sales" AS s ON "u"."userid" = "s"."sellerid"
JOIN "listing" AS l ON "s"."listid" = "l"."listid"
JOIN "event" AS e ON "l"."eventid" = "e"."eventid"
JOIN "date" AS d ON "l"."dateid" = "d"."dateid"
WHERE "d"."year" = 2008 AND "u"."city" = 'San Diego'
GROUP BY "u"."userid"
ORDER BY total_tickets_sold DESC
LIMIT 5;

The following is the query response from Amazon Redshift:

firstname,lastname,total_tickets_sold
Julie,Hanson,156
Joan,Wright,115
Ocean,Sears,100
Hiram,Turner,99
Jena,Burch,90

The following is the final answer provided by the LLM:

The top five sellers in San Diego, based on the number of tickets sold in 2008, are:
1. Julie Hanson with a total of 156 tickets sold.
2. Joan Wright with a total of 115 tickets sold.
3. Ocean Sears with a total of 100 tickets sold.
4. Hiram Turner with a total of 99 tickets sold.
5. Jena Burch with a total of 90 tickets sold.

Best practices

Enhancing response efficiency in text-to-SQL systems involves incorporating several key best practices:

  • Caching parsed SQL – To improve response times and avoid reprocessing repeated queries, parsed SQL and recognized query prompts can be cached by the system. This cache can be checked before invoking the LLM for each new text query (see the sketch after this list).
  • Monitoring – Usage logs and metrics around query parsing, SQL generation latency, and result set sizes should be collected. Monitoring this data enables optimization by revealing pain points—whether from inadequate training data, limitations in prompt engineering, or data model issues.
  • Scheduled data refresh – To keep materialized view data current, refresh schedules using batch or incremental approaches are needed. The right balance mitigates the overhead of the refresh while making sure that text queries generate results using the latest data.
  • Central data catalog – Maintaining a centralized data catalog provides a unified metadata layer across data sources, which is critical for guiding LLM SQL generation. This catalog enables selecting appropriate tables and schemas to handle text queries.
  • Guardrails – Use prompt engineering to instruct the LLM not to generate SQL that would alter tables, and add logic to prevent running queries that would modify any data. One important recommendation is to use a database user role that only has read privileges.
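
As an illustration of the caching idea from the first item, a minimal in-memory cache keyed on the normalized question could look like the following sketch; a production system would more likely use a shared store such as Amazon ElastiCache.

# Sketch of caching generated SQL per normalized question (names are illustrative)
sql_cache = {}

def cached_generate_sql(question, params, generate_sql_fn):
    key = ' '.join(question.lower().split())  # normalize case and whitespace
    if key not in sql_cache:
        sql_cache[key] = generate_sql_fn(question, params)  # invoke the LLM only on a cache miss
    return sql_cache[key]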

By considering these optimization dimensions, natural language-to-SQL solutions can scale efficiently while delivering intuitive data access. As with any generative AI system, keeping an eye on performance is key while enabling more users to benefit.

These are just a few of the different best practices that you can follow. For a deeper dive, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.

Clean up

To clean up your resources, complete the steps in this section.

Delete the SageMaker endpoint

To delete a SageMaker model endpoint, follow these steps:

  1. On the SageMaker console, in the navigation pane, choose Inference, then choose Endpoints.
  2. On the Endpoints page, select the endpoint you want to delete.
  3. On the Actions menu, choose Delete.
  4. On the confirmation page, choose Delete to delete the endpoint.

The endpoint deletion process will begin. You can check the endpoint status on the Endpoints page to confirm it has been deleted.

Delete the Redshift cluster

Complete the following steps to delete your Redshift cluster:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters to display your list of clusters.
  2. Choose the cluster you want to delete.
  3. On the Actions menu, choose Delete.
  4. Confirm the cluster to be deleted, then choose Delete cluster.

The cluster status will be updated as the cluster is deleted. This process usually takes a few minutes.

Conclusion

The ability to query data through intuitive natural language interfaces unlocks huge potential for business users. Instead of struggling with complex SQL syntax, teams can self-serve the analytical insights they need, on demand. This improves time-to-value while allowing less technical users to access and extract meaning from enterprise data.

As highlighted in this post, the latest advances in generative AI make robust NLQ-to-SQL systems achievable. With foundation models such as Mixtral 8x7B running on SageMaker and tools and libraries for connecting to different data sources, organizations can now have an enterprise-grade solution to convert natural language queries into efficient SQL. By eliminating the traditional SQL bottleneck, generative NLQ-to-SQL systems give back countless hours each week for analysts and non-technical roles, driving greater business agility and democratization in self-service analytics.

As generative AI continues to mature rapidly, keeping up with the latest models and optimization techniques is critical. This post only scratched the surface of what will be possible in the near future as these technologies improve. Natural language interfaces for accessing and manipulating data still have huge runways for innovation ahead. To learn more about how AWS is helping customers make their ideas a reality, refer to the Generative AI Innovation Center.


About the Authors

Jose Navarro is an AI/ML Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

Prashanth Ganapathy is a Senior Solutions Architect in the Small Medium Business (SMB) segment at AWS. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them. Outside of work, Prashanth enjoys photography, travel, and trying out different cuisines.

Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching about herbs, teas, superfoods, and how to incorporate them into his daily diet.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI, assisting with the overall process from ideation to production. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.

Read More

Boost inference performance for Mixtral and Llama 2 models with new Amazon SageMaker containers

In January 2024, Amazon SageMaker launched a new version (0.26.0) of Large Model Inference (LMI) Deep Learning Containers (DLCs). This version offers support for new models (including Mixture of Experts), performance and usability improvements across inference backends, as well as new generation details for increased control and prediction explainability (such as reason for generation completion and token level log probabilities).

LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. With LMI DLCs on SageMaker, you can accelerate time-to-value for your generative artificial intelligence (AI) applications, offload infrastructure-related heavy lifting, and optimize large language models (LLMs) for the hardware of your choice to achieve best-in-class price-performance.

In this post, we explore the latest features introduced in this release, examine performance benchmarks, and provide a detailed guide on deploying new LLMs with LMI DLCs at high performance.

New features with LMI DLCs

In this section, we discuss new features across LMI backends, and drill down on some others that are backend-specific. LMI currently supports the following backends:

  • LMI-Distributed Library – This is the AWS framework to run inference with LLMs, inspired by open source libraries, to achieve the best possible latency and accuracy of the result
  • LMI vLLM – This is the AWS backend implementation of the memory-efficient vLLM inference library
  • LMI TensorRT-LLM toolkit – This is the AWS backend implementation of NVIDIA TensorRT-LLM, which creates GPU-specific engines to optimize performance on different GPUs
  • LMI DeepSpeed – This is the AWS adaptation of DeepSpeed, which adds true continuous batching, SmoothQuant quantization, and the ability to dynamically adjust memory during inference
  • LMI NeuronX – You can use this for deployment on AWS Inferentia2 and AWS Trainium-based instances, featuring true continuous batching and speedups, based on the AWS Neuron SDK

The following table summarizes the newly added features, both common and backend-specific.

Common across backends

          • New models supported: Mistral7B, Mixtral, Llama2-70B (NeuronX)
          • RoPE scaling support for longer contexts
          • Generation details added: generation finish reason and token-level log probability
          • Server config parameters consolidation

Backend specific

          • LMI-Distributed – Added grouping granularity for optimized GPU collectives
          • vLLM – CUDA graphs support with up to 50% performance improvement
          • TensorRT-LLM – New models supported for managed JIT compilation; support for TensorRT-LLM’s native SmoothQuant quantization
          • NeuronX – Grouped-query attention support; continuous batching performance improvements

New models supported

New popular models are supported across backends, such as Mistral-7B (all backends), the MoE-based Mixtral (all backends except Transformers-NeuronX), and Llama2-70B (Transformers-NeuronX).

Context window extension techniques

Rotary Positional Embedding (RoPE)-based context scaling is now available on the LMI-Dist, vLLM, and TensorRT-LLM backends. RoPE scaling enables the extension of a model’s sequence length during inference to virtually any size, without the need for fine-tuning.

The following are two important considerations when using RoPE:

  • Model perplexity – As the sequence length increases, so can the model’s perplexity. This effect can be partially offset by conducting minimal fine-tuning on input sequences larger than those used in the original training. For an in-depth understanding of how RoPE affects model quality, refer to Extending the RoPE.
  • Inference performance – Longer sequence lengths will consume more of the accelerator’s high bandwidth memory (HBM). This increased memory usage can adversely affect the number of concurrent requests your accelerator can handle.

Added generation details

You can now get two fine-grained details about generation results:

  • finish_reason – This gives the reason for generation completion, which can be reaching the maximum generation length, generating an end-of-sentence (EOS) token, or generating a user-defined stop token. It is returned with the last streamed sequence chunk.
  • log_probs – This returns the log probability assigned by the model for each token in the streamed sequence chunk. You can use these as a rough estimate of model confidence by computing the joint probability of a sequence as the sum of the log_probs of the individual tokens, which can be useful for scoring and ranking model outputs. Be mindful that LLM token probabilities are generally overconfident without calibration.

You can enable the generation results output by adding details=True in your input payload to LMI, leaving all other parameters unchanged:

payload = {"inputs": "your prompt",
           "parameters": {"max_new_tokens": 256, ..., "details": True}
}

Consolidated configuration parameters

Finally, LMI configuration parameters have also been consolidated. For more information about all common and backend-specific deployment configuration parameters, see Large Model Inference Configurations.

LMI-Distributed backend

At AWS re:Invent 2023, LMI-Dist added new, optimized collective operations to speed up communication between GPUs, resulting in lower latency and higher throughput for models that are too big for a single GPU. These collectives are available exclusively for SageMaker, for p4d instances.

Whereas the previous iteration only supported sharding across all 8 GPUs, LMI 0.26.0 introduces support for a tensor parallel degree of 4, in a partial all-to-all pattern. This can be combined with SageMaker inference components, with which you can granularly configure how many accelerators should be allocated to each model deployed behind an endpoint. Together, these features provide better control over the resource utilization of the underlying instance, enabling you to increase model multi-tenancy by hosting different models behind one endpoint, or fine-tune the aggregate throughput of your deployment to match your model and traffic characteristics.

The following figure compares direct all-to-all with partial all-to-all.

All to all partial collectives.

TensorRT-LLM backend

NVIDIA’s TensorRT-LLM was introduced as part of the previous LMI DLC release (0.25.0), enabling state-of-the-art GPU performance and optimizations like SmoothQuant, FP8, and continuous batching for LLMs when using NVIDIA GPUs.

TensorRT-LLM requires models to be compiled into efficient engines before deployment. The LMI TensorRT-LLM DLC can automatically handle compiling a list of supported models just-in-time (JIT), before starting the server and loading the model for real-time inference. Version 0.26.0 of the DLC grows the list of supported models for JIT compilation, introducing Baichuan, ChatGLM, GPT2, GPT-J, InternLM, Mistral, Mixtral, Qwen, SantaCoder, and StarCoder models.

JIT compilation adds several minutes of overhead to endpoint provisioning and scaling time, so it is always recommended to compile your model ahead-of-time. For a guide on how to do this and a list of supported models, see TensorRT-LLM ahead-of-time compilation of models tutorial. If your selected model isn’t supported yet, refer to TensorRT-LLM manual compilation of models tutorial to compile any other model that is supported by TensorRT-LLM.

Additionally, LMI now exposes native TensorRT-LLM SmoothQuant quantization, with parameters to control alpha and the scaling factor by token or channel. For more information about the related configurations, refer to TensorRT-LLM.

vLLM backend

The updated release of vLLM included in LMI DLC features performance improvements of up to 50% fueled by CUDA graph mode instead of eager mode. CUDA graphs accelerate GPU workloads by launching several GPU operations in one go instead of launching them individually, which reduces overheads. This is particularly effective for small models when using tensor parallelism.

The added performance comes at the trade-off of additional GPU memory consumption. CUDA graph mode is now the default for the vLLM backend, so if you are constrained on the amount of GPU memory available, you can set option.enforce_eager=True to force PyTorch eager mode.

Transformers-NeuronX backend

The updated release of NeuronX included in the LMI NeuronX DLC now supports models that feature the grouped-query attention mechanism, such as Mistral-7B and LLama2-70B. Grouped-query attention is an important optimization of the default transformer attention mechanism, where the model is trained with fewer key and value heads than query heads. This reduces the size of the KV cache on GPU memory, allowing for greater concurrency, and improving price-performance.

The following figure illustrates multi-head, grouped-query, and multi-query attention methods (source).

Diagram of grouped query attention

Different KV cache sharding strategies are available to suit different types of workloads. For more information on sharding strategies, see Grouped-query attention (GQA) support. You can enable your desired strategy (shard-over-heads, for example) with the following code:

option.group_query_attention=shard-over-heads

Additionally, the new implementation of the NeuronX DLC introduces a cache API for Transformers-NeuronX that enables access to the KV cache. It allows you to insert and remove KV cache rows for new requests while you’re handling batched inference. Before this API was introduced, the KV cache was recomputed for any newly added requests. Compared to LMI V7 (0.25.0), we have improved latency by more than 33% with concurrent requests, and support much higher throughput.

Selecting the right backend

To decide what backend to use based on the selected model and task, use the following flow chart. For individual backend user guides along with supported models, see LMI Backend User Guides.

Decision tree to decide what backend to use

Deploy Mixtral with LMI DLC with additional attributes

Let’s walk through how you can deploy the Mixtral-8x7B model with LMI 0.26.0 container and generate additional details like log_prob and finish_reason as part of the output. We also discuss how you can benefit from these additional attributes through a content generation use case.

The complete notebook with detailed instructions is available in the GitHub repo.

We start by importing the libraries and configuring the session environment:

import boto3
import sagemaker 
import json 
import io 
import numpy as np 
from sagemaker import Model, image_uris, serializers, deserializers 

role = sagemaker.get_execution_role() # execution role for the endpoint 
session = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs 
region = session._region_name # region name of the current SageMaker Studio environment

You can use SageMaker LMI containers to host models without any additional inference code. You can configure the model server either through the environment variables or a serving.properties file. Optionally, you could have a model.py file for any preprocessing or postprocessing and a requirements.txt file for any additional packages that are required to be installed.

In this case, we use the serving.properties file to configure the parameters and customize the LMI container behavior. For more details, refer to the GitHub repo. The repo explains details of the various configuration parameters that you can set. We need the following key parameters:

  • engine – Specifies the runtime engine for DJL to use. This drives the sharding and the model loading strategy in the accelerators for the model.
  • option.model_id – Specifies the Amazon Simple Storage Service (Amazon S3) URI of the pre-trained model or the model ID of a pretrained model hosted inside a model repository on Hugging Face. In this case, we provide the model ID for the Mixtral-8x7B model.
  • option.tensor_parallel_degree – Sets the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the number of workers per model that will be started up when DJL serving runs. We set this value to max (maximum GPU on the current machine).
  • option.rolling_batch – Enables continuous batching to optimize accelerator utilization and overall throughput. For the TensorRT-LLM container, we use auto.
  • option.model_loading_timeout – Sets the timeout value for downloading and loading the model to serve inference.
  • option.max_rolling_batch_size – Sets the maximum size of the continuous batch, defining how many sequences can be processed in parallel at any given time.
%%writefile serving.properties 
engine=MPI 
option.model_id=mistralai/Mixtral-8x7B-v0.1 
option.tensor_parallel_degree=max 
option.max_rolling_batch_size=32 
option.rolling_batch=auto 
option.model_loading_timeout=7200

We package the serving.properties configuration file in the tar.gz format, so that it meets SageMaker hosting requirements. We configure the DJL LMI container with tensorrtllm as the backend engine. Additionally, we specify the latest version of the container (0.26.0).

image_uri = image_uris.retrieve(
   framework="djl-tensorrtllm",
   region=session.boto_session.region_name,
   version="0.26.0"
)

Next, we upload the local tarball (containing the serving.properties configuration file) to an S3 prefix. We use the image URI for the DJL container and the Amazon S3 location to which the model serving artifacts tarball was uploaded, to create the SageMaker model object.
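
The following is a minimal sketch of that packaging and upload step; the tarball name and S3 prefix are illustrative, and the returned S3 URI is what the next snippet refers to as code_artifact.

import tarfile

# Package serving.properties into a tarball (file and prefix names are illustrative)
with tarfile.open("mymodel.tar.gz", "w:gz") as tar:
    tar.add("serving.properties")

# Upload the tarball to S3; the returned URI is used as code_artifact below
code_artifact = session.upload_data(
    "mymodel.tar.gz", bucket=session.default_bucket(), key_prefix="mixtral-lmi/code"
)
print(f"Model serving artifacts uploaded to: {code_artifact}")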

model = Model(image_uri=image_uri, model_data=code_artifact, role=role) 

instance_type = "ml.p4d.24xlarge" 
endpoint_name = sagemaker.utils.name_from_base("mixtral-lmi-model") 

model.deploy(
   initial_instance_count=1,
   instance_type=instance_type,
   endpoint_name=endpoint_name,
   container_startup_health_check_timeout=1800
)

As part of LMI 0.26.0, you can now use two additional fine-grained details about the generated output:

  • log_probs – This is the log probability assigned by the model for each token in the streamed sequence chunk. You can use these as a rough estimate of model confidence by computing the joint probability of a sequence as the sum of the log probabilities of the individual tokens, which can be useful for scoring and ranking model outputs. Be mindful that LLM token probabilities are generally overconfident without calibration.
  • finish_reason – This is the reason for generation completion, which can be reaching the maximum generation length, generating an EOS token, or generating a user-defined stop token. This is returned with the last streamed sequence chunk.

You can enable these by passing "details"=True as part of your input to the model.

Let’s see how you can generate these details. We use a content generation example to understand their application.

We define a LineIterator helper class, which has functions to lazily fetch bytes from a response stream, buffer them, and break down the buffer into lines. The idea is to serve bytes from the buffer while fetching more bytes from the stream asynchronously.

class LineIterator:
    def __init__(self, stream):
        # Iterator to get bytes from stream
        self.byte_iterator = iter(stream)
        # Buffer stream bytes until we get a full line
        self.buffer = io.BytesIO()
        # Track current reading position within buffer
        self.read_pos = 0

    def __iter__(self):
        # Make class iterable
        return self

    def __next__(self):
        while True:
            # Seek read position within buffer
            self.buffer.seek(self.read_pos)
            # Try reading a line from current position
            line = self.buffer.readline()
            # If we have a full line
            if line and line[-1] == ord('\n'):
                # Increment reading position past this line
                self.read_pos += len(line)
                # Return the line read without newline char
                return line[:-1]
            # Fetch next chunk from stream
            try:
                chunk = next(self.byte_iterator)
            # Handle end of stream
            except StopIteration:
                # Check if we have any bytes still unread
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                # If not, raise StopIteration
                raise
            # Add fetched bytes to end of buffer
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])

Generate and use token probability as an additional detail

Consider a use case where we are generating content. Specifically, we’re tasked with writing a brief paragraph about the benefits of exercising regularly for a lifestyle-focused website. We want to generate content and output some indicative score of the confidence that the model has in the generated content.

We invoke the model endpoint with our prompt and capture the generated response. We set "details": True as a runtime parameter within the input to the model. Because the log probability is generated for each output token, we append the individual log probabilities to a list. We also capture the complete generated text from the response.

sm_client = boto3.client("sagemaker-runtime")

# Set details: True as a runtime parameter within the input.
body = {"inputs": prompt, "parameters": {"max_new_tokens":512, "details": True}}
resp = sm_client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']

overall_log_prob = []

for line in LineIterator(event_stream):
    resp = json.loads(line)
    if resp['token'].get('text') != None:
        token_log_prob = resp['token']['log_prob']
        overall_log_prob.append(token_log_prob)
    elif resp['generated_text'] != None:
        generated_text= resp['generated_text']

To calculate the overall confidence score, we calculate the mean of all the individual token probabilities and subsequently get the exponential value between 0 and 1. This is our inferred overall confidence score for the generated text, which in this case is a paragraph about the benefits of regular exercising.

print(generated_text)
overall_score=np.exp(np.mean(overall_log_prob))
print(f"\n\nOverall confidence score in the generated text: {overall_score}")

This was one example of how you can generate and use log_prob in the context of a content generation use case. Similarly, you can use log_prob as a measure of confidence for classification use cases.

Alternatively, you can use it for overall output sequence or sentence-level scoring to evaluate the effect of parameters, such as temperature, on the generated output.

Generate and use finish reason as an additional detail

Let’s build on the same use case, but this time we’re tasked with writing a longer article. Additionally, we want to make sure that the output is not truncated due to generation length issues (max token length) or due to stop tokens being encountered.

To accomplish this, we use the finish_reason attribute generated in the output, monitor its value, and continue generating until the entire output is generated.

We define an inference function that takes a payload input and calls the SageMaker endpoint, streams back a response, and processes the response to extract generated text. The payload contains the prompt text as inputs and parameters like max tokens and details. The response is read in a stream and processed line by line to extract the generated text tokens into a list. We extract details like finish_reason. We call the inference function in a loop (chained requests) while adding more context each time, and track the number of tokens generated and number of requests sent until the model finishes.

def inference(payload):
    # Call SageMaker endpoint and get response stream
    resp = sm_client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(payload), ContentType="application/json")
    event_stream = resp['Body']
    text_output = []
    for line in LineIterator(event_stream):
        resp = json.loads(line) 
        # Extract text tokens if present
        if resp['token'].get('text') != None:
            token = resp['token']['text']
            text_output.append(token)  
            print(token, end='')
        # Get finish reason if details present
        if resp.get('details') != None:
            finish_reason = resp['details']['finish_reason']
            # Return extracted output, finish reason and token length
            return payload['inputs'] + ''.join(text_output), finish_reason, len(text_output)

# set details: True as a runtime parameter within the input.
payload = {"inputs": prompt,  "parameters": {"max_new_tokens":256, "details": True}} 

finish_reason = "length"
# Print initial output 
print(f"Output: {payload['inputs']}", end='')  
total_tokens = 0
total_requests = 0
while finish_reason == 'length':
    # Call inference and get extracts
    output_text, finish_reason, out_token_len = inference(payload)
    # Update payload for next request
    payload['inputs'] = output_text 
    total_tokens += out_token_len
    total_requests += 1
# Print metrics
print(f"nntotal tokens generated: {total_tokens} ntotal requests sent: {total_requests}")

As we can see, even though the max_new_token parameter is set to 256, we use the finish_reason detail attribute as part of the output to chain multiple requests to the endpoint, until the entire output is generated.

Similarly, based on your use case, you can use finish_reason to detect an insufficient output sequence length specified for a given task or an unintended completion due to a stop sequence.

Conclusion

In this post, we walked through the v0.26.0 release of the AWS LMI container. We highlighted key performance improvements, new model support, and new usability features. With these capabilities, you can better balance cost and performance characteristics while providing a better experience to your end-users.

To learn more about LMI DLC capabilities, refer to Model parallelism and large model inference. We’re excited to see how you use these new capabilities from SageMaker.


About the authors

João Moura is a Senior AI/ML Specialist Solutions Architect at AWS. João helps AWS customers – from small startups to large enterprises – train and deploy large models efficiently, and more broadly build ML platforms on AWS.

Rahul Sharma is a Senior Solutions Architect at AWS Data Lab, helping AWS customers design and build AI/ML solutions. Prior to joining AWS, Rahul has spent several years in the finance and insurance sector, helping customers build data and analytical platforms.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Jian Sheng is a Software Development Engineer at Amazon Web Services who has worked on several key aspects of machine learning systems. He has been a key contributor to the SageMaker Neo service, focusing on deep learning compilation and framework runtime optimization. Recently, he has directed his efforts and contributed to optimizing the machine learning system for large model inference.

Tyler Osterberg is a Software Development Engineer at AWS. He specializes in crafting high-performance machine learning inference experiences within SageMaker. Recently, his focus has been on optimizing the performance of Inferentia Deep Learning Containers on the SageMaker platform. Tyler excels in implementing performant hosting solutions for large language models and enhancing user experiences using cutting-edge technology.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high performance model inference on SageMaker.

Read More

Improving Content Moderation with Amazon Rekognition Bulk Analysis and Custom Moderation

Amazon Rekognition makes it easy to add image and video analysis to your applications. It’s based on the same proven, highly scalable, deep learning technology developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. It requires no machine learning (ML) expertise to use and we’re continually adding new computer vision features to the service. Amazon Rekognition includes a simple, easy-to-use API that can quickly analyze any image or video file that’s stored in Amazon Simple Storage Service (Amazon S3).

Customers across industries such as advertising and marketing technology, gaming, media, and retail & e-commerce rely on images uploaded by their end-users (user-generated content or UGC) as a critical component to drive engagement on their platform. They use Amazon Rekognition content moderation to detect inappropriate, unwanted, and offensive content in order to protect their brand reputation and foster safe user communities.

In this post, we will discuss the following:

  • Content Moderation model version 7.0 and capabilities
  • How does Amazon Rekognition Bulk Analysis work for Content Moderation
  • How to improve Content Moderation prediction with Bulk Analysis and Custom Moderation

Content Moderation Model Version 7.0 and Capabilities

Amazon Rekognition Content Moderation version 7.0 adds 26 new moderation labels and expands the moderation label taxonomy from a two-tier to a three-tier label category. These new labels and the expanded taxonomy enable customers to detect fine-grained concepts on the content they want to moderate. Additionally, the updated model introduces a new capability to identify two new content types, animated and illustrated content. This allows customers to create granular rules for including or excluding such content types from their moderation workflow. With these new updates, customers can moderate content in accordance with their content policy with higher accuracy.

Let’s look at a moderation label detection example for the following image.

The following table shows the moderation labels, content type, and confidence scores returned in the API response.

Moderation Label         Taxonomy Level    Confidence Score
Violence                 L1                92.6%
Graphic Violence         L2                92.6%
Explosions and Blasts    L3                92.6%

Content Type             Confidence Score
Illustrated              93.9%

To obtain the full taxonomy for Content Moderation version 7.0, visit our developer guide.

Bulk Analysis for Content Moderation

Amazon Rekognition Content Moderation also provides batch image moderation in addition to real-time moderation using Amazon Rekognition Bulk Analysis. It enables you to analyze large image collections asynchronously to detect inappropriate content and gain insights into the moderation categories assigned to the images. It also eliminates the need for building a batch image moderation solution for customers.

You can access the bulk analysis feature either via the Amazon Rekognition console or by calling the APIs directly using the AWS CLI and the AWS SDKs. On the Amazon Rekognition console, you can upload the images you want to analyze and get results with a few clicks. Once the bulk analysis job completes, you can identify and view the moderation label predictions, such as Explicit, Non-Explicit Nudity of Intimate parts and Kissing, Violence, Drugs & Tobacco, and more. You also receive a confidence score for each label category.

Create a bulk analysis job on the Amazon Rekognition console

Complete the following steps to try Amazon Rekognition Bulk Analysis:

  1. On the Amazon Rekognition console, choose Bulk Analysis in the navigation pane.
  2. Choose Start Bulk Analysis.
  3. Enter a job name and specify the images to analyze, either by entering an S3 bucket location or by uploading images from your computer.
  4. Optionally, you can select an adapter to analyze images using the custom adapter that you have trained using Custom Moderation.
  5. Choose Start analysis to run the job.

When the process is complete, you can see the results on the Amazon Rekognition console. Also, a JSON copy of the analysis results will be stored in the Amazon S3 output location.

Amazon Rekognition Bulk Analysis API request

In this section, we guide you through creating a bulk analysis job for image moderation using programming interfaces. If your image files aren’t already in an S3 bucket, upload them to ensure access by Amazon Rekognition. Similar to creating a bulk analysis job on the Amazon Rekognition console, when invoking the StartMediaAnalysisJob API, you need to provide the following parameters:

  • OperationsConfig – These are the configuration options for the media analysis job to be created:
    • MinConfidence – The minimum confidence level with the valid range of 0–100 for the moderation labels to return. Amazon Rekognition doesn’t return any labels with a confidence level lower than this specified value.
  • Input – This includes the following:
    • S3Object – The S3 object information for the input manifest file, including the bucket and name of the file. The input file includes one JSON line for each image stored in the S3 bucket, for example: {"source-ref": "s3://MY-INPUT-BUCKET/1.jpg"}
  • OutputConfig – This includes the following:
    • S3Bucket – The S3 bucket name for the output files.
    • S3KeyPrefix – The key prefix for the output files.

See the following code:

import boto3
import os
import datetime
import time
import json
import uuid

region = boto3.session.Session().region_name
s3=boto3.client('s3')
rekognition_client=boto3.client('rekognition', region_name=region)

min_confidence = 50
input_bucket = "MY-INPUT-BUCKET"

input_file = "input_file.jsonl"
output_bucket = "MY-OUTPUT-BUCKET"
key_prefix = "moderation-results"
job_name = "bulk-analysis-demo"

job_start_response = rekognition_client.start_media_analysis_job(
    OperationsConfig={"DetectModerationLabels": {"MinConfidence": min_confidence}},
    JobName = job_name,
    Input={"S3Object": {"Bucket": input_bucket, "Name": input_file}},
    OutputConfig={"S3Bucket": output_bucket, "S3KeyPrefix": key_prefix},
)

job_id = job_start_response["JobId"]
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = rekognition_client.get_media_analysis_job(JobId=job_id)
    job_status = job["Status"]
    if job_status in ["SUCCEEDED", "FAILED"]:
        print(f"Job {job_name} is {job_status}.")
        if job_status == "SUCCEEDED":
            print(
                f"Bulk Analysis output file copied to:n"
                f"tBucket: {job['Results']['S3Object']['Bucket']}n"
                f"tObject: {job['Results']['S3Object']['Name']}."
            )
        break
    else:
        print(f"Waiting for {job_name}. Current status is {job_status}.")
    time.sleep(10)

You can invoke the same media analysis using the following AWS CLI command:

aws rekognition start-media-analysis-job \
--operations-config "DetectModerationLabels={MinConfidence='50'}" \
--input "S3Object={Bucket=input_bucket,Name=input_file.jsonl}" \
--output-config "S3Bucket=output_bucket,S3KeyPrefix=moderation-results"

Amazon Rekognition Bulk Analysis API results

To get a list of bulk analysis jobs, you can use ListMediaAnalysisJobs. The response includes all the details about the analysis job input and output files and the status of the job:

# get the latest 10 media analysis jobs
moderation_job_list = rekognition_client.list_media_analysis_jobs(MaxResults=10, NextToken="")
for job_result in moderation_job_list["MediaAnalysisJobs"]:
    print(f'JobId: {job_result["JobId"]}, Status: {job_result["Status"]},\n'
          f'Summary: {job_result["ManifestSummary"]["S3Object"]["Name"]},\n'
          f'Result: {job_result["Results"]["S3Object"]["Name"]}\n')

You can also invoke the list-media-analysis-jobs command via the AWS CLI:

aws rekognition list-media-analysis-jobs --max-results 10

Amazon Rekognition Bulk Analysis generates two output files in the output bucket. The first file is manifest-summary.json, which includes bulk analysis job statistics and a list of errors:

{
    "version": "1.0",
    "statistics": {
      "total-json-lines": 2,
      "valid-json-lines": 2,
      "invalid-json-lines": 0
    },
    "errors": []
 }

The second file is results.json, which includes one JSON line for each analyzed image in the following format. Each result includes the top-level category (L1) of a detected label and the second-level category of the label (L2), with a confidence score between 1–100. Some taxonomy level 2 labels may have taxonomy level 3 labels (L3), allowing a hierarchical classification of the content.

{
  "source-ref": "s3://MY-INPUT-BUCKET/1.jpg",
    "detect-moderation-labels": {
    "ModerationLabels": [
      {
        "ParentName": "Products",
        "TaxonomyLevel": 3,
        "Confidence": 91.9385,
        "Name": "Pills"
      },
      {
        "ParentName": "Drugs & Tobacco",
        "TaxonomyLevel": 2,
        "Confidence": 91.9385,
        "Name": "Products"
      },
      {
        "ParentName": "",
        "TaxonomyLevel": 1,
        "Confidence": 91.9385,
        "Name": "Drugs & Tobacco"
      }
    ],
    "ModerationModelVersion": "7.0",
    "ContentTypes": [
      
    ]
  }
}
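
Because results.json is a JSON Lines file, you can post-process it with a few lines of Python. The following is a minimal sketch (the bucket and key are illustrative placeholders) that downloads the results file from Amazon S3 and prints the top-level (L1) categories detected for each image:

import json
import boto3

s3 = boto3.client("s3")

# Illustrative values; use the output bucket and key prefix of your bulk analysis job
output_bucket = "MY-OUTPUT-BUCKET"
results_key = "moderation-results/JOB-ID/results.json"

response = s3.get_object(Bucket=output_bucket, Key=results_key)
for line in response["Body"].read().decode("utf-8").splitlines():
    record = json.loads(line)
    labels = record["detect-moderation-labels"]["ModerationLabels"]
    # Keep only the top-level (L1) categories for a quick summary
    l1_categories = [label["Name"] for label in labels if label["TaxonomyLevel"] == 1]
    print(record["source-ref"], l1_categories)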

Improving Content Moderation model prediction using Bulk Analysis and Custom Moderation

You can enhance the accuracy of the Content Moderation base model with the Custom Moderation feature. With Custom Moderation, you can train a Custom Moderation adapter by uploading and annotating your own images. Adapters are modular components that can extend and enhance the capabilities of the Amazon Rekognition deep learning model. A straightforward way to annotate your images is to verify the predictions of your bulk analysis job and use those verifications to train a custom adapter. To verify the prediction results, complete the following steps:

  1. On the Amazon Rekognition console, choose Bulk Analysis in the navigation pane.
  2. Choose the bulk analysis job, then choose Verify predictions.

On the Verify prediction page, you can see all the images evaluated in this job and the predicted labels.

  3. Select each image's label as present (check mark) to validate a true positive, or mark it as non-present (X mark) to invalidate the assigned label (that is, the label prediction is a false positive).
  4. If the appropriate label is not assigned to the image (that is, a false negative), you can also select and assign the correct labels to the image.

Based on your verifications, false positives and false negatives are updated in the verification statistics. You can use these verifications to train a Custom Moderation adapter, which allows you to enhance the accuracy of the content moderation predictions.

  5. As a prerequisite, training a Custom Moderation adapter requires you to verify at least 20 false positives or 50 false negatives for each moderation label that you want to improve. After you verify 20 false positives or 50 false negatives, you can choose Train an adapter.

You can later use Custom Moderation adapters to analyze your images, either by selecting the custom adapter when creating a new bulk analysis job on the console or by passing the adapter's unique adapter ID through the API.
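
As an illustration, the following minimal sketch starts a bulk analysis job that uses a Custom Moderation adapter; the adapter ARN is a placeholder, and we assume the adapter is supplied through the ProjectVersion field of the moderation configuration (the other variables are the ones defined earlier in this post):

# Placeholder ARN of a trained Custom Moderation adapter
adapter_arn = "arn:aws:rekognition:us-east-1:111122223333:project/my-adapter/version/my-adapter.2024-01-01/1234567890123"

job_start_response = rekognition_client.start_media_analysis_job(
    OperationsConfig={
        "DetectModerationLabels": {
            "MinConfidence": min_confidence,
            "ProjectVersion": adapter_arn,  # use the custom adapter instead of the base model only
        }
    },
    JobName="bulk-analysis-with-adapter",
    Input={"S3Object": {"Bucket": input_bucket, "Name": input_file}},
    OutputConfig={"S3Bucket": output_bucket, "S3KeyPrefix": key_prefix},
)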

Summary

In this post, we provided an overview of Content Moderation version 7.0, Bulk Analysis for Content Moderation, and how to improve Content Moderation predictions using Bulk Analysis and Custom Moderation. To try the new moderation labels and bulk analysis, log in to your AWS account and check out the Amazon Rekognition console for Image Moderation and Bulk Analysis.


About the authors

Mehdy Haghy is a Senior Solutions Architect at AWS WWCS team, specializing in AI and ML on AWS. He works with enterprise customers, helping them migrate, modernize, and optimize their workloads for the AWS cloud. In his spare time, he enjoys cooking Persian foods and electronics tinkering.

Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.

Maria Handoko is a Senior Product Manager at AWS. She focuses on helping customers solve their business challenges through machine learning and computer vision. In her spare time, she enjoys hiking, listening to podcasts, and exploring different cuisines.

Read More

Understanding and predicting urban heat islands at Gramener using Amazon SageMaker geospatial capabilities

Understanding and predicting urban heat islands at Gramener using Amazon SageMaker geospatial capabilities

This is a guest post co-authored by Shravan Kumar and Avirat S from Gramener.

Gramener, a Straive company, contributes to sustainable development by focusing on agriculture, forestry, water management, and renewable energy. By providing authorities with the tools and insights they need to make informed decisions about environmental and social impact, Gramener is playing a vital role in building a more sustainable future.

Urban heat islands (UHIs) are areas within cities that experience significantly higher temperatures than their surrounding rural areas. UHIs are a growing concern because they can lead to various environmental and health issues. To address this challenge, Gramener has developed a solution that uses spatial data and advanced modeling techniques to understand and mitigate the following UHI effects:

  • Temperature discrepancy – UHIs can cause urban areas to be hotter than their surrounding rural regions.
  • Health impact – Higher temperatures in UHIs contribute to a 10–20% increase in heat-related illnesses and fatalities.
  • Energy consumption – UHIs amplify air conditioning demands, resulting in an up to 20% surge in energy consumption.
  • Air quality – UHIs worsen air quality, leading to elevated levels of smog and particulate matter, which can increase respiratory problems.
  • Economic impact – UHIs can result in billions of dollars in additional energy costs, infrastructure damage, and healthcare expenditures.

Gramener’s GeoBox solution empowers users to effortlessly tap into and analyze public geospatial data through its powerful API, enabling seamless integration into existing workflows. This streamlines exploration and saves valuable time and resources, allowing communities to quickly identify UHI hotspots. GeoBox then transforms raw data into actionable insights presented in user-friendly formats like raster, GeoJSON, and Excel, ensuring clear understanding and immediate implementation of UHI mitigation strategies. This empowers communities to make informed decisions and implement sustainable urban development initiatives, ultimately supporting citizens through improved air quality, reduced energy consumption, and a cooler, healthier environment.

This post demonstrates how Gramener’s GeoBox solution uses Amazon SageMaker geospatial capabilities to perform earth observation analysis and unlock UHI insights from satellite imagery. SageMaker geospatial capabilities make it straightforward for data scientists and machine learning (ML) engineers to build, train, and deploy models using geospatial data. SageMaker geospatial capabilities allow you to efficiently transform and enrich large-scale geospatial datasets, and accelerate product development and time to insight with pre-trained ML models.

Solution overview

GeoBox aims to analyze and predict the UHI effect by harnessing spatial characteristics. It helps in understanding how proposed infrastructure and land use changes can impact UHI patterns and identifies the key factors influencing UHI. This analytical model provides accurate estimates of land surface temperature (LST) at a granular level, allowing Gramener to quantify changes in the UHI effect based on parameters such as the spectral indexes and datasets used.

GeoBox provides city departments with the following benefits:

  • Improved climate adaptation planning – Informed decisions reduce the impact of extreme heat events.
  • Support for green space expansion – More green spaces enhance air quality and quality of life.
  • Enhanced interdepartmental collaboration – Coordinated efforts improve public safety.
  • Strategic emergency preparedness – Targeted planning reduces the potential for emergencies.
  • Health services collaboration – Cooperation leads to more effective health interventions.

Solution workflow

In this section, we discuss how the different components work together, from data acquisition to spatial modeling and forecasting, serving as the core of the UHI solution. The solution follows a structured workflow, with a primary focus on addressing UHIs in a city in Canada.

Phase 1: Data pipeline

The Landsat 8 satellite captures detailed imagery of the area of interest every 15 days at 11:30 AM, providing a comprehensive view of the city’s landscape and environment. A grid system is established with a 48-meter grid size using Mapbox’s Supermercado Python library at zoom level 19, enabling precise spatial analysis.

Data Pipeline

Phase 2: Exploratory analysis

Integrating infrastructure and population data layers, Geobox empowers users to visualize the city’s variable distribution and derive urban morphological insights, enabling a comprehensive analysis of the city’s structure and development.

Also, Landsat imagery from phase 1 is used to derive insights like the Normalized Difference Vegetation Index (NDVI) and Normalized Difference Built-up Index (NDBI), with data meticulously scaled to the 48-meter grid for consistency and accuracy.

Exploratory Analysis

The following variables are used:

  • Land surface temperature
  • Building site coverage
  • NDVI
  • Building block coverage
  • NDBI
  • Building area
  • Albedo
  • Building count
  • Modified Normalized Difference Water Index (MNDWI)
  • Building height
  • Number of floors and floor area
  • Floor area ratio

Phase 3: Analytics model

This phase comprises three modules, employing ML models on data to gain insights into LST and its relationship with other influential factors:

  • Module 1: Zonal statistics and aggregation – Zonal statistics play a vital role in computing statistics using values from the value raster. It involves extracting statistical data for each zone based on the zone raster. Aggregation is performed at a 100-meter resolution, allowing for a comprehensive analysis of the data.
  • Module 2: Spatial modeling – Gramener evaluated three regression models (linear, spatial, and spatial fixed effects) to unravel the correlation between Land Surface Temperature (LST) and other variables. Among these models, the spatial fixed effect model yielded the highest mean R-squared value, particularly for the timeframe spanning 2014 to 2020.
  • Module 3: Variables forecasting – To forecast variables in the short term, Gramener employed exponential smoothing techniques. These forecasts aided in understanding future LST values and their trends. Additionally, they delved into long-term scale analysis by using Representative Concentration Pathway (RCP8.5) data to predict LST values over extended periods.

Analytics model

Data acquisition and preprocessing

To implement the modules, Gramener used the SageMaker geospatial notebook within Amazon SageMaker Studio. The geospatial notebook kernel is pre-installed with commonly used geospatial libraries, enabling direct visualization and processing of geospatial data within the Python notebook environment.

Gramener employed various datasets to predict LST trends, including building assessment and temperature data, as well as satellite imagery. The key to the UHI solution was using data from the Landsat 8 satellite. This Earth-imaging satellite, a joint venture of USGS and NASA, served as a fundamental component in the project.

With the SearchRasterDataCollection API, SageMaker provides a purpose-built functionality to facilitate the retrieval of satellite imagery. Gramener used this API to retrieve Landsat 8 satellite data for the UHI solution.

The SearchRasterDataCollection API uses the following input parameters:

  • Arn – The Amazon Resource Name (ARN) of the raster data collection used in the query
  • AreaOfInterest – A GeoJSON polygon representing the area of interest
  • TimeRangeFilter – The time range of interest, denoted as {StartTime: <string>, EndTime: <string>}
  • PropertyFilters – Supplementary property filters, such as specifications for maximum acceptable cloud cover, can also be incorporated

The following example demonstrates how Landsat 8 data can be queried via the API:

search_params = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/gmqa64dcu2g9ayx1", # NASA/USGS Landsat
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": coordinates
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2014-01-01T00:00:00Z",
            "EndTime": "2020-12-31T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 20.0}}}],
            "LogicalOperator": "AND",
        }
    },
}

response = geospatial_client.search_raster_data_collection(**search_params)

To process large-scale satellite data, Gramener used Amazon SageMaker Processing with the geospatial container. SageMaker Processing enables the flexible scaling of compute clusters to accommodate tasks of varying sizes, from processing a single city block to managing planetary-scale workloads. Traditionally, manually creating and managing a compute cluster for such tasks was both costly and time-consuming, particularly due to the complexities involved in standardizing an environment suitable for geospatial data handling.

Now, with the specialized geospatial container in SageMaker, managing and running clusters for geospatial processing has become more straightforward. This process requires minimal coding effort: you simply define the workload, specify the location of the geospatial data in Amazon Simple Storage Service (Amazon S3), and select the appropriate geospatial container. SageMaker Processing then automatically provisions the necessary cluster resources, facilitating the efficient run of geospatial tasks on scales that range from city level to continent level.

Processing

SageMaker fully manages the underlying infrastructure required for the processing job. It allocates cluster resources for the duration of the job and removes them upon job completion. Finally, the results of the processing job are saved in the designated S3 bucket.

A SageMaker Processing job using the geospatial image can be configured as follows from within the geospatial notebook:

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import ScriptProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

execution_role_arn = get_execution_role()

geospatial_image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'
processor = ScriptProcessor(
    command=['python3'],
    image_uri=geospatial_image_uri,
    role=execution_role_arn,
    instance_count=20,
    instance_type='ml.m5.xlarge',
    base_job_name='geospatial-processing-spectral-indices'
)

The instance_count parameter defines how many instances the processing job should use, and the instance_type defines what type of instance should be used.

The following example shows how a Python script is run on the processing job cluster. When the run command is invoked, the cluster starts up and automatically provisions the necessary cluster resources:

processor.run(
    code='calculate_variables.py',
    inputs=[
        ProcessingInput(
            source=s3_manifest_url,
            destination='/opt/ml/processing/input_data/',
            s3_data_type="ManifestFile",
            s3_data_distribution_type="ShardedByS3Key"
        ),
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output_data/',
            destination=s3_output_prefix_url
        )
    ]
)

Spatial modeling and LST predictions

In the processing job, a range of variables, including top-of-atmosphere spectral radiance, brightness temperature, and reflectance from Landsat 8, are computed. Additionally, morphological variables such as floor area ratio (FAR), building site coverage, building block coverage, and Shannon’s Entropy Value are calculated.

The following code demonstrates how this band arithmetic can be performed:

import concurrent.futures

def calculate_ndvi(nir08, red):
    return (nir08 - red) / (nir08 + red) 
 
def calculate_ndbi(swir16, nir08): 
    return (swir16 - nir08) / (swir16 + nir08) 
 
def calculate_st(bt): 
    return ((bt * 0.00341802) + 149.0) - 273 
 
def indices_calc(data): 
    with concurrent.futures.ThreadPoolExecutor() as executor: 
        ndvi_future = executor.submit(calculate_ndvi, data.sel(band="SR_B5"), data.sel(band="SR_B4")) 
        ndbi_future = executor.submit(calculate_ndbi, data.sel(band="SR_B6"), data.sel(band="SR_B5")) 
        st_future = executor.submit(calculate_st, data.sel(band="ST_B10")) 
 
        ndvi = ndvi_future.result() 
        ndbi = ndbi_future.result() 
        st = st_future.result() 
 
    ndvi.attrs = data.attrs 
    ndbi.attrs = data.attrs 
    st.attrs = data.attrs 
 
    return ndvi, ndbi, st 

After the variables have been calculated, zonal statistics are performed to aggregate the data by grid. This involves calculating statistics based on the values of interest within each zone. For these computations, a grid size of approximately 100 meters was used.

# datacube, hexgrid_utm, and DATA are assumed to be defined earlier in the pipeline
def process_iteration(st, ndvi, ndmi, date, city_name): 
    datacube['st'] = (st.dims, st.values) 
    datacube['ndvi'] = (ndvi.dims, ndvi.values) 
    datacube['ndmi'] = (ndmi.dims, ndmi.values) 
    df = datacube.groupby("id").mean().to_dataframe().reset_index() 
    merged_grid = hexgrid_utm.join(df, on='id', how='left', lsuffix='_')[['id', 'hex_id', 'geometry', 'st', 'ndvi', 'ndmi']] 
    merged_grid.to_file(f"{DATA}/{city_name}/{city_name}_outputs_{date}.geojson", driver='GeoJSON') 
    print("Working on:", date) 
 
def iterative_op(city_json, st, ndvi, ndmi, city_name): 
    with concurrent.futures.ThreadPoolExecutor() as executor: 
        futures = [ 
            executor.submit(process_iteration, st[i], ndvi[i], ndmi[i], date, city_name) 
            for i, _ in enumerate(city_json.time) 
            for date in city_json.date 
        ] 
        for future in concurrent.futures.as_completed(futures): 
            future.result() 
 
    print('Process completed') 

After aggregating the data, spatial modeling is performed. Gramener used spatial regression methods, such as linear regression and spatial fixed effects, to account for spatial dependence in the observations. This approach facilitates modeling the relationship between variables and LST at a micro level.

The following code illustrates how such spatial modeling can be run:

# Imports assumed for this snippet; grids (the grid GeoDataFrame) is defined elsewhere in the pipeline
import pandas as pd
import statsmodels.formula.api as smf
from libpysal import weights
from libpysal.weights import KNN

features = [ 
    'ndvi', 
    'ndbi', 
    'st', 
    'build_count', 
    'bbc' 
] 
 
def compute_spatial_weights(df, k=8): 
    knn = KNN.from_dataframe(df, k=k) 
    return df[features].apply(lambda y: weights.spatial_lag.lag_spatial(knn, y)).rename(columns=lambda c: 'w_' + c) 
 
def ordinary_least_squares(df_year, spatial=False): 
    formula = f"lst ~ {' + '.join(features)}"  
    if spatial: 
        df_year = df_year.join(compute_spatial_weights(df_year)) 
        formula += f" + {' + '.join(['w_' + f for f in features])}"  
     
    return smf.ols(formula, data=df_year).fit() 
 
def process(df, year): 
    df_year = pd.merge(df[df['year'] == year].fillna(0), grids[['idx', 'name']], on='idx') 
    ols_model = ordinary_least_squares(df_year) 
    ols_spatial_model = ordinary_least_squares(df_year, spatial=True) 
    ols_spatial_fe_model = ordinary_least_squares(df_year, spatial=True) 
     
    return { 
        'year': year, 
        'ols_model': ols_model, 
        'ols_spatial_model': ols_spatial_model, 
        'ols_spatial_fe_model': ols_spatial_fe_model, 
        'ols_r2': [ols_model.rsquared, ols_spatial_model.rsquared, ols_spatial_fe_model.rsquared] 
    } 

Gramener used exponential smoothing to predict the LST values. Exponential smoothing is an effective method for time series forecasting that applies weighted averages to past data, with the weights decreasing exponentially over time. This method is particularly effective at smoothing out data to identify trends and patterns. By using exponential smoothing, it becomes possible to visualize and predict LST trends with greater precision, allowing for more accurate predictions of future values based on historical patterns.
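
The following is a minimal sketch of this technique (not Gramener's production forecasting code) that uses the Holt-Winters exponential smoothing implementation from statsmodels to forecast LST for a single grid cell; the monthly series values are illustrative:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative monthly mean LST values (degrees Celsius) for one grid cell
lst_series = pd.Series(
    [24.1, 26.3, 29.8, 32.5, 31.9, 28.4, 25.2, 23.0, 24.6, 27.1, 30.3, 33.0],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

# Fit an additive-trend exponential smoothing model and forecast the next three periods
model = ExponentialSmoothing(lst_series, trend="add", seasonal=None).fit()
print(model.forecast(3))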

To visualize the predictions, Gramener used the SageMaker geospatial notebook with open-source geospatial libraries to overlay model predictions on a base map and provide layered visualizations of geospatial datasets directly within the notebook.

Visualization

Conclusion

This post demonstrated how Gramener is empowering clients to make data-driven decisions for sustainable urban environments. With SageMaker, Gramener achieved substantial time savings in UHI analysis, reducing processing time from weeks to hours. This rapid insight generation allows Gramener’s clients to pinpoint areas requiring UHI mitigation strategies, proactively plan urban development and infrastructure projects to minimize UHI, and gain a holistic understanding of environmental factors for comprehensive risk assessment.

Discover the potential of integrating Earth observation data in your sustainability projects with SageMaker. For more information, refer to Get started with Amazon SageMaker geospatial capabilities.


About the Authors

Abhishek Mittal is a Solutions Architect for the worldwide public sector team with Amazon Web Services (AWS), where he primarily works with ISV partners across industries providing them with architectural guidance for building scalable architecture and implementing strategies to drive adoption of AWS services. He is passionate about modernizing traditional platforms and security in the cloud. Outside work, he is a travel enthusiast.

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.

Shravan Kumar is a Senior Director of Client Success at Gramener, with a decade of experience in business analytics, data evangelism, and forging deep client relations. He has a solid foundation in client and account management within the realm of data analytics, AI, and ML.

Avirat S is a geospatial data scientist at Gramener, leveraging AI/ML to unlock insights from geographic data. His expertise lies in disaster management, agriculture, and urban planning, where his analysis informs decision-making processes.

Read More

Build a news recommender application with Amazon Personalize

Build a news recommender application with Amazon Personalize

With a multitude of articles, videos, audio recordings, and other media created daily across news media companies, readers of all types—individual consumers, corporate subscribers, and more—often find it difficult to find news content that is most relevant to them. Delivering personalized news and experiences to readers can help solve this problem, and create more engaging experiences. However, delivering truly personalized recommendations presents several key challenges:

  • Capturing diverse user interests – News can span many topics and even within specific topics, readers can have varied interests.
  • Addressing limited reader history – Many news readers have sparse activity histories. Recommenders must quickly learn preferences from limited data to provide value.
  • Timeliness and trending – Daily news cycles mean recommendations must balance personalized content with the discovery of new, popular stories.
  • Changing interests – Readers’ interests can evolve over time. Systems have to detect shifts and adapt recommendations accordingly.
  • Explainability – Providing transparency into why certain stories are recommended builds user trust.

The ideal news recommendation system understands the individual and responds to the broader news climate and audience. Tackling these challenges is key to effectively connecting readers with content they find informative and engaging.

In this post, we describe how Amazon Personalize can power a scalable news recommender application. This solution was implemented at a Fortune 500 media customer in H1 2023 and can be reused for other customers interested in building news recommenders.

Solution overview

Amazon Personalize is a great fit to power a news recommendation engine because of its ability to provide real-time and batch personalized recommendations at scale. Amazon Personalize offers a variety of recommendation recipes (algorithms), such as the User Personalization and Trending Now recipes, which are particularly suitable for training news recommender models. The User Personalization recipe analyzes each user’s preferences based on their engagement with content over time. This results in customized news feeds that surface the topics and sources most relevant to an individual user. The Trending Now recipe complements this by detecting rising trends and popular news stories in real time across all users. Combining recommendations from both recipes allows the recommendation engine to balance personalization with the discovery of timely, high-interest stories.

The following diagram illustrates the architecture of a news recommender application powered by Amazon Personalize and supporting AWS services.

This solution has the following limitations:

  • Providing personalized recommendations for just-published articles (articles published a few minutes ago) can be challenging. We describe how to mitigate this limitation later in this post.
  • Amazon Personalize has a fixed number of interactions and items dataset features that can be used to train a model.
  • At the time of writing, Amazon Personalize doesn’t provide recommendation explanations at the user level.

Let’s walk through each of the main components of the solution.

Prerequisites

To implement this solution, you need the following:

  • Historical and real-time user click data for the interactions dataset
  • Historical and real-time news article metadata for the items dataset

Ingest and prepare the data

To train a model in Amazon Personalize, you need to provide training data. In this solution, you use two types of Amazon Personalize training datasets: the interactions dataset and items dataset. The interactions dataset contains data on user-item-timestamp interactions, and the items dataset contains features on the recommended articles.

You can take two different approaches to ingest training data:

  • Batch ingestion – You can use AWS Glue to transform and ingest interactions and items data residing in an Amazon Simple Storage Service (Amazon S3) bucket into Amazon Personalize datasets. AWS Glue performs extract, transform, and load (ETL) operations to align the data with the Amazon Personalize datasets schema. When the ETL process is complete, the output file is placed back into Amazon S3, ready for ingestion into Amazon Personalize via a dataset import job.
  • Real-time ingestion – You can use Amazon Kinesis Data Streams and AWS Lambda to ingest real-time data incrementally. A Lambda function performs the same data transformation operations as the batch ingestion job at the individual record level, and ingests the data into Amazon Personalize using the PutEvents and PutItems APIs.

In this solution, you can also ingest certain items and interactions data attributes into Amazon DynamoDB. You can use these attributes during real-time inference to filter recommendations by business rules. For example, article metadata may contain company and industry names in the article. To proactively recommend articles on companies or industries that users are reading about, you can record how frequently readers are engaging with articles about specific companies and industries, and use this data with Amazon Personalize filters to further tailor the recommended content. We discuss more about how to use items and interactions data attributes in DynamoDB later in this post.

The following diagram illustrates the data ingestion architecture.
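
As an illustration of the real-time ingestion path, the following minimal sketch sends a single click event to Amazon Personalize with the PutEvents API; the tracking ID, user ID, session ID, and item ID are placeholders:

import time
import boto3

personalize_events = boto3.client("personalize-events")

personalize_events.put_events(
    trackingId="YOUR-EVENT-TRACKER-ID",  # from your Amazon Personalize event tracker
    userId="reader-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "click",
            "itemId": "article-789",
            "sentAt": int(time.time()),
        }
    ],
)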

Train the model

The bulk of the model training effort should focus on the User Personalization model, because it can use all three Amazon Personalize datasets (whereas the Trending Now model only uses the interactions dataset). We recommend running experiments that systematically vary different aspects of the training process. For the customer that implemented this solution, the team ran over 30 experiments. This included modifying the interactions and items dataset features, adjusting the length of interactions history provided to the model, tuning Amazon Personalize hyperparameters, and evaluating whether an explicit user’s dataset improved offline performance (relative to the increase in training time).

Each model variation was evaluated based on metrics reported by Amazon Personalize on the training data, as well as custom offline metrics on a holdout test dataset. Standard metrics to consider include mean average precision (MAP) @ K (where K is the number of recommendations presented to a reader), normalized discounted cumulative gain, mean reciprocal rank, and coverage. For more information about these metrics, see Evaluating a solution version with metrics. We recommend prioritizing MAP @ K out of these metrics, which captures the average number of articles a reader clicked on out of the top K articles recommended to them, because the MAP metric is a good proxy for (real) article clickthrough rates. K should be selected based on the number of articles a reader can view on a desktop or mobile webpage without having to scroll, allowing you to evaluate recommendation effectiveness with minimal reader effort. Implementing custom metrics, such as recommendation uniqueness (which describes how unique the recommendation output was across the pool of candidate users), can also provide insight into recommendation effectiveness.
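
The following is a minimal sketch of a custom MAP @ K computation on a holdout set (an illustrative implementation, not the exact metric code used in the project):

def average_precision_at_k(recommended, clicked, k):
    """Average precision at K for one reader. recommended is the ranked list of
    article IDs; clicked is the set of articles the reader actually engaged with."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in clicked:
            hits += 1
            score += hits / rank
    return score / min(len(clicked), k) if clicked else 0.0

def map_at_k(all_recommended, all_clicked, k=5):
    """Mean average precision at K across all readers in the holdout set."""
    scores = [average_precision_at_k(r, c, k) for r, c in zip(all_recommended, all_clicked)]
    return sum(scores) / len(scores)

# Example: two readers, each shown a ranked list of articles
print(map_at_k([["a1", "a2", "a3"], ["a4", "a5"]], [{"a2"}, {"a4", "a9"}], k=5))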

With Amazon Personalize, the experimental process allows you to determine the optimal set of dataset features for both the User Personalization and Trending Now models. The Trending Now model exists within the same Amazon Personalize dataset group as the User Personalization model, so it uses the same set of interactions dataset features.

Generate real-time recommendations

When a reader visits a news company’s webpage, an API call will be made to the news recommender via Amazon API Gateway. This triggers a Lambda function that calls the Amazon Personalize models’ endpoints to get recommendations in real time. During inference, you can use filters to filter the initial recommendation output based on article or reader interaction attributes. For example, if “News Topic” (such as sports, lifestyle, or politics) is an article attribute, you can restrict recommendations to specific news topics if that is a product requirement. Similarly, you can use filters on reader interaction events, such as excluding articles a reader has already read.
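
As a sketch of this inference step (the campaign and filter ARNs are placeholders), the Lambda function could call the Amazon Personalize runtime as follows:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/news-user-personalization",
    userId="reader-123",
    numResults=10,
    # Placeholder filter that restricts results to a given news topic
    filterArn="arn:aws:personalize:us-east-1:111122223333:filter/by-news-topic",
    filterValues={"TOPIC": '"sports"'},
)
article_ids = [item["itemId"] for item in response["itemList"]]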

One key challenge with real-time recommendations is effectively including just-published articles (also called cold items) into the recommendation output. Just-published articles don’t have any historical interaction data that recommenders normally rely on, and recommendation systems need sufficient processing time to assess how relevant just-published articles are to a specific user (even if only using user-item relationship signals).

Amazon Personalize can natively auto detect and recommend new articles ingested into the items dataset every 2 hours. However, because this use case is focused on news recommendations, you need a way to recommend new articles as soon as they’re published and ready for reader consumption.

One way to solve this problem is by designing a mechanism to randomly insert just-published articles into the final recommendation output for each reader. You can add a feature to control what percent of articles in the final recommendation set were just-published articles, and similar to the original recommendation output from Amazon Personalize, you can filter just-published articles by article attributes (such as “News Topic”) if it is a product requirement. You can track interactions on just-published articles in DynamoDB as they start trickling in to the system, and prioritize the most popular just-published articles during recommendation postprocessing, until the just-published articles are detected and processed by the Amazon Personalize models.

After you have your final set of recommended articles, this output is submitted to another postprocessing Lambda function that checks the output to see if it aligns with pre-specified business rules. These can include checking whether recommended articles meet webpage layout specifications, if recommendations are served in a web browser frontend, for example. If needed, articles can be reranked to ensure business rules are met. We recommend reranking by implementing a function that allows higher-ranking articles to only fall down in ranking one place at a time until all business rules are met, providing minimal relevancy loss for readers. The final list of postprocessed articles is returned to the web service that initiated the request for recommendations.
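
A minimal sketch of such a reranking pass could look like the following; the violates_rule predicate is a placeholder for the business-rules check described above:

def rerank(articles, violates_rule, max_passes=10):
    """Move articles that violate a business rule down one position at a time,
    repeating until no rule is violated, to keep relevancy loss minimal."""
    articles = list(articles)
    for _ in range(max_passes):
        changed = False
        for i in range(len(articles) - 1):
            if violates_rule(articles[i], position=i):
                # Swap the offending article down exactly one place
                articles[i], articles[i + 1] = articles[i + 1], articles[i]
                changed = True
        if not changed:
            break
    return articles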

The following diagram illustrates the architecture for this step in the solution.

Generate batch recommendations

Personalized news dashboards (through real-time recommendations) require a reader to actively search for news, but in our busy lives today, sometimes it’s just easier to have your top news sent to you. To deliver personalized news articles as an email digest, you can use an AWS Step Functions workflow to generate batch recommendations. The batch recommendation workflow gathers and postprocesses recommendations from our User Personalization model or Trending Now model endpoints, giving flexibility to select what combination of personalized and trending articles teams want to push to their readers. Developers also have the option of using the Amazon Personalize batch inference feature; however, at the time of writing, creating an Amazon Personalize batch inference job doesn’t support including items ingested after an Amazon Personalize custom model has been trained, and it doesn’t support the Trending Now recipe.

During a batch inference Step Functions workflow, the list of readers is divided into batches, processed in parallel, and submitted to a postprocessing and validation layer before being sent to the email generation service. The following diagram illustrates this workflow.

Scale the recommender system

To effectively scale, you also need the news recommender to accommodate a growing number of users and increased traffic without creating any degradation in reader experience. Amazon Personalize model endpoints natively auto scale to meet increased traffic. Engineers only need to set and monitor a minimum provisioned transactions per second (TPS) variable for each Amazon Personalize endpoint.
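
For example, the minimum provisioned TPS is set when creating (or later updating) an Amazon Personalize campaign; the following sketch uses a placeholder solution version ARN:

import boto3

personalize = boto3.client("personalize")

response = personalize.create_campaign(
    name="news-user-personalization",
    solutionVersionArn="arn:aws:personalize:us-east-1:111122223333:solution/news/solution-version-1",
    minProvisionedTPS=5,  # baseline capacity; Amazon Personalize auto scales above this
)
print(response["campaignArn"])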

Beyond Amazon Personalize, the news recommender application presented here is built using serverless AWS services, allowing engineering teams to focus on delivering the best reader experience without worrying about infrastructure maintenance.

Conclusion

In this attention economy, it has become increasingly important to deliver relevant and timely content for consumers. In this post, we discussed how you can use Amazon Personalize to build a scalable news recommender, and the strategies organizations can implement to address the unique challenges of delivering news recommendations.

To learn more about Amazon Personalize and how it can help your organization build recommendation systems, check out the Amazon Personalize Developer Guide.

Happy building!


About the Authors

Bala Krishnamoorthy is a Senior Data Scientist at AWS Professional Services, where he helps customers build and deploy AI-powered solutions to solve their business challenges. He has worked with customers across diverse sectors, including media & entertainment, financial services, healthcare, and technology. In his free time, he enjoys spending time with family/friends, staying active, trying new restaurants, travel, and kickstarting his day with a steaming hot cup of coffee.

Rishi Jala is a NoSQL Data Architect with AWS Professional Services. He focuses on architecting and building highly scalable applications using NoSQL databases such as Amazon DynamoDB. Passionate about solving customer problems, he delivers tailored solutions to drive success in the digital landscape.

Read More

Nielsen Sports sees 75% cost reduction in video analysis with Amazon SageMaker multi-model endpoints

Nielsen Sports sees 75% cost reduction in video analysis with Amazon SageMaker multi-model endpoints

This is a guest post co-written with Tamir Rubinsky and Aviad Aranias from Nielsen Sports.

Nielsen Sports shapes the world’s media and content as a global leader in audience insights, data, and analytics. Through our understanding of people and their behaviors across all channels and platforms, we empower our clients with independent and actionable intelligence so they can connect and engage with their audiences—now and into the future.

At Nielsen Sports, our mission is to provide our customers—brands and rights holders—with the ability to measure the return on investment (ROI) and effectiveness of a sport sponsorship advertising campaign across all channels, including TV, online, social media, and even newspapers, and to provide accurate targeting at local, national, and international levels.

In this post, we describe how Nielsen Sports modernized a system running thousands of different machine learning (ML) models in production by using Amazon SageMaker multi-model endpoints (MMEs) and reduced operational and financial cost by 75%.

Challenges with channel video segmentation

Our technology is based on artificial intelligence (AI) and specifically computer vision (CV), which allows us to track brand exposure and identify its location accurately. For example, we identify if the brand is on a banner or a shirt. In addition, we identify the location of the brand on the item, such as the top corner of a sign or the sleeve. The following figure shows an example of our tagging system.

example of Nielsen tagging system

To understand our scaling and cost challenges, let’s look at some representative numbers. Every month, we identify over 120 million brand impressions across different channels, and the system must support the identification of over 100,000 brands and variations of different brands. We have built one of the largest databases of brand impressions in the world with over 6 billion data points.

Our media evaluation process includes several steps, as illustrated in the following figure:

  1. First, we record thousands of channels around the world using an international recording system.
  2. We stream the content in combination with the broadcast schedule (Electronic Programming Guide) to the next stage, which is segmentation and separation between the game broadcasts themselves and other content or advertisements.
  3. We perform media monitoring, where we add additional metadata to each segment, such as league scores, relevant teams, and players.
  4. We perform an exposure analysis of the brands’ visibility and then combine the audience information to calculate the valuation of the campaign.
  5. The information is delivered to the customer by a dashboard or analyst reports. The analyst is given direct access to the raw data or through our data warehouse.

media evaluation steps

Because we operate at a scale of over a thousand channels and tens of thousands of hours of video a year, we must have a scalable automation system for the analysis process. Our solution automatically segments the broadcast and knows how to isolate the relevant video clips from the rest of the content.

We do this using dedicated algorithms and models developed by us for analyzing the specific characteristics of the channels.

In total, we are running thousands of different models in production to support this mission, which is costly, incurs operational overhead, and is error-prone and slow. It took months to get models with a new architecture into production.

This is where we wanted to innovate and rearchitect our system.

Cost-effective scaling for CV models using SageMaker MMEs

Our legacy video segmentation system was difficult to test, change, and maintain. Some of the challenges include working with an old ML framework, inter-dependencies between components, and a hard-to-optimize workflow. This is because the pipeline was built on RabbitMQ, which was a stateful solution. To debug one component, such as feature extraction, we had to test the entire pipeline.

The following diagram illustrates the previous architecture.

previous architecture

As part of our analysis, we identified performance bottlenecks such as running a single model on a machine, which showed a low GPU utilization of 30–40%. We also discovered inefficient pipeline runs and scheduling algorithms for the models.

Therefore, we decided to build a new multi-tenant architecture based on SageMaker, which would implement performance optimization improvements, support dynamic batch sizes, and run multiple models simultaneously.

Each run of the workflow targets a group of videos. Each video is between 30–90 minutes long, and each group has more than five models to run.

Let’s examine an example: a video can be 60 minutes long, consisting of 3,600 images, and each image needs to be inferred by three different ML models during the first stage. With SageMaker MMEs, we can run batches of 12 images in parallel, and the full batch completes in less than 2 seconds. On a regular day, we have more than 20 groups of videos, and on a packed weekend day, we can have more than 100 groups of videos.
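
To illustrate how multiple models are served from a single endpoint, the following minimal sketch (the endpoint name, model artifact name, and image file are placeholders) invokes a SageMaker MME and selects the model for each request with the TargetModel parameter:

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("frame_0001.jpg", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="sports-brand-detection-mme",
    TargetModel="brand-detector-v3.tar.gz",  # which model artifact on the MME to run
    ContentType="application/x-image",
    Body=payload,
)
print(response["Body"].read())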

The following diagram shows our new, simplified architecture using a SageMaker MME.

simplified architecture using a SageMaker MME

Results

With the new architecture, we achieved many of our desired outcomes and some unseen advantages over the old architecture:

  • Better runtime – By increasing batch sizes (12 videos in parallel) and running multiple models concurrently (five models in parallel), we have decreased our overall pipeline runtime by 33%, from 1 hour to 40 minutes.
  • Improved infrastructure – With SageMaker, we upgraded our existing infrastructure, and we are now using newer AWS instances with newer GPUs such as g5.xlarge. One of the biggest benefits from the change is the immediate performance improvement from using TorchScript and CUDA optimizations.
  • Optimized infrastructure usage – By having a single endpoint that can host multiple models, we can reduce both the number of endpoints and the number of machines we need to maintain, and also increase the utilization of a single machine and its GPU. For a specific task with five videos, we now use only five machines of g5 instances, which gives us 75% cost benefit from the previous solution. For a typical workload during the day, we use a single endpoint with a single machine of g5.xlarge with a GPU utilization of more than 80%. For comparison, the previous solution had less than 40% utilization.
  • Increased agility and productivity – Using SageMaker allowed us to spend less time migrating models and more time improving our core algorithms and models. This has increased productivity for our engineering and data science teams. We can now research and deploy a new ML model in under 7 days, instead of over 1 month previously. This is a 75% improvement in velocity and planning.
  • Better quality and confidence – With SageMaker A/B testing capabilities, we can deploy our models in a gradual way and be able to safely roll back. The faster lifecycle to production also increased our ML models’ accuracy and results.

The following figure shows our GPU utilization with the previous architecture (30–40% GPU utilization).

GPU utilization with the previous architecture

The following figure shows our GPU utilization with the new simplified architecture (90% GPU utilization).

GPU utilization with the new simplified architecture

Conclusion

In this post, we shared how Nielsen Sports modernized a system running thousands of different models in production by using SageMaker MMEs and reduced their operational and financial cost by 75%.

For further reading, refer to the following:


About the Authors

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Gal Goldman is a Senior Software Engineer and an Enterprise Senior Solution Architect in AWS with a passion for cutting-edge solutions. He specializes in and has developed many distributed Machine Learning services and solutions. Gal also focuses on helping AWS customers accelerate and overcome their engineering and Generative AI challenges.

Tal Panchek is a Senior Business Development Manager for Artificial Intelligence and Machine Learning with Amazon Web Services. As a BD Specialist, he is responsible for growing adoption, utilization, and revenue for AWS services. He gathers customer and industry needs and partners with AWS product teams to innovate, develop, and deliver AWS solutions.

Tamir Rubinsky leads Global R&D Engineering at Nielsen Sports, bringing vast experience in building innovative products and managing high-performing teams. His work transformed sports sponsorship media evaluation through innovative, AI-powered solutions.

Aviad Aranias is an MLOps Team Leader and Nielsen Sports Analysis Architect who specializes in crafting complex pipelines for analyzing sports event videos across numerous channels. He excels in building and deploying deep learning models to handle large-scale data efficiently. In his spare time, he enjoys baking delicious Neapolitan pizzas.

Read More

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine learning (ML) that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio provides all the tools you need to take your models from data preparation to experimentation to production while boosting your productivity.

Amazon SageMaker Canvas is a powerful no-code ML tool designed for business and data teams to generate accurate predictions without writing code or having extensive ML experience. With its intuitive visual interface, SageMaker Canvas simplifies the process of loading, cleansing, and transforming datasets, and building ML models, making it accessible to a broader audience.

However, as your ML needs evolve, or if you require more advanced customization and control, you may want to transition from a no-code environment to a code-first approach. This is where the seamless integration between SageMaker Canvas and SageMaker Studio comes into play.

In this post, we present a solution for the following types of users:

  • Non-ML experts such as business analysts, data engineers, or developers, who are domain experts and are interested in low-code no-code (LCNC) tools to guide them in preparing data for ML and building ML models. This persona typically is only a SageMaker Canvas user and often relies on ML experts in their organization to review and approve their work.
  • ML experts who are interested in how LCNC tools can accelerate parts of the ML lifecycle (such as data prep), but are also likely to take a high-code approach to certain parts of the ML lifecycle (such as model building). This persona is typically a SageMaker Studio user who might also be a SageMaker Canvas user. ML experts also often play a role in reviewing and approving the work of non-ML experts for production use cases.

The utility of the solutions proposed in this post is two-fold. Firstly, by demonstrating how you can share models across SageMaker Canvas and SageMaker Studio, non-ML and ML experts can collaborate across their preferred environments, which might be a no-code environment (SageMaker Canvas) for non-experts and a high-code environment (SageMaker Studio) for experts. Secondly, by demonstrating how to share a model from SageMaker Canvas to SageMaker Studio, we show how ML experts who want to pivot from a LCNC approach for development to a high-code approach for production can do so across SageMaker environments. The solution outlined in this post is for users of the new SageMaker Studio. For users of SageMaker Studio Classic, see Collaborate with data scientists for how you can seamlessly transition between SageMaker Canvas and SageMaker Studio Classic.

Solution overview

To seamlessly transition between no-code and code-first ML with SageMaker Canvas and SageMaker Studio, we have outlined two options. You can choose the option based on your requirements. In some cases, you might decide to use both options in parallel.

  • Option 1: SageMaker Model Registry – A SageMaker Canvas user registers their model in the Amazon SageMaker Model Registry, invoking a governance workflow for ML experts to review model details and metrics, then approve or reject it, after which the user can deploy the approved model from SageMaker Canvas. This option is an automated sharing process providing you with built-in governance and approval tracking. You can view the model metrics; however, there is limited visibility on the model code and architecture. The following diagram illustrates the architecture.

Option 1: SageMaker Model Registry

  • Option 2: Notebook export – In this option, the SageMaker Canvas user exports the full notebook from SageMaker Canvas to Amazon Simple Storage Service (Amazon S3), then shares it with ML experts to import into SageMaker Studio, enabling complete visibility and customization of the model code and logic before the ML expert deploys the enhanced model. In this option, there is complete visibility of the model code and architecture with the ability for the ML expert to customize and enhance the model in SageMaker Studio. However, this option demands a manual export and import of the model notebook into the IDE. The following diagram illustrates this architecture.

Option 2: Notebook export

The following phases describe the steps for collaboration:

  • Share – The SageMaker Canvas user registers the model from SageMaker Canvas or downloads the notebook from SageMaker Canvas
  • Review – The SageMaker Studio user accesses the model through the model registry to review and run the exported notebook through JupyterLab to validate the model
  • Approval – The SageMaker Studio user approves the model from the model registry
  • Deploy – The SageMaker Studio user can deploy the model from JupyterLab, or the SageMaker Canvas user can deploy the model from SageMaker Canvas

Let’s look at the two options (model registry and notebook export) within each step in detail.

Prerequisites

Before you dive into the solution, make sure you have signed up for and created an AWS account. Then you need to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites. You can skip this step if you already have your own version of SageMaker Studio running.

Complete the prerequisites for setting up SageMaker Canvas and create the model of your choice for your use case.

Share the model

The SageMaker Canvas user shares the model with the SageMaker Studio user by either registering it in SageMaker Model Registry, which triggers a governance workflow, or by downloading the full notebook from SageMaker Canvas and providing it to the SageMaker Studio user.

SageMaker Model Registry

To deploy using SageMaker Model Registry, complete the following steps:

  1. After a model is created in SageMaker Canvas, choose the options menu (three vertical dots) and choose Add to Model Registry.
    add to model registry
  2. Enter a name for the model group.
  3. Choose Add.
    model group name

You can now see the model is registered.
model registered

You can also see the model is pending approval.
pending approval

SageMaker notebook export

To deploy using a SageMaker notebook, complete the following steps:

  1. On the options menu, choose View Notebook.
    view notebook
  2. Choose Copy S3 URI.
    s3 uri

You can now share the S3 URI with the SageMaker Studio user.

Review the model

The SageMaker Studio user accesses the shared model through the model registry to review its details and metrics, or they can import the exported notebook into SageMaker Studio and use Jupyter notebooks to thoroughly validate the model’s code, logic, and performance.

SageMaker Model Registry

To use the model registry, complete the following steps:

  1. On the SageMaker Studio console, choose Models in the navigation pane.
  2. Choose Registered models.
  3. Choose your model.
    model registry

You can review the model details and see that the status is pending.
status pending

You can also review the different metrics to check on the model performance.
review metrics

You can view the model metrics; however, there is limited visibility on the model code and architecture. If you want complete visibility of the model code and architecture with the ability to customize and enhance the model, use the notebook export option.

SageMaker notebook export

To use the notebook export option as the SageMaker Studio user, complete the following steps.

  1. Launch SageMaker Studio and choose JupyterLab under Applications.
  2. Open the JupyterLab space. If you don’t have a JupyterLab space, you can create one.
    jupyter lab
  3. Open a terminal and run the following command to copy the notebook from Amazon S3 to SageMaker Studio (the account number in the following example is changed to awsaccountnumber):
    sagemaker-user@default:~$ aws s3 cp s3://sagemaker-us-east-1-awsaccountnumber/Canvas/default-20240130t161835/Training/output/Canvas1707947728560/sagemaker-automl-candidates/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb ./canvas.ipynb

    terminal

  4. After the notebook is downloaded, you can open the notebook and run the notebook to evaluate further.

candidate trials

Approve the model

After a comprehensive review, the SageMaker Studio user can make an informed decision to either approve or reject the model in the model registry based on their assessment of its quality, accuracy, and suitability for the intended use case.

If you registered your model through the SageMaker Canvas UI, follow the steps below to approve the model. If you exported the model notebook from the SageMaker Canvas UI, you can optionally register and approve the model using SageMaker Model Registry, but these steps are not required.

SageMaker Model Registry

As the SageMaker Studio user, when you’re comfortable with the model, you can update the status to approved. Approval happens only in SageMaker Model Registry. Complete the following steps:

  1. In SageMaker Studio, navigate to the version of the model.
  2. On the options menu, choose Update status and Approved.
    status update
  3. Enter an optional comment and choose Save and update.

Now you can see the model is approved.
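If you prefer to update the approval status programmatically rather than through the SageMaker Studio UI, the following is a minimal sketch using boto3. The model package group name is an assumption for illustration; substitute the name you entered in SageMaker Canvas.

import boto3

sm_client = boto3.client("sagemaker")

# List the versions registered in the model package group (the group name is illustrative)
packages = sm_client.list_model_packages(ModelPackageGroupName="canvas-shared-model-group")
model_package_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]

# Move the version from PendingManualApproval to Approved
sm_client.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Reviewed metrics and notebook in SageMaker Studio",
)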

Deploy the model

When the model is ready to deploy (it has received the necessary reviews and approvals), you have two options. If you took the model registry approach, you can deploy from either SageMaker Studio or SageMaker Canvas. If you took the notebook export approach, you can deploy from SageMaker Studio. Both deployment options are detailed below.

Deploy via SageMaker Studio

The SageMaker Studio user can deploy the model from the JupyterLab space.
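The following is a minimal sketch of deploying an approved model package from the registry with the SageMaker Python SDK, for example from a notebook in the JupyterLab space. The model package ARN, endpoint name, and instance type are placeholders; adjust them for your model.

import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder ARN: copy the approved version's ARN from SageMaker Model Registry
model_package_arn = "arn:aws:sagemaker:us-east-1:awsaccountnumber:model-package/canvas-shared-model-group/1"

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# Create a real-time inference endpoint (instance type is an assumption; size it for your model)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="canvas-shared-model-endpoint",
)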

After the model is deployed, you can navigate to the SageMaker console, choose Endpoints under Inference in the navigation pane, and view the model.

Deploy via SageMaker Canvas

Alternatively, the SageMaker Canvas user can deploy the model directly from SageMaker Canvas.


After the model is deployed, you can navigate to the Endpoints page on the SageMaker console to view the model.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the SageMaker Studio notebook using the following commands:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

Previously, you could only share models to SageMaker Canvas (or view shared SageMaker Canvas models) in SageMaker Studio Classic. In this post, we showed how to share models built in SageMaker Canvas with SageMaker Studio so that different teams can collaborate and you can pivot from a no-code to a high-code deployment path. By either using SageMaker Model Registry or exporting notebooks, ML experts and non-experts can collaborate, review, and enhance models across these platforms, enabling a smooth workflow from data preparation to production deployment.

For more information about collaborating on models using SageMaker Canvas, refer to Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customer guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion for designing, creating, and promoting human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive business to new ways of working through innovation, incubation, and democratization.

Claire O’Brien Rajkumar is a Sr. Product Manager on the Amazon SageMaker team focused on SageMaker Canvas, the SageMaker low-code no-code workspace for ML and generative AI. SageMaker Canvas helps democratize ML and generative AI by lowering barriers to adoption for those new to ML and accelerating workflows for advanced practitioners.

Read More

Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless

Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless

The rise of contextual and semantic search has made searching straightforward for the consumers of ecommerce and retail businesses. Search engines and recommendation systems powered by generative AI can improve the product search experience exponentially by understanding natural language queries and returning more accurate results. This enhances the overall user experience, helping customers find exactly what they’re looking for.

Amazon OpenSearch Service now supports the cosine similarity metric for k-NN indexes. Cosine similarity measures the cosine of the angle between two vectors, where a smaller cosine angle denotes a higher similarity between the vectors. With cosine similarity, you can measure the orientation between two vectors, which makes it a good choice for some specific semantic search applications.
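As a concrete illustration, the following sketch shows what a k-NN index body using the cosine similarity space type could look like when created with an opensearch-py client. The index name, field names, and engine choice are assumptions for this example rather than the exact configuration used later in this post.

# Illustrative k-NN index definition; names and engine choice are assumptions for this sketch
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "image_vector": {
                "type": "knn_vector",
                "dimension": 1024,  # matches the default Titan Multimodal Embeddings output size
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",  # cosine similarity distance
                    "engine": "nmslib",
                },
            },
            "item_name": {"type": "text"},
        }
    },
}
# client.indices.create(index="products", body=index_body)  # client is an opensearch-py OpenSearch client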

In this post, we show how to build a contextual text and image search engine for product recommendations using the Amazon Titan Multimodal Embeddings model, available in Amazon Bedrock, with Amazon OpenSearch Serverless.

A multimodal embeddings model is designed to learn joint representations of different modalities like text, images, and audio. By training on large-scale datasets containing images and their corresponding captions, a multimodal embeddings model learns to embed images and texts into a shared latent space. The following is a high-level overview of how it works conceptually:

  • Separate encoders – These models have separate encoders for each modality—a text encoder for text (for example, BERT or RoBERTa), an image encoder for images (for example, a CNN), and audio encoders for audio (for example, models like Wav2Vec). Each encoder generates embeddings capturing the semantic features of its respective modality.
  • Modality fusion – The embeddings from the uni-modal encoders are combined using additional neural network layers. The goal is to learn interactions and correlations between the modalities. Common fusion approaches include concatenation, element-wise operations, pooling, and attention mechanisms.
  • Shared representation space – The fusion layers help project the individual modalities into a shared representation space. By training on multimodal datasets, the model learns a common embedding space where embeddings from each modality that represent the same underlying semantic content are closer together (the toy sketch after this list illustrates the idea).
  • Downstream tasks – The joint multimodal embeddings generated can then be used for various downstream tasks like multimodal retrieval, classification, or translation. The model uses correlations across modalities to improve performance on these tasks compared to individual modal embeddings. The key advantage is the ability to understand interactions and semantics between modalities like text, images, and audio through joint modeling.
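To make the shared representation space idea concrete, the following toy sketch compares made-up text and image vectors with cosine similarity. The numbers are invented for illustration and are not output from any real encoder.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; closer to 1 means more similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented vectors standing in for embeddings that live in one shared space
text_embedding = np.array([0.12, 0.87, 0.01, 0.45])   # e.g., the text "drinkware glass"
image_embedding = np.array([0.10, 0.90, 0.05, 0.40])  # e.g., a photo of a glass
unrelated_image = np.array([0.95, 0.02, 0.60, 0.01])  # e.g., a photo of a shoe

print(cosine_similarity(text_embedding, image_embedding))  # high: same underlying concept
print(cosine_similarity(text_embedding, unrelated_image))  # low: different concepts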

Solution overview

The solution provides an implementation for building a large language model (LLM) powered search engine prototype to retrieve and recommend products based on text or image queries. We detail the steps to use an Amazon Titan Multimodal Embeddings model to encode images and text into embeddings, ingest embeddings into an OpenSearch Service index, and query the index using the OpenSearch Service k-nearest neighbors (k-NN) functionality.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This foundation model (FM) generates embeddings of the product images used in this post. With Amazon Titan Multimodal Embeddings, you can generate embeddings for your content and store them in a vector database. When an end-user submits any combination of text and image as a search query, the model generates embeddings for the search query and matches them to the stored embeddings to provide relevant search and recommendation results to end-users. You can further customize the model to enhance its understanding of your unique content and provide more meaningful results using image-text pairs for fine-tuning. By default, the model generates vectors (embeddings) of 1,024 dimensions, and is accessed via Amazon Bedrock. You can also generate smaller dimensions to optimize for speed and performance (see the invocation sketch after this list).
  • Amazon OpenSearch Serverless – It is an on-demand serverless configuration for OpenSearch Service. We use Amazon OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Multimodal Embeddings model. An index created in the Amazon OpenSearch Serverless collection serves as the vector store for our Retrieval Augmented Generation (RAG) solution.
  • Amazon SageMaker Studio – It is an integrated development environment (IDE) for machine learning (ML). ML practitioners can perform all ML development steps—from preparing their data to building, training, and deploying ML models.
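As a reference point, the following is a minimal sketch of calling the Amazon Titan Multimodal Embeddings model through the Amazon Bedrock runtime API. The image file name and Region are assumptions, and the request assumes model access has already been granted.

import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Encode a product image (file name is illustrative); text, image, or both can be supplied
with open("product.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "inputText": "drinkware glass",
    "inputImage": image_b64,
    "embeddingConfig": {"outputEmbeddingLength": 1024},  # the default dimension
})

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=body,
    accept="application/json",
    contentType="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # 1024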

The solution design consists of two parts: data indexing and contextual search. During data indexing, you process the product images to generate embeddings for these images and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the contextual search phase, a search query (text or image) from the user is converted into embeddings, and a similarity search is run on the vector database to find similar product images. You then display the top similar results. All the code for this post is available in the GitHub repo.

The following diagram illustrates the solution architecture.

The following are the solution workflow steps:

  1. Download the product description text and images from the public Amazon Simple Storage Service (Amazon S3) bucket.
  2. Review and prepare the dataset.
  3. Generate embeddings for the product images using the Amazon Titan Multimodal Embeddings model (amazon.titan-embed-image-v1). If you have a large number of images and descriptions, you can optionally use batch inference for Amazon Bedrock.
  4. Store the embeddings in Amazon OpenSearch Serverless, which serves as the search engine.
  5. Finally, fetch the user query in natural language, convert it into embeddings using the Amazon Titan Multimodal Embeddings model, and perform a k-NN search to get the relevant search results (a sketch of this query follows the list).
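The last step can be sketched as follows with the opensearch-py client. The collection endpoint, index name, and vector field name are placeholders for this example, and query_embedding stands in for the vector returned by the Titan Multimodal Embeddings model for the user's query.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
host = "your-collection-id.us-east-1.aoss.amazonaws.com"  # placeholder collection endpoint

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")  # "aoss" signs requests for OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

query_embedding = [0.1] * 1024  # placeholder: in practice, the Titan embedding of the user's query

# k-NN query: return the 5 product documents whose image vectors are closest to the query vector
search_body = {
    "size": 5,
    "query": {"knn": {"image_vector": {"vector": query_embedding, "k": 5}}},
    "_source": ["item_name"],
}
results = client.search(index="products", body=search_body)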

We use SageMaker Studio (not shown in the diagram) as the IDE to develop the solution.

These steps are discussed in detail in the following sections. We also include screenshots and details of the output.

Prerequisites

To implement the solution provided in this post, you should have the following:

  • An AWS account and familiarity with FMs, Amazon Bedrock, Amazon SageMaker, and OpenSearch Service.
  • The Amazon Titan Multimodal Embeddings model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If Amazon Titan Multimodal Embeddings is enabled, the access status will show as Access granted, as shown in the following screenshot.

If the model is not available, enable access to the model by choosing Manage model access, selecting Amazon Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Set up the solution

When the prerequisite steps are complete, you’re ready to set up the solution:

  1. In your AWS account, open the SageMaker console and choose Studio in the navigation pane.
  2. Choose your domain and user profile, then choose Open Studio.

Your domain and user profile name may be different.

  3. Choose System terminal under Utilities and files.
  4. Run the following command to clone the GitHub repo to the SageMaker Studio instance:
git clone https://github.com/aws-samples/amazon-bedrock-samples.git
  5. Navigate to the multimodal/Titan/titan-multimodal-embeddings/amazon-bedrock-multimodal-oss-searchengine-e2e folder.
  6. Open the titan_mm_embed_search_blog.ipynb notebook.

Run the solution

Open the file titan_mm_embed_search_blog.ipynb and use the Data Science Python 3 kernel. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook performs the following steps:

  1. Install the packages and libraries required for this solution.
  2. Load the publicly available Amazon Berkeley Objects Dataset and metadata in a pandas data frame.

The dataset is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalogue images. For this post, you only use the item images and item names in US English. You use approximately 1,600 products.

  3. Generate embeddings for the item images with the Amazon Titan Multimodal Embeddings model using the get_titan_multimodal_embedding() function. For the sake of abstraction, we have defined all the important functions used in this notebook in the utils.py file.

Next, you create and set up an Amazon OpenSearch Serverless vector store (collection and index).

  4. Before you create the new vector search collection and index, you must first create three associated OpenSearch Service policies: the encryption security policy, the network security policy, and the data access policy (a boto3 sketch of these policies follows this list).

  5. Finally, ingest the image embeddings into the vector index.
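The three policies from step 4 can also be created with boto3 instead of the console. The following is a rough sketch in which the collection name, policy names, and IAM role ARN are placeholders, and the public network policy is only appropriate for a prototype.

import json
import boto3

aoss = boto3.client("opensearchserverless")
collection_name = "titan-mm-search"  # placeholder collection name

# Encryption security policy using an AWS-owned key
aoss.create_security_policy(
    name=f"{collection_name}-enc",
    type="encryption",
    policy=json.dumps({
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AWSOwnedKey": True,
    }),
)

# Network security policy allowing public access (prototype only)
aoss.create_security_policy(
    name=f"{collection_name}-net",
    type="network",
    policy=json.dumps([{
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AllowFromPublic": True,
    }]),
)

# Data access policy granting the notebook's execution role access to the collection and its indexes
aoss.create_access_policy(
    name=f"{collection_name}-access",
    type="data",
    policy=json.dumps([{
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"], "Permission": ["aoss:*"]},
            {"ResourceType": "index", "Resource": [f"index/{collection_name}/*"], "Permission": ["aoss:*"]},
        ],
        "Principal": ["arn:aws:iam::awsaccountnumber:role/your-sagemaker-execution-role"],
    }]),
)
# With the policies in place, create the collection (type "VECTORSEARCH") and the k-NN index, then ingest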

Now you can perform a real-time multimodal search.

Run a contextual search

In this section, we show the results of contextual search based on a text or image query.

First, let’s perform an image search based on text input. In the following example, we use the text input “drinkware glass” and send it to the search engine to find similar items.

The following screenshot shows the results.

Now let’s look at the results based on a simple image. The input image gets converted into vector embeddings and, based on the similarity search, the model returns the result.

You can use any image, but for the following example, we use a random image from the dataset based on item ID (for example, item_id = “B07JCDQWM6”), and then send this image to the search engine to find similar items.

The following screenshot shows the results.

Clean up

To avoid incurring future charges, delete the resources used in this solution. You can do this by running the cleanup section of the notebook.

Conclusion

This post presented a walkthrough of using the Amazon Titan Multimodal Embeddings model in Amazon Bedrock to build powerful contextual search applications. In particular, we demonstrated an example of a product listing search application. We saw how the embeddings model enables efficient and accurate discovery of information from images and textual data, thereby enhancing the user experience while searching for the relevant items.

Amazon Titan Multimodal Embeddings helps you power more accurate and contextually relevant multimodal search, recommendation, and personalization experiences for end-users. For example, a stock photography company with hundreds of millions of images can use the model to power its search functionality, so users can search for images using a phrase, image, or a combination of image and text.

The Amazon Titan Multimodal Embeddings model in Amazon Bedrock is now available in the US East (N. Virginia) and US West (Oregon) AWS Regions. To learn more, refer to Amazon Titan Image Generator, Multimodal Embeddings, and Text models are now available in Amazon Bedrock, the Amazon Titan product page, and the Amazon Bedrock User Guide. To get started with Amazon Titan Multimodal Embeddings in Amazon Bedrock, visit the Amazon Bedrock console.

Start building with the Amazon Titan Multimodal Embeddings model in Amazon Bedrock today.


About the Authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors of the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Read More

AWS and Mistral AI commit to democratizing generative AI with a strengthened collaboration

AWS and Mistral AI commit to democratizing generative AI with a strengthened collaboration

The generative artificial intelligence (AI) revolution is in full swing, and customers of all sizes and across industries are taking advantage of this transformative technology to reshape their businesses. From reimagining workflows to make them more intuitive and easier, to enhancing decision-making processes through rapid information synthesis, generative AI promises to redefine how we interact with machines. It’s been amazing to see the number of companies launching innovative generative AI applications on AWS using Amazon Bedrock. Siemens is integrating Amazon Bedrock into its low-code development platform Mendix to allow thousands of companies across multiple industries to create and upgrade applications with the power of generative AI. Accenture and Anthropic are collaborating with AWS to help organizations—especially those in highly-regulated industries like healthcare, public sector, banking, and insurance—responsibly adopt and scale generative AI technology with Amazon Bedrock. This collaboration will help organizations like the District of Columbia Department of Health speed innovation, improve customer service, and improve productivity, while keeping data private and secure. Amazon Pharmacy is using generative AI to fill prescriptions with speed and accuracy, making customer service faster and more helpful, and making sure that the right quantities of medications are stocked for customers.

To power so many diverse applications, we recognized the need for model diversity and choice for generative AI early on. We know that different models excel in different areas, each with unique strengths tailored to specific use cases, leading us to provide customers with access to multiple state-of-the-art large language models (LLMs) and foundation models (FMs) through a unified service: Amazon Bedrock. By facilitating access to top models from Amazon, Anthropic, AI21 Labs, Cohere, Meta, Mistral AI, and Stability AI, we empower customers to experiment, evaluate, and ultimately select the model that delivers optimal performance for their needs.

Announcing Mistral Large on Amazon Bedrock

Today, we are excited to announce the next step on this journey with an expanded collaboration with Mistral AI. A French startup, Mistral AI has quickly established itself as a pioneering force in the generative AI landscape, known for its focus on portability, transparency, and its cost-effective design requiring fewer computational resources to run. We recently announced the availability of Mistral 7B and Mixtral 8x7B models on Amazon Bedrock, with weights that customers can inspect and modify. Today, Mistral AI is bringing its latest and most capable model, Mistral Large, to Amazon Bedrock, and is committed to making future models accessible to AWS customers. Mistral AI will also use the AI-optimized AWS Trainium and AWS Inferentia chips to build and deploy its future foundation models on Amazon Bedrock, benefiting from the price, performance, scale, and security of AWS. Along with this announcement, starting today, customers can use Amazon Bedrock in the AWS Europe (Paris) Region. At launch, customers will have access to some of the latest models from Amazon, Anthropic, Cohere, and Mistral AI, expanding their options to support various use cases from text understanding to complex reasoning.

Mistral Large boasts exceptional language understanding and generation capabilities, which is ideal for complex tasks that require reasoning capabilities or ones that are highly specialized, such as synthetic text generation, code generation, Retrieval Augmented Generation (RAG), or agents. For example, customers can build AI agents capable of engaging in articulate conversations, generating nuanced content, and tackling complex reasoning tasks. The model’s strengths also extend to coding, with proficiency in code generation, review, and comments across mainstream coding languages. And Mistral Large’s exceptional multilingual performance, spanning French, German, Spanish, and Italian, in addition to English, presents a compelling opportunity for customers. By offering a model with robust multilingual support, AWS can better serve customers with diverse language needs, fostering global accessibility and inclusivity for generative AI solutions.
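As an illustration of what this looks like in practice, the following is a minimal sketch of invoking Mistral Large through the Amazon Bedrock runtime API. The model ID, Region, and prompt format follow Bedrock's published conventions for Mistral models at the time of writing, but treat them as assumptions to verify against the current documentation.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-3")  # for example, Europe (Paris)

body = json.dumps({
    "prompt": "<s>[INST] Write a SQL query that returns the ten best-selling products last month. [/INST]",
    "max_tokens": 512,
    "temperature": 0.2,
})

response = bedrock_runtime.invoke_model(
    modelId="mistral.mistral-large-2402-v1:0",  # assumed model ID; check the Amazon Bedrock console
    body=body,
)
print(json.loads(response["body"].read())["outputs"][0]["text"])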

By integrating Mistral Large into Amazon Bedrock, we can offer customers an even broader range of top-performing LLMs to choose from. No single model is optimized for every use case, and to unlock the value of generative AI, customers need access to a variety of models to discover what works best based for their business needs. We are committed to continuously introducing the best models, providing customers with access to the latest and most innovative generative AI capabilities.

“We are excited to announce our collaboration with AWS to accelerate the adoption of our frontier AI technology with organizations around the world. Our mission is to make frontier AI ubiquitous, and to achieve this mission, we want to collaborate with the world’s leading cloud provider to distribute our top-tier models. We have a long and deep relationship with AWS and through strengthening this relationship today, we will be able to provide tailor-made AI to builders around the world.”

– Arthur Mensch, CEO at Mistral AI.

Customers appreciate choice

Since we first announced Amazon Bedrock, we have been innovating at a rapid clip—adding more powerful features like agents and guardrails. And we’ve said all along that more exciting innovations, including new models, will keep coming. With more model choice, customers tell us they can achieve remarkable results:

“The ease of accessing different models from one API is one of the strengths of Bedrock. The model choices available have been exciting. As new models become available, our AI team is able to quickly and easily evaluate models to know if they fit our needs. The security and privacy that Bedrock provides makes it a great choice to use for our AI needs.”

– Jamie Caramanica, SVP, Engineering at CS Disco.

“Our top priority today is to help organizations use generative AI to support employees and enhance bots through a range of applications, such as stronger topic, sentiment, and tone detection from customer conversations, language translation, content creation and variation, knowledge optimization, answer highlighting, and auto summarization. To make it easier for them to tap into the potential of generative AI, we’re enabling our users with access to a variety of large language models, such as Genesys-developed models and multiple third-party foundational models through Amazon Bedrock, including Anthropic’s Claude, AI21 Labs’ Jurassic-2, and Amazon Titan. Together with AWS, we’re offering customers exponential power to create differentiated experiences built around the needs of their business, while helping them prepare for the future.”

– Glenn Nethercutt, CTO at Genesys.

As the generative AI revolution continues to unfold, AWS is poised to shape its future, empowering customers across industries to drive innovation, streamline processes, and redefine how we interact with machines. Together with outstanding partners like Mistral AI, and with Amazon Bedrock as the foundation, our customers can build more innovative generative AI applications.

Democratizing access to LLMs and FMs

Amazon Bedrock is democratizing access to cutting-edge LLMs and FMs and AWS is the only cloud provider to offer the most popular and advanced FMs to customers. The collaboration with Mistral AI represents a significant milestone in this journey, further expanding Amazon Bedrock’s diverse model offerings and reinforcing our commitment to empowering customers with unparalleled choice through Amazon Bedrock. By recognizing that no single model can optimally serve every use case, AWS has paved the way for customers to unlock the full potential of generative AI. Through Amazon Bedrock, organizations can experiment with and take advantage of the unique strengths of multiple top-performing models, tailoring their solutions to specific needs, industry domains, and workloads. This unprecedented choice, combined with the robust security, privacy, and scalability of AWS, enables customers to harness the power of generative AI responsibly and with confidence, no matter their industry or regulatory constraints.

Resources

  1. Mistral Large News Blog
  2. About Amazon Blog
  3. Mistral AI on Amazon Bedrock Product Page

About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More