Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod


In recent years, FM sizes have been increasing, and it is important to consider the massive amount of compute often required to train these models. The compute clusters used in these scenarios are often composed of thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

When using compute clusters of massive size, a single failure can often throw a training job off course and may require multiple hours of discovery and remediation. According to a report from the OPT-175B training run, about 178,000 GPU hours were wasted due to various training failures, amounting to 16 percent of the total training time. Similarly, a study by Meta AI and Carnegie Mellon University found that, in the worst cases, 43 percent of compute time was wasted because of overheads due to hardware failures. This can adversely impact a customer’s ability to keep up with the pace of innovation in generative AI and can also increase the time-to-market for their models.

Amazon SageMaker HyperPod is a service that is purpose-built to accelerate FM training, removing the undifferentiated heavy-lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks to months without disruption. To make FM training more resilient to hardware failures, SageMaker HyperPod continually monitors cluster health, repairs and replaces faulty nodes without disrupting training, and uses customer-defined checkpoints to automatically resume training from the last point of failure.

Why SageMaker HyperPod?

SageMaker HyperPod offers several benefits that make it a good choice for FM training:

  • Standby pool of nodes at no additional cost – SageMaker HyperPod provisions and manages a pool of spare nodes on the customer’s behalf. These nodes are on standby and can be automatically used to replace faulty nodes during training. This makes it so that failures don’t interrupt or delay large-scale training jobs, and these spare nodes come at no additional cost to the user. With the SageMaker HyperPod auto-resume functionality, the service can dynamically swap out unhealthy nodes for spare ones to ensure the seamless continuation of the workload.
  • Cluster placement groups for optimized training – Each instance group is launched in a cluster placement group within the same network spine, in order to get the best inter-node latency and maximize bandwidth between nodes. This is ideal for tightly coupled workloads like distributed training where low-latency communication is essential for synchronizing gradient updates and ensuring that model training scales effectively across multiple GPUs.
  • Preconfigured deep learning AMI with essential libraries – SageMaker HyperPod nodes run the SageMaker HyperPod DLAMI, which is built on top of the AWS Deep Learning Base GPU AMI (Ubuntu 20.04). The SageMaker HyperPod DLAMI is bundled with additional packages to support open source tools such as Slurm and their dependencies. Also included are SageMaker HyperPod cluster software packages, which support features such as cluster health checks and auto-resume.
  • Reusable scaling scripts for rapid experimentation – HyperPod offers a set of scalable and reusable scripts that simplify the process of launching multiple training runs. These scripts streamline infrastructure setup and deployment and can be easily adapted for different training scenarios or to run many jobs in parallel, making large-scale training more manageable. By reducing repetitive tasks and providing reusable automation, these scripts empower users to quickly scale up or down, test different model variations, and iterate faster, improving productivity and reducing operational overhead.
  • Auto-resume functionality – This is one of the most valuable features of SageMaker HyperPod. When a node fails, SageMaker HyperPod automatically replaces it with a healthy node from the spare pool and resumes the job from the last saved checkpoint with minimal disruption to training. This is particularly crucial for long-running training jobs, where even minor interruptions can lead to significant delays.
  • Real-time performance dashboards with few-click setup – SageMaker HyperPod integrates seamlessly with real-time dashboards to monitor node health, GPU utilization, network traffic, and other key metrics. This can be done with just a few clicks, providing full visibility into training jobs and allowing teams to optimize performance in real-time.

In this post, we present to you an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod. We review components of the Slurm orchestrated SageMaker HyperPod cluster setup, primarily focusing on the resiliency and feature set of SageMaker HyperPod, including automatic fault detection and integration with open source tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Overview of SageMaker HyperPod resiliency

Some of the health check metrics used by SageMaker HyperPod include:

  • Accelerator issues – Checks for GPU issues, including DCGM policies such as XID errors, GPU health through nvidia-smi, and Trainium issues by reading from Neuron sysfs
  • Networking issues – Checks the health of Elastic Fabric Adapter (EFA) devices
  • Stress checks – Runs processes on accelerators and multiple threads on CPUs to achieve 100 percent utilization to determine the health of the CPU or accelerator. Specifically, DCGM Diagnostics Level 2 tests are run for GPUs, and CPU health is determined using the Linux stress tool. You can run similar checks manually, as shown in the example after this list.
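
If you want to spot-check a GPU node yourself, the same underlying tools are available on the instances. The following commands are illustrative and assume the NVIDIA drivers and DCGM are installed (they are part of the DLAMI); they are not the HyperPod health check agent itself.

nvidia-smi          # report GPU status, memory usage, and utilization
dcgmi diag -r 2     # run DCGM diagnostics at level 2, the same level used by the health checks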

SageMaker HyperPod continuously performs health checks on crucial components, including GPUs, AWS Trainium cores, and EFA networking devices. This proactive approach allows the HyperPod health check agent to identify various hardware failures or potential performance degradation. When hardware failures are detected, SageMaker HyperPod identifies faulty instances and can use its auto-resume functionality to initiate a replacement process without manual intervention. This feature automatically detects hardware failures, seamlessly replaces faulty instances, and resumes jobs from the last saved checkpoint. In addition, SageMaker HyperPod gives you the ability to manually replace a node in case a node is stuck with an issue that isn’t being resolved by the auto-resume functionality. You can manually change the state of the node to fail, and SageMaker HyperPod will replace it with a healthy instance. For a more in-depth dive into resiliency with SageMaker HyperPod, refer to the Resiliency section of this post.

Overview of SageMaker HyperPod observability

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate your cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana. The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your SageMaker HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster’s behavior. By using these services, you gain a centralized and unified view of your SageMaker HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads. The Observability section of this post goes into more detail on which metrics are exported and what the dashboards look like in Amazon Managed Grafana.

This post is primarily focused on Amazon Managed Service for Prometheus and Amazon Managed Grafana for observability. To explore more observability integrations with SageMaker HyperPod like Nvidia Nsight, refer to the validation and observability folder of the awsome-distributed-training GitHub repo.

These resiliency and observability features collectively contribute to a more reliable and efficient training environment, minimize downtime, and optimize resource usage. By directly integrating with Amazon Managed Service for Prometheus and Amazon Managed Grafana and abstracting the management of hardware failures and job resumption, SageMaker HyperPod allows data scientists and ML engineers to focus on model development rather than infrastructure management.

Mathstral model from Mistral AI

Mathstral is a model designed for math reasoning and scientific discovery. It is based on the original Mistral 7B model and features a 32k context window. The release of Mathstral aligns with Mistral AI’s broader effort to support academic and scientific research, particularly through their collaboration with Project Numina. As a 7B model, Mathstral sets a new standard in the performance and latency space compared to similar models used for math and reasoning generation, and it can achieve significantly better results with more inference-time computation.

Overview of PyTorch FSDP

In distributed data parallel (DDP) training, each process or worker owns a replica of the model and processes a batch of data, then uses all-reduce to sum gradients across workers. In DDP, the model weights and optimizer states are replicated across all workers, so each GPU must have enough memory to store the entire model. For training larger FMs that don’t fit on a single GPU, an approach like FSDP is recommended. FSDP is a type of data parallelism that shards model parameters, optimizer states, and gradients across DDP ranks. This reduces the memory requirement on each individual GPU and distributes the memory load across GPUs. With FSDP’s improved memory efficiency, researchers and developers can use fewer GPUs, thereby minimizing operational costs and achieving faster model convergence.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes the training of some very large models feasible by allowing them to be loaded into memory with a lower memory footprint. However, this comes at the cost of increased communication volume. For more information on FSDP, refer to PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.
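
To make the sharding idea concrete, the following is a minimal, illustrative PyTorch FSDP sketch using a toy model launched with torchrun; it is not the training script used in this post. Wrapping the module in FSDP is what causes parameters, gradients, and optimizer states to be sharded across ranks.

# fsdp_sketch.py - minimal FSDP example (illustrative only)
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR/PORT for us
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Each rank now holds only a shard of the parameters instead of a full replica
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()   # gradients are reduce-scattered across ranks
    optimizer.step()  # each rank updates only its own parameter shard

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

You would launch this with something like torchrun --nproc_per_node=8 fsdp_sketch.py on a single node.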

Solution overview

The following image shows the architecture diagram for the resources deployed as part of SageMaker HyperPod for our use case of training the Mathstral model. In your account, a VPC is provisioned with a public and a private subnet, and an S3 bucket is synced to your FSx for Lustre file system through a data repository association. In the service team account, your cluster of P4de instances is provisioned, along with the head node and the login node, which you use to submit the training job to your cluster.

Prerequisites

In the context of this post, we use four p4de.24xlarge instances. You can find more information on the p4de.24xlarge instance type at Amazon EC2 P4 Instances. To get the best inter-node latency, we launch these instances together in a cluster placement group and only run jobs on a single instance group. You can also follow along with this post using a variety of other instance types.

For more information on getting access to instances in a partition group, refer to the Getting Started section of this post. Note that Mathstral 7B at full precision (FP32) is approximately 26 GB in size, so you need to make sure that your cluster configuration has sufficient GPU memory to load the model into memory along with the gradients, activations, and optimizer moments. These account for a total of roughly 107 GB, in addition to the training assets required to kick off a job successfully. For demonstration purposes, we use FSDP for this continued pre-training job.

The following sections describe setting up your infrastructure and environment with SageMaker HyperPod. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop. The prerequisites and cluster setup parts of this workshop go over all the required components needed in order to set up your cluster. The workshop also provides resources to troubleshoot commonly faced issues during setup.

Set up your infrastructure

Deploy HyperPod VPC stack

To set up your cluster, you first need to create some resources. You can create them by deploying the SageMaker HyperPod VPC CloudFormation stack. By default, usw2-az4 is specified as the Availability Zone; change this to reflect the Availability Zone where you plan to create your cluster. This VPC stack creates the following resources:

  • Subnet – A private subnet in the Availability Zone ID that you choose to use
  • Security group – Allows SageMaker HyperPod to mount your Amazon FSx for Lustre file system
  • FSx for Lustre file system – Serves as the shared file system that all the nodes can access. It’s a 1.2 TB PERSISTENT_2 file system in the private subnet you create, and it gets mounted at /fsx.
  • Linux environment – Provides a standardized development environment to work in
  • Amazon Simple Storage Service (Amazon S3) bucket – Stores your lifecycle scripts
  • AWS Identity and Access Management (IAM) role – The role required for creating the SageMaker HyperPod cluster

Deploy the observability stack

In order to use the observability integration with SageMaker HyperPod, you need to deploy the SageMaker HyperPod Observability CloudFormation stack, which can then be used to monitor your cluster metrics in real time.

Set up your environment

Let’s move on to environment setup. In order to deploy this solution, you need to use a Linux-based development environment. This section briefly describes the steps required to set up your cluster. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop.

Set up your cluster

This section guides you through the process of deploying a cluster to train with. You need to set up the following:

  • Head node and compute nodes – The head node is an m5.12xlarge instance, and the worker group consists of p4de.24xlarge instances. Refer to the following table for details on these instance types.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes
  • Placement group enabled – A placement group launches instances close together inside one physical data center in a single Availability Zone to maximize the bandwidth and reduce the latency between instances
  • Local storage – Each node has an 8 TB local NVMe volume attached for local storage
  • Scheduler – Slurm is used as the job scheduler
  • Accounting – As part of cluster configuration, a local MariaDB database is deployed to keep track of job runtime information
| Instance size | GPU devices | Total GPU memory | vCPUs | CPU memory | EFA bandwidth |
|---|---|---|---|---|---|
| p4de.24xlarge | 8 | 640 GB | 96 | 1152 GB | 400 Gbps |

Set up the AWS CLI

Before creating the cluster and its associated resources, you need to set up the AWS Command Line Interface (AWS CLI) using the latest version (or version 2.17.1 at a minimum).

To check the AWS CLI version, use the following command.

aws --version

To update the AWS CLI to the latest version, use the following command.

sudo ./aws/install --update

The AWS CLI plugin for Session Manager, a capability of AWS Systems Manager, must be installed to access your cluster. To install the Session Manager plugin on Amazon Linux 2, use the following command:

sudo yum install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm

For detailed steps on installing and setting up the AWS CLI, follow the steps provided in the Install AWS CLI section of the Amazon SageMaker HyperPod workshop.

Source environment variables

An important part of the setup is to source in all the environment variables, using the output from the VPC CloudFormation stack deployed in a previous step. Use the following command.

curl 'https://static.us-east-1.prod.workshops.aws/public/e3e1b2f1-8140-43eb-a316-e76f569119dd/static/scripts/create_config.sh' --output create_config.sh
bash create_config.sh
source env_vars

Once you have sourced them in, confirm that they were correctly set using the following command.

cat env_vars

Set up lifecycle scripts

SageMaker HyperPod uses a collection of lifecycle scripts  to bootstrap the cluster. These scripts are responsible for several actions, including setting up Slurm and mounting the FSx for Lustre file system. You need to customize these scripts in order to mount your FSx for Lustre file system. For detailed steps on setting up these lifecycle scripts, refer to the Set Up Lifecycle Scripts section of the workshop.

Make sure to complete the Enable Optional Lifecycle Scripts section after step 4 of the Set Up Lifecycle Scripts section. This enables the installation of the exporter services on the cluster, which are required to emit metrics to Amazon Managed Service for Prometheus.

Additionally, the observability stack requires the following two AWS managed IAM policies to be added to your AmazonSagemakerClusterExecutionRole prior to creating your cluster.

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
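
Optionally, before creating the cluster, you can confirm that both policies are attached to the role (this assumes $ROLENAME is still set in your shell).

aws iam list-attached-role-policies --role-name $ROLENAME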

Once you have uploaded the lifecycle scripts to Amazon S3, you can then create your cluster.

Create your cluster

To create your cluster, you need your cluster configuration. Because you use p4de.24xlarge for this example, copy the following cluster configuration.

source env_vars
cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
        "InstanceGroupName": "controller-machine",
        "InstanceType": "ml.m5.12xlarge",
        "InstanceStorageConfigs": [
          {
            "EbsVolumeConfig": {
              "VolumeSizeInGB": 500
            }
          }
        ],
        "InstanceCount": 1,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4de.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      }
    ],
    "VpcConfig": {
      "SecurityGroupIds": ["$SECURITY_GROUP"],
      "Subnets":["$SUBNET_ID"]
    }
}
EOL

If you use a different instance type for your cluster, refer to the Create Cluster section of the workshop to create your cluster-config.json file.

SageMaker HyperPod also gives you the ability to update your clusters to increase the size of an existing worker group or to create a new worker group that adds additional instance types to your cluster. For steps on updating the cluster to create additional worker groups that use other instance types, refer to the Heterogeneous Clusters section of the workshop.

Once you’ve created the cluster-config.json file, follow the Create Cluster steps in the workshop to create the FSx for Lustre configuration file (provisioning_parameters.json) and upload it to Amazon S3. Then, you can validate the configuration using the validate-config.py file in the awsome-distributed-training GitHub repo.
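
At the time of writing, the workshop invokes the validation script roughly as follows; the flag names are an assumption based on the workshop and may change, so check the script’s --help output if it errors.

python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning_parameters.json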

Once this validation is completed, you can create your cluster. Use the following command.

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region $AWS_REGION

To check the state of your cluster, run the following command.

aws sagemaker list-clusters --output table

You should then be able to observe the cluster creating.

-----------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                          ListClusters                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                       ClusterSummaries                                                          ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
||                        ClusterArn                              |    ClusterName       | ClusterStatus |               CreationTime              ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
|| arn:aws:sagemaker:us-west-2:{cluster arn}                      |  ml-cluster          | Creating      | time taken to create                    ||
|+----------------------------------------------------------------+----------------------+---------------+-----------------------------------------+|
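
For more detail on a specific cluster, including the status of each instance group, you can also run the following command.

aws sagemaker describe-cluster --cluster-name ml-cluster --region $AWS_REGION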

Now that you’ve created a cluster, you can monitor the status in the SageMaker console. This will show you cluster status, running instances, and node groups and allow you to modify the cluster. In the SageMaker HyperPod console, find your cluster and select it, as shown in the following screenshot.

Once the cluster status changes to InService, you can connect using Secure Shell (SSH). Make sure that you completed the step in Set up the AWS CLI to install the Session Manager plugin. You can then use the easy-ssh.sh script from the repo, which wraps the SSM command, to connect to the controller-machine over SSH.

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

Use the following command to switch to the ubuntu user.

sudo su - ubuntu

Refer to the Get to know your Cluster section in the SageMaker HyperPod workshop to familiarize yourself with the commands you need to use in the later sections.

Finally, set up SSH access to the compute nodes. To do this, add an SSH key pair to the /fsx/ubuntu directory. Because all the compute nodes mount this directory, you only have to do this once for the ubuntu user to access all the compute nodes. For instructions, refer to the SSH Access to compute section of the workshop.
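
As a rough sketch of what that section walks through, run as the ubuntu user on the head node and assuming the default /fsx/ubuntu home directory:

# Generate a key pair and authorize it; because /fsx/ubuntu is shared,
# every compute node sees the same authorized_keys file.
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys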

Congrats on setting up your environment! Now that you’ve completed the necessary steps, you can move on to your training job.

Run your pre-training job

Follow these steps on your cluster head node:

  1. Navigate to your shared FSx for Lustre file system. If you followed the tutorial linked previously, it will be located at /fsx.
  2. Use the following command to clone the awsome-distributed-training repo.
cd /fsx
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/10.FSDP
  3. Run the 0.create_conda_env.sh script.

This script will first download and install Miniconda, then create a Conda environment called pt_fsdp. The Conda environment installs PyTorch on AWS, which is a package that is built to run PyTorch workloads on AWS. Specifically, it lets you use EFA out of the box, since OFI-NCCL is pre-built in the Conda package. PyTorch on AWS also provides the latest versions of CUDA, cuDNN, and NCCL for the best performance on GPU-based instances. Dependencies required to run your FSDP training job will be installed in this Conda environment, and since this Conda environment is created on the /fsx file system, it’ll be shared across all your training nodes.

bash 0.create_conda_env.sh

For this training job, you use the C4 dataset, which is several hundred gigabytes. Instead of downloading the whole thing, the create_streaming_dataloaders function will stream the dataset from HuggingFace, so there’s no data prep required for running this training.

If you want to use your own dataset instead, you can format it as a HuggingFace dataset and pass its location to the --dataset_path argument.

Launch training

The script to launch the Mathstral training job can be found in 3.distributed-training-mistral-mathstral.sbatch. Depending on the number of nodes in your cluster, you can adjust it by modifying #SBATCH --nodes=4. Because we are using four p4de.24xlarge instances, it has been set to 4.

For the purpose of this post, you need to make sure that the FI_EFA variables for EFA are exported in the 3.distributed-training-mistral-mathstral.sbatch file. If you use instances not enabled for remote direct memory access (RDMA), such as the g5.12xlarge, comment out lines 21–22 of this file. These instances have EFA between nodes, but do not have the GPU direct RDMA access of p4d/e and p5 instances. In this walkthrough, we are using p4de instances, so we leave these lines uncommented.

## Plenty of EFA level variables
## Comment out for non-efa instances (G5, G4d, P3)
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4de
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

Under User Variables, make sure to adjust GPUS_PER_NODE to match the number of GPUs on your instance type (8 for p4de).

You can also adjust the training parameters in TRAINING_ARGS. Additional parameters can be found in model/arguments.py.

We use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint will automatically select the most recent one. This way, if the training is interrupted for any reason, it will automatically pick up the most recent checkpoint.

Note: You may change these hyperparameters in the 3.distributed-training-mistral-mathstral.sbatch file. We are using arbitrary hyperparameters here for the sake of demonstration.

declare -a TRAINING_ARGS=(
    --train_batch_size=1 
    --val_batch_size=1 
    --max_steps=5000 
    --seed=42 
    --grad_clip=1.0 
    --weight_decay=0.2 
    --beta1=0.9 
    --beta2=0.95 
    --activation_checkpointing=1 
    --intermediate_size=14336 
    --num_key_value_heads=8 
    --logging_freq=1 
    --max_context_width=32768 
    --vocab_size=32768 
    --hidden_width=4096 
    --num_layers=32 
    --num_heads=32 
    --resid_pdrop=0.1 
    --embd_pdrop=0.1 
    --attn_pdrop=0.1 
    --summary_first_pdrop=0.1 
    --initializer_range=0.02 
    --model_type="mistral" 
    --rotary_pct=0.25 
    --rotary_emb_base=10000 
    --lr=0.0001 
    --lr_decay_style="cosine" 
    --min_lr=1e-5 
    --warmup=0.0032 
    --plateau=0.0 
    --dataset="c4" 
    --tokenizer="mistralai/mathstral-7B-v0.1" 
    --epochs=3 
    --checkpoint_dir="./checkpoints/mathstral-7B" 
    --resume_from_checkpoint="./checkpoints/mathstral-7B" 
    --checkpoint_freq=50 
    --validation_freq=500 
    --dataset_config_name="en" 
    --limit_all_gathers=1 
    --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html
    --offload_activations=1
)

To launch your training, run the following command.

sbatch 3.distributed-training-mistral-mathstral.sbatch

You’ll find a new file in the FSDP directory of the form slurm-[job-number].out. This will be continuously updated with your training logs. Don’t be worried if you notice a long stream of NCCL logs (we prefer to use NCCL_DEBUG=INFO for verbose logging). After about a minute, you should observe your Mathstral model training, with an output similar to the following.
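
To keep an eye on the job from the head node, you can use standard Slurm commands, for example:

squeue                          # list running jobs and the nodes allocated to them
tail -f slurm-<job-number>.out  # stream the training log (replace <job-number> with your job ID)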

...
+ TORCHRUN=./pt_fsdp/bin/torchrun
+ export TRAIN_SCRIPT=./train.py
+ TRAIN_SCRIPT=./train.py
+ TRAINING_ARGS=(--train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type="mistral" --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style="cosine" --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset="c4" --tokenizer="mistralai/mathstral-7B-v0.1" --epochs=3 --checkpoint_dir="./checkpoints/mathstral-7B" --resume_from_checkpoint="./checkpoints/mathstral-7B" --checkpoint_freq=50 --validation_freq=500 --dataset_config_name="en" --limit_all_gathers=1 --sharding_strategy="full"  # https://pytorch.org/docs/stable/fsdp.html --offload_activations=1)
+ declare -a TRAINING_ARGS
+ AUTO_RESUME=
+ '[' -d /opt/sagemaker_cluster ']'
+ echo 'Detected Hyperpod cluster.. enabling --auto-resume=1'
Detected Hyperpod cluster.. enabling --auto-resume=1
+ AUTO_RESUME=--auto-resume=1
+ srun --auto-resume=1 -l ./pt_fsdp/bin/torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id=35 --rdzv_backend=c10d --rdzv_endpoint=ip-10-2-39-253 ./train.py --train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type=mistral --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style=cosine --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset=c4 --tokenizer=mistralai/mathstral-7B-v0.1 --epochs=3 --checkpoint_dir=./checkpoints/mathstral-7B --resume_from_checkpoint=./checkpoints/mathstral-7B --checkpoint_freq=50 --validation_freq=500 --dataset_config_name=en --limit_all_gathers=1 --sharding_strategy=full ' #' https://pytorch.org/docs/stable/fsdp.html --offload_activations=1
...
3: 2024-07-19 03:31:38 I [train.py:155] Creating Model
3: 2024-07-19 03:33:08 I [train.py:171] Created model with total parameters: 7248023552 (7.25 B)
3:...
3: 2024-07-19 03:33:23 I [train.py:209] Wrapped model with FSDP
3: 2024-07-19 03:33:23 I [train.py:226] Created optimizer
3: 2024-07-19 03:33:23 I [checkpoint.py:70] No Checkpoints Found
...
3: 2024-07-19 03:33:35 I [train.py:102] Batch 0 Loss: 11.19900, Speed: 5.10 samples/sec, lr: 0.000006
3: 2024-07-19 03:33:38 I [train.py:102] Batch 1 Loss: 11.18291, Speed: 10.96 samples/sec, lr: 0.000013
3: 2024-07-19 03:33:40 I [train.py:102] Batch 2 Loss: 11.09163, Speed: 11.22 samples/sec, lr: 0.000019
3: 2024-07-19 03:33:43 I [train.py:102] Batch 3 Loss: 10.86621, Speed: 11.19 samples/sec, lr: 0.000025
3: 2024-07-19 03:33:46 I [train.py:102] Batch 4 Loss: 10.58236, Speed: 11.12 samples/sec, lr: 0.000031
3: 2024-07-19 03:33:49 I [train.py:102] Batch 5 Loss: 10.08024, Speed: 11.18 samples/sec, lr: 0.000038
3: 2024-07-19 03:33:52 I [train.py:102] Batch 6 Loss: 10.15507, Speed: 11.23 samples/sec, lr: 0.000044
3: 2024-07-19 03:33:55 I [train.py:102] Batch 7 Loss: 9.97296, Speed: 10.42 samples/sec, lr: 0.000050
3: 2024-07-19 03:33:58 I [train.py:102] Batch 8 Loss: 10.13596, Speed: 11.21 samples/sec, lr: 0.000056
3: 2024-07-19 03:34:01 I [train.py:102] Batch 9 Loss: 9.93156, Speed: 11.10 samples/sec, lr: 0.000063

Observability

SageMaker HyperPod can optionally be integrated with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about your cluster and cluster-nodes to an Amazon Managed Grafana dashboard.

For more details about configuring Amazon Managed Service for Prometheus and Amazon Managed Grafana, refer to the Prometheus Configuration and Amazon Managed Grafana sections in the SageMaker HyperPod workshop.

Slurm Exporter dashboard

The Amazon Managed Grafana Slurm dashboard (ID: 4323) provides visualization options for monitoring Slurm clusters. The Prometheus Slurm Exporter is installed on the controller node of the cluster. Some of the metrics exported include:

  • Cluster overview – Displays the total number of nodes, jobs, and their states
  • Job metrics – Visualizes job counts and states over time
  • Node metrics – Shows node states, allocation, and available resources
  • Partition metrics – Monitors partition-specific metrics such as CPU, memory, and GPU utilization
  • Job efficiency – Calculates job efficiency based on resources used

The following screenshot of the exporter dashboard shows the continued pre-training job for Mathstral being completed successfully.

Node Exporter dashboard

The Amazon Managed Grafana Node Exporter Full dashboard (ID: 1860) offers visualization options for monitoring system metrics collected by the Prometheus Node Exporter installed on the cluster nodes. Some of the key metrics you can visualize include:

  • System overview – Displays CPU load averages and memory usage
  • Memory metrics – Visualizes memory utilization including total memory, free memory, and swap space
  • Disk usage – Monitors disk space utilization and availability
  • Network traffic – Shows network bytes received and transmitted over time
  • File system metrics – Analyzes file system usage and availability
  • Disk I/O metrics – Visualizes disk read and write activity

DCGM Exporter dashboard

The Amazon Managed Grafana NVIDIA DCGM Exporter dashboard (ID: 12239) offers visualization options for monitoring NVIDIA GPU metrics collected by the DCGM Exporter. Some of the key metrics you can visualize include:

  • GPU overview – Displays GPU utilization, temperatures, power usage, and memory usage
  • Temperature metrics – Visualizes GPU temperatures over time
  • Power usage – Monitors GPU power draw and power usage trends
  • Memory utilization – Analyzes GPU memory usage, including used, free, and total memory
  • Fan speed – Shows GPU fan speeds and variations
  • ECC errors – Tracks GPU memory ECC errors and pending errors

EFA Metrics dashboard

The Amazon Managed Grafana EFA Metrics dashboard (ID: 20579) offers visualization options for monitoring EFA related metrics collected by the EFA Node Exporter. Some of the key visualizations include:

  • EFA error metrics – Visualizes errors such as allocation errors, command errors, and memory map errors
  • EFA network traffic – Monitors received and transmitted bytes, packets, and work requests
  • EFA RDMA performance – Analyzes RDMA read and write operations, including bytes transferred and error rates
  • EFA port lifespan – Displays the lifespan of EFA ports over time
  • EFA keep-alive packets – Tracks the number of keep-alive packets received

FSx Metrics dashboard

The Amazon Managed Grafana FSx for Lustre dashboard (ID: 20906) offers visualization options for monitoring Amazon FSx for Lustre file system related metrics collected by Amazon CloudWatch. Some of the key visualizations include:

  • DataReadBytes – The number of bytes for file system read operations
  • DataWriteBytes – The number of bytes for file system write operations
  • DataReadOperations – The number of read operations
  • DataWriteOperations – The number of write operations
  • MetadataOperations – The number of metadata operations
  • FreeDataStorageCapacity – The amount of available storage capacity

These metrics provide insights into various aspects of your FSx for Lustre file systems.
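
Because these FSx for Lustre metrics come from CloudWatch, you can also pull them directly with the AWS CLI. The following is an illustrative query for the last hour of read throughput; replace <your-fs-id> with your file system ID.

aws cloudwatch get-metric-statistics \
    --namespace AWS/FSx \
    --metric-name DataReadBytes \
    --dimensions Name=FileSystemId,Value=<your-fs-id> \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 60 \
    --statistics Sum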

Resiliency

As mentioned previously, one of the value propositions of SageMaker HyperPod is that it provides a variety of cluster resiliency features such as cluster health checks, auto-resume, and the option to manually replace faulty nodes.

Based on the status of these health checks, SageMaker HyperPod detects whether nodes in the cluster are healthy or not. If a node is deemed unhealthy by any of the health checks, SageMaker HyperPod uses its auto-resume feature to automatically replace the faulty node, without any manual intervention.

Additionally, users have the option to implement checkpointing in their training procedure. Checkpointing, combined with auto-resume, means that once a faulty node is replaced, the training job can resume from the last saved checkpoint. This way, despite a hardware failure, a user’s training job can run with minimal loss in progress.

In this section, we demonstrate the resiliency and auto-resume feature of SageMaker HyperPod by simulating a hardware failure scenario and pointing you towards some logs that indicate the success of a replacement job. We use the same submitted FSDP training job, which has the following two important components enabled:

  1. Checkpointing is enabled and implemented
  2. The --auto-resume=1 flag is set. You can verify this in the Slurm .out file.

This section in the provided sbatch file sets the --auto-resume=1 flag.

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi
srun ${AUTO_RESUME} -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

The sbatch file has the checkpointing flags checkpoint_freq, checkpoint_dir, resume_from_checkpoint, which tell the job how often to write checkpoints, where to write the checkpoints to, and what directory to read checkpoints from in case of failure, respectively.

Assuming that you already have your training job submitted, wait until a few checkpoints are written to the ./checkpoints directory (or the directory you specified for checkpoint_dir). You can check whether any checkpoints were written by running ls -lt checkpoints/. This should return an output that resembles the following.

total 74
-rw-rw-r--  1 ubuntu ubuntu     1 Dec  9 00:21 latest_checkpointed_iteration.txt
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:20 iter_0000002
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:11 iter_0000001

You may also check the progress of your training job by running tail -f slurm-<job-id>.out, where <job-id> can be found by running squeue. You should observe an output that resembles the following.

1:  iteration        1/  508626 | consumed samples:          288 | elapsed time per iteration (ms): 440352.6 | learning rate: 0.000E+00 | global batch size:   288 | loss scale: 4294967296.0 | number of skipped iterations:   1 | number of nan iterations:   0 |0: saving checkpoint at iteration       1 to /fsx/checkpoints
0:   successfully saved checkpoint at iteration       1 to /fsx/checkpoints
1: (min, max) time across ranks (ms):
1:     save-checkpoint ................................: (81611.24, 81611.82)

Once you’ve confirmed that your training job is running and that you have checkpoints written, you are ready to simulate a hardware failure.

As part of the output of running squeue, you receive an output that resembles the following.

JOBID PARTITION     NAME   USER ST  TIME  NODES NODELIST(REASON)
32          dev interact ubuntu  R  0:02      4  ip-10-2-9-98,...

This tells you what jobs are running and on what nodes. Locate your training job and choose any of its allocated nodes except the first one in the list; the node you choose here is the one you will inject an error into. Avoiding the first node is very important because PyTorch uses node 0 (that is, the first node) as the coordination node for your training job.

Once you’ve identified the node to inject the error into, connect to it using SSH with the following command.

ssh <NODE ip>

You can inject an ECC error by running the following command.

dcgmi test --inject --gpuid 0 -f 319 -v 4

This simulates a double-bit error (DBE) in the GPU of your chosen node. Additionally, to kill the training job and simulate a job failure, note the process ID (PID) of any of the running Python processes; these are the processes running your FSDP training job. The -9 flag in the following command is the signal number for the SIGKILL signal, which forces a process to stop without giving it a chance to clean up or perform any other actions.

ps -aux | grep python
kill -9 <PID>

Once the ECC error is injected and the Python process has stopped, you can exit out of your compute node. In the meantime, you can get the output of the slurmctld.log file using the following command.

tail -f /var/log/slurm/slurmctld.log

In there, you can observe the following lines, which show a failed job or node.

 [2024-07-19T04:13:03.313] sched: Allocate JobId=35 NodeList-ip-10-2-39-253, ip-10-2-40-102, ip-10-2-76-26, ip-10-2-108-162 #CPUs=192 Partition=dev
 [2024-07-19T04:50:31.682] _slurm_rpc_submit_batch_job: JobId=35 InitPrio=1 usec=727
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 reason set to: Action: Replace 
 [2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 state set to FAILING

Pay attention to the line that says update_node: node ip-10-2-39-253 reason set to: Action: Replace, which is the log entry indicating that the node has failed and requires replacement.

If you look at your <slurm-job>.out file, you should observe logs like the following.

[Auto Resume] Info: JobID: 35 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Response from cluster agent: JobID=35, ResumeAction=RETRYSTEP
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - replacing nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - Dropping unhealthy nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Succesfully shrink job to retain healthy nodes ...
srun: job 35 queued and waiting for resources

This shows that job 35 (your training job) is paused and requeued while the replacement process is initiated. You can verify this by running squeue, where you will observe an auto-res job with the same job ID; this is the auto-resume job that SageMaker HyperPod initiates to replace your faulty node.

JOBID PARTITION  NAME      USER ST    TIME NODES NODELIST(REASON)
35    dev    auto-res    ubuntu PD    0:00     4 (Resources)
...

You can also monitor your SageMaker HyperPod cluster using the AWS console. Under Instances, you should observe one of the nodes in worker-group-1 in Pending state, as shown in the following screenshot. This shows that the node is about to get replaced.

Once your node is replaced, you can observe the slurmctld.log file again. Look for the following line:

update_node: node <YOUR-NODE-IP-ADDRESS> reason set to: AWS:Replaced
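
For example, you can search for it directly in the log file the post referenced earlier.

grep "AWS:Replaced" /var/log/slurm/slurmctld.log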

You can also verify that your node was successfully replaced using the HyperPod cluster tab in the Amazon SageMaker console.

Once your node is replaced, squeue should no longer display the auto-res job and should only display your original training job. The node is successfully replaced, without any manual intervention.

Because you enabled checkpointing, you can verify that the training job resumes from the latest checkpoint. In your <slurm-job>.out file, find the following lines, which show that a checkpoint was detected in the checkpoint directory (./checkpoints) and that the latest checkpoint was loaded, respectively.

...
Loading checkpoint from checkpoints/mathstral-10steps ...
...
Checkpoint loaded from checkpoints/mathstral-10steps ...
...

If you continue to monitor your <slurm-job>.out file, you should observe that your training job has resumed from the latest checkpoint.

Clean up

  1. To delete your cluster, enter the following command.
aws sagemaker delete-cluster --cluster-name ml-cluster

Once you are done deleting the cluster, make sure it is deleted in the SageMaker HyperPod clusters section under SageMaker.

  2. To use the console to delete your SageMaker HyperPod VPC and Observability CloudFormation stacks, follow the directions at Delete a stack from the CloudFormation console. Alternatively, use the AWS CLI by entering the following command. Replace my-stack with the name of your stacks.
aws cloudformation delete-stack \
    --stack-name my-stack

Conclusion

In this post, we provided a comprehensive guide on using Amazon SageMaker HyperPod for training large-scale models such as Mistral AI’s Mathstral using PyTorch Fully Sharded Data Parallel (FSDP). The process highlighted the efficiency of distributed training on SageMaker HyperPod, showcasing the critical role of resiliency and observability features in maintaining uninterrupted, scalable training environments.

Because of the integration with tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana for real-time monitoring, along with the robust cluster management capabilities of SageMaker HyperPod, ML practitioners can focus on model development rather than infrastructure management. The detailed steps for setting up the infrastructure, deploying the observability stack, and running a training job demonstrate how SageMaker HyperPod helps tackle the complexities of distributed training.

Moreover, the automatic health checks and the auto-resume feature significantly reduce downtime and minimize the impact of hardware failures so that large-scale training jobs can proceed with minimal interruptions. This level of resilience is crucial for maintaining the pace of innovation in AI research, especially when dealing with massive FMs.

By following the outlined procedures and using the powerful tools provided by AWS, data scientists and engineers can optimize their training workflows, reduce operational overhead, and accelerate the development of state-of-the-art models.

Getting Started

Interested in getting started with SageMaker HyperPod? Reach out to your AWS Account Team or email aws-frameworks-gtm@amazon.com. To begin experimenting with other examples on SageMaker HyperPod, refer to the awsome-distributed-training GitHub repo and the Amazon SageMaker HyperPod workshop.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with key GenAI foundation model providers, AWS service teams, strategic customers, founders, universities, venture ecosystems, and Amazon to develop technology strategy that enables the next generation of artificial intelligence, machine learning, and accelerated computing on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Read More

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate


This post has been co-written with Artem Sysuev, Danny Portman, Matúš Chládek, and Saurabh Gupta from Zeta Global.

Zeta Global is a leading data-driven, cloud-based marketing technology company that empowers enterprises to acquire, grow and retain customers. The company’s Zeta Marketing Platform (ZMP) is the largest omnichannel marketing platform with identity data at its core. The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. For more information, see Zeta Global’s home page.

What Zeta has accomplished in AI/ML

In the fast-evolving landscape of digital marketing, Zeta Global stands out with its groundbreaking advancements in artificial intelligence. Zeta’s AI innovations over the past few years span 30 pending and issued patents, primarily related to the application of deep learning and generative AI to marketing technology. Using AI, Zeta Global has revolutionized how brands connect with their audiences, offering solutions that aren’t just innovative, but also incredibly effective. As an early adopter of large language model (LLM) technology, Zeta released Email Subject Line Generation in 2021. This tool enables marketers to craft compelling email subject lines that significantly boost open rates and engagement, tailored perfectly to the audience’s preferences and behaviors.

Further expanding the capabilities of AI in marketing, Zeta Global has developed AI Lookalikes. This technology allows companies to identify and target new customers who closely resemble their best existing customers, thereby optimizing marketing efforts and improving their return on investment (ROI). The backbone of these advancements is ZOE, Zeta’s Optimization Engine. ZOE is a multi-agent LLM application that integrates with multiple data sources to provide a unified view of the customer, simplify analytics queries, and facilitate marketing campaign creation. Together, these AI-driven tools and technologies aren’t just reshaping how brands perform marketing tasks; they’re setting new benchmarks for what’s possible in customer engagement.

In addition to its groundbreaking AI innovations, Zeta Global has harnessed Amazon Elastic Container Service (Amazon ECS) with AWS Fargate to deploy a multitude of smaller models efficiently.

Zeta’s AI innovation is powered by a proprietary machine learning operations (MLOps) system, developed in-house.

Context

In early 2023, Zeta’s machine learning (ML) teams shifted from traditional vertical teams to a more dynamic horizontal structure, introducing the concept of pods comprising diverse skill sets. This paradigm shift aimed to accelerate project delivery by fostering collaboration and synergy among teams with varied expertise. The need for a centralized MLOps platform became apparent as ML and AI applications proliferated across various teams, leading to a maze of maintenance complexities and hindering knowledge transfer and innovation.

To address these challenges, the organization developed an MLOps platform based on four key open-source tools: Airflow, Feast, dbt, and MLflow. Hosted on Amazon ECS with tasks run on Fargate, this platform streamlines the end-to-end ML workflow, from data ingestion to model deployment. This blog post delves into the details of this MLOps platform, exploring how the integration of these tools facilitates a more efficient and scalable approach to managing ML projects.

Architecture overview

Our MLOps architecture is designed to automate and monitor all stages of the ML lifecycle. At its core, it integrates:

  • Airflow for workflow orchestration
  • Feast for feature management
  • dbt for accelerated data transformation
  • MLflow for experiment tracking and model management

These components interact within the Amazon ECS environment, providing a scalable and serverless platform where ML workflows are run in containers using Fargate. This setup not only simplifies infrastructure management, but also ensures that resources are used efficiently, scaling up or down as needed.

The following figure shows the MLOps architecture.

Architectural deep dive

The following details dive deep into each of the components used in this architecture.

Airflow for workflow orchestration

Airflow schedules and manages complex workflows, defining tasks and dependencies in Python code. An example direct acyclic graph (DAG) might automate data ingestion, processing, model training, and deployment tasks, ensuring that each step is run in the correct order and at the right time.
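
The following is an illustrative DAG skeleton only, with EmptyOperator placeholders standing in for the actual Amazon ECS tasks described next; it assumes a recent Airflow 2.x release.

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Each placeholder below stands in for an Amazon ECS task run on Fargate.
with DAG(
    dag_id="ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest_data")
    features = EmptyOperator(task_id="build_features")
    train = EmptyOperator(task_id="train_model")
    deploy = EmptyOperator(task_id="deploy_model")

    # Dependencies define the order of execution
    ingest >> features >> train >> deploy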

It’s worth noting, though, that Airflow isn’t used to run the workloads itself, as is usual for extract, transform, and load (ETL) tasks. Instead, every Airflow task launches an Amazon ECS task with some overrides. Additionally, we’re using a custom Airflow operator called ECSTaskLogOperator that allows us to process Amazon CloudWatch logs using downstream systems.

# ECSTaskLogOperator is the custom operator described above (defined elsewhere in our codebase)
model_training = ECSTaskLogOperator(
    task_id=<...>,
    task_definition=<...>,
    cluster=<...>,
    launch_type="FARGATE",
    aws_conn_id=<...>,
    overrides={
        "containerOverrides": [
            {
                "name": "<...>",
                # Pass the MLflow tracking server URL into the container
                "environment": [
                    {
                        "name": "MLFLOW_TRACKING_URI",
                        "value": "<...>",
                    },
                ],
                # Run the training entry point through the MLflow CLI
                "command": ["mlflow", "run", <...>],
            }
        ],
    },
)

Feast for feature management

Feast acts as a central repository for storing and serving features, ensuring that models in both training and production environments use consistent and up-to-date data. It simplifies feature access for model training and inference, significantly reducing the time and complexity involved in managing data pipelines.

Additionally, Feast promotes feature reuse, so the time spent on data preparation is reduced greatly.

from datetime import timedelta
from feast import Entity, FeatureView, FeatureService, Field, SnowflakeSource
from feast.types import Float64

# Entities are the join keys used to look up feature values
entities = [
    Entity(name="site_id", join_keys=["SITE_ID"]),
    Entity(name="user_id", join_keys=["USER_ID"]),
]

def create_feature_view(name, table, field_name, schema_name):
    # Helper that builds a FeatureView backed by a Snowflake table
    return FeatureView(
        name=name,
        entities=entities,
        ttl=timedelta(days=30),
        schema=[Field(name=field_name, dtype=Float64)],
        source=SnowflakeSource(
            database="<...>",
            schema="<...>",
            table=table,
            timestamp_field="<...>",
        ),
        tags="<...>",
    )

feature_view_1 = create_feature_view("<...>")
feature_view_2 = create_feature_view("<...>")

# A FeatureService groups feature views so models can request them together
my_feature_service = FeatureService(
    name="my_feature_service",
    features=[feature_view_1, feature_view_2],
    description="""

    This is my Feature Service

    """,
    owner="<...>",
)

dbt for data transformation

dbt is used for transforming data within the data warehouse, allowing data teams to define complex data models in SQL. It promotes a disciplined approach to data modeling, making it easier to ensure data quality and consistency across the ML pipelines. Moreover, it provides a straightforward way to track data lineage, so we can foresee which datasets will be affected by newly introduced changes. The following figure shows a schema definition and a model that references it.

MLflow for experiment tracking and model management

MLflow tracks experiments and manages models. It provides a unified interface for logging parameters, code versions, metrics, and artifacts, making it easier to compare experiments and manage the model lifecycle.

Similarly to Airflow, MLflow is also only used partially. The main parts we use are the tracking server and the model registry. In our experience, the artifact server has some limitations, such as limits on artifact size (because artifacts are sent using the REST API). As a result, we opted to use it only partially.

We don’t extensively use the deployment capabilities of MLflow, because in our current setup, we build custom inference containers.
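
For context, the tracking pieces we rely on boil down to a few MLflow calls like the following. This is a generic sketch, not Zeta’s actual code; if MLFLOW_TRACKING_URI is unset, MLflow writes to a local ./mlruns directory, whereas our Airflow tasks inject the tracking server URL through that environment variable, as shown earlier.

import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.05)  # hyperparameters for the run
    mlflow.log_metric("auc", 0.91)           # evaluation metrics used to compare runs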

Hosting on Amazon ECS with Fargate

Amazon ECS offers a highly scalable and secure environment for running containerized applications. Fargate eliminates the need for managing underlying infrastructure, allowing us to focus solely on deploying and running the containers. This abstraction layer simplifies the deployment process, enabling seamless scaling based on workload demands while optimizing resource utilization and cost efficiency.

We found it optimal to run on Fargate components of our ML workflows that don’t require GPUs or distributed processing. These include dbt pipelines, data gathering jobs, training, evaluation, and batch inference jobs for smaller models.

Furthermore, Amazon ECS and Fargate seamlessly integrate with other AWS services, such as Amazon Elastic Container Registry (Amazon ECR) for container image management and AWS Systems Manager Parameter Store for securely storing and managing secrets and configurations. Using Parameter Store, we can centralize configuration settings, such as database connection strings, API keys, and environment variables, eliminating the need for hardcoding sensitive information within container images. This enhances security and simplifies maintenance, because secrets and configuration values can be dynamically retrieved by containers at runtime, ensuring consistency across deployments.
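
As an illustration, the following sketch shows how a container might pull its configuration from Parameter Store at startup. The parameter name is a hypothetical placeholder.

import boto3

ssm = boto3.client("ssm")

def get_config_value(parameter_name: str) -> str:
    # SecureString parameters are decrypted on retrieval when WithDecryption=True.
    response = ssm.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]

# Hypothetical parameter name; in practice this would come from an environment variable.
database_url = get_config_value("/mlops/prod/database-connection-string")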

Moreover, integrating Amazon ECS and Fargate with CloudWatch enables comprehensive monitoring and logging capabilities for containerized tasks. This can be achieved by enabling the awslogs log driver within the logConfiguration parameters of the task definitions.
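
For reference, the following sketch registers a minimal Fargate task definition with boto3 and enables the awslogs driver in its logConfiguration. The family name, container image, role ARN, environment value, and log group are hypothetical placeholders.

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="ml-batch-inference",  # hypothetical family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="4096",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "inference",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/inference:latest",  # placeholder
            "essential": True,
            "environment": [
                {"name": "MLFLOW_TRACKING_URI", "value": "https://mlflow.example.com"}  # placeholder
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/ml-batch-inference",  # placeholder log group
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "inference",
                },
            },
        }
    ],
)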

Why ECS with Fargate is the solution of choice

  1. Serverless model:
    • No infrastructure management: With Fargate, we don’t need to provision, configure, or manage servers. This simplifies operations and reduces the operational overhead, allowing teams to focus on developing and deploying applications.
    • Automatic scaling: Fargate automatically scales our applications based on demand, ensuring optimal performance without manual intervention.
  2. Cost efficiency:
    • Pay-as-you-go: Fargate charges are based on the resources (vCPU and memory) that the containers use. This model can be more cost-effective compared to maintaining idle resources.
    • No over-provisioning: Because we only pay for what we use, there’s no need to over-provision resources, which can lead to cost savings.
  3. Enhanced security:
    • Isolation: Each Fargate task runs in its own isolated environment, improving security. There’s no sharing of underlying compute resources with other tenants.
  4. Integration with the AWS ecosystem: Amazon ECS and Fargate integrate natively with other AWS services such as Amazon ECR, Parameter Store, and CloudWatch, as described earlier in this section.

Configuring Amazon ECS with Fargate for ML workloads

Configuring Amazon ECS with Fargate for ML workloads involves the following steps.

  1. Docker images: ML models and applications are containerized using Docker. This includes all dependencies, libraries, and configurations needed to run the ML workload.
  2. Creating task definitions:
    • Define resources: Create an Amazon ECS task definition specifying the Docker image, required vCPU, memory, and other configurations.
    • Environment variables: Set environment variables, such as model paths, API keys, and other necessary parameters.
  3. IAM roles: Assign appropriate AWS Identity and Access Management (IAM) roles to the tasks for accessing other AWS resources securely.
  4. Logging using CloudWatch: Use CloudWatch for logging and monitoring the performance and health of ML workloads.

Future development and addressing emerging challenges

As the field of MLOps continues to evolve, it’s essential to anticipate and address upcoming challenges to ensure that the platform remains efficient, scalable, and user-friendly. Two primary areas of future development for our platform include:

  1. Enhancing bring your own model (BYOM) capabilities for external clients
  2. Reducing the learning curve for data scientists

This section outlines those challenges and proposes directions for future enhancements.

Enhancing BYOM capabilities

As machine learning becomes more democratized, there is a growing need for platforms to easily integrate models developed externally by Zeta’s clients.

Future directions:

  • Developing standardized APIs: Implement APIs that allow for easy integration of external models, regardless of the framework or language they were developed in. This would involve creating a set of standardized interfaces for model ingestion, validation, and deployment.
  • Creating a model adapter framework: Design a framework that can adapt external models to be compatible with the platform’s infrastructure, ensuring that they can be managed, tracked, and deployed just like internally developed models.
  • Enhancing documentation and support: Provide comprehensive documentation and support resources to guide external clients through the BYOM process, including best practices for model preparation, integration, and optimization.

Reducing the learning curve for data scientists

The incorporation of multiple specialized tools (Airflow, Feast, dbt, and MLflow) into the MLOps pipeline can present a steep learning curve for data scientists, potentially hindering their productivity and the overall efficiency of the ML development process.

Future directions:

We’ll do the following to help reduce the learning curve:

  • Creating unified interfaces: Develop a unified interface, including UI, API, and SDK, that abstracts away the complexities of interacting with each tool individually. This interface could provide simplified workflows, automating routine tasks and presenting a cohesive view of the entire ML lifecycle.
  • Offering comprehensive training and resources: Invest in training programs and resources tailored to data scientists at different skill levels. This could include interactive tutorials, workshops, and detailed case studies showcasing real-world applications of the platform.

Conclusion

Integrating Airflow, Feast, dbt, and MLflow into an MLOps platform hosted on Amazon ECS with AWS Fargate presents a robust solution for managing the ML lifecycle. This setup not only streamlines operations but also enhances scalability and efficiency, allowing data science teams to focus on innovation rather than infrastructure management.

Additional Resources

For those looking to dive deeper, we recommend exploring the official documentation and tutorials for each tool (Airflow, Feast, dbt, and MLflow) and for Amazon ECS. These resources are invaluable for understanding the capabilities and configurations of each component in our MLOps platform.


About the authors

Varad Ram holds the position of Senior Solutions Architect at Amazon Web Services. He possesses extensive experience encompassing application development, cloud migration strategies, and information technology team management. Recently, his primary focus has shifted towards assisting clients in navigating the process of productizing generative artificial intelligence use cases.

Artem Sysuev is a Lead Machine Learning Engineer at Zeta, passionate about creating efficient, scalable solutions. He believes that effective processes are key to success, which led him to focus on both machine learning and MLOps. Starting with machine learning, Artem developed skills in building predictive models. Over time, he saw the need for strong operational frameworks to deploy and maintain these models at scale, which drew him to MLOps. At Zeta, he drives innovation by automating workflows and improving collaboration, ensuring smooth integration of machine learning models into production systems.

Saurabh Gupta is a Principal Engineer at Zeta Global. He is passionate about machine learning engineering, distributed systems, and big-data technologies. He has built scalable platforms that empower data scientists and data engineers, focusing on low-latency, resilient systems that streamline workflows and drive innovation. He holds a B.Tech degree in Electronics and Communication Engineering from the Indian Institute of Technology (IIT), Guwahati, and has deep expertise in designing data-driven solutions that support advanced analytics and machine learning initiatives.

Matúš Chládek is a Senior Engineering Manager for ML Ops at Zeta Global. With a career that began in Data Science, Matúš has developed a strong foundation in analytics and machine learning. Over the years, Matúš transitioned into more engineering-focused roles, eventually becoming a Machine Learning Engineer before moving into Engineering Management. Matúš’s leadership focuses on building robust, scalable infrastructure that streamlines workflows and supports rapid iteration and production-ready delivery of machine learning projects. Matúš is passionate about driving innovation at the intersection of Data Science and Engineering, making advanced analytics accessible and scalable for internal users and clients alike.

Dr. Danny Portman is a recognized thought leader in AI and machine learning, with over 30 patents focused on Deep Learning and Generative AI applications in advertising and marketing technology. He holds a Ph.D. in Computational Physics, specializing in high-performance computing models for simulating complex astrophysical systems. With a strong background in quantitative research, Danny brings a wealth of experience in applying data-driven approaches to solve problems across various sectors. As VP of Data Science and Head of AI/ML at Zeta Global, Dr. Portman leads the development of AI-driven products and strategies, and spearheads the company’s cutting-edge Generative AI R&D efforts to deliver innovative solutions for marketers.

Read More

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

The post is co-written with Michael Shaul and Sasha Korman from NetApp.

Generative artificial intelligence (AI) applications are commonly built using a technique called Retrieval Augmented Generation (RAG) that provides foundation models (FMs) access to additional data they didn’t have during training. This data is used to enrich the generative AI prompt to deliver more context-specific and accurate responses without continuously retraining the FM, while also improving transparency and minimizing hallucinations.

In this post, we demonstrate a solution using Amazon FSx for NetApp ONTAP with Amazon Bedrock to provide a RAG experience for your generative AI applications on AWS by bringing company-specific, unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way.

Our solution uses an FSx for ONTAP file system as the source of unstructured data and continuously populates an Amazon OpenSearch Serverless vector database with the user’s existing files and folders and associated metadata. This enables a RAG scenario with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data retrieved from the OpenSearch Serverless vector database.

When developing generative AI applications such as a Q&A chatbot using RAG, customers are also concerned about keeping their data secure and preventing end-users from querying information from unauthorized data sources. Our solution also uses FSx for ONTAP to allow users to extend their current data security and access mechanisms to augment model responses from Amazon Bedrock. We use FSx for ONTAP as the source of associated metadata, specifically the user’s security access control list (ACL) configurations attached to their files and folders and populate that metadata into OpenSearch Serverless. By combining access control operations with file events that notify the RAG application of new and changed data on the file system, our solution demonstrates how FSx for ONTAP enables Amazon Bedrock to only use embeddings from authorized files for the specific users that connect to our generative AI application.

AWS serverless services make it straightforward to focus on building generative AI applications by providing automatic scaling, built-in high availability, and a pay-for-use billing model. Event-driven compute with AWS Lambda is a good fit for compute-intensive, on-demand tasks such as document embedding and flexible large language model (LLM) orchestration, and Amazon API Gateway provides an API interface that allows for pluggable frontends and event-driven invocation of the LLMs. Our solution also demonstrates how to build a scalable, automated, API-driven serverless application layer on top of Amazon Bedrock and FSx for ONTAP using API Gateway and Lambda.

Solution overview

The solution provisions an FSx for ONTAP Multi-AZ file system with a storage virtual machine (SVM) joined to an AWS Managed Microsoft AD domain. An OpenSearch Serverless vector search collection provides a scalable and high-performance similarity search capability. We use an Amazon Elastic Compute Cloud (Amazon EC2) Windows server as an SMB/CIFS client to the FSx for ONTAP volume and configure data sharing and ACLs for the SMB shares in the volume. We use this data and ACLs to test permissions-based access to the embeddings in a RAG scenario with Amazon Bedrock.

The embeddings container component of our solution is deployed on an EC2 Linux server and mounted as an NFS client on the FSx for ONTAP volume. It periodically migrates existing files and folders along with their security ACL configurations to OpenSearch Serverless. It populates an index in the OpenSearch Serverless vector search collection with company-specific data (and associated metadata and ACLs) from the NFS share on the FSx for ONTAP file system.

The solution implements a RAG Retrieval Lambda function that allows RAG with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data and associated metadata (including ACLs) retrieved from the OpenSearch Serverless index that was populated by the embeddings container component. The RAG Retrieval Lambda function stores conversation history for the user interaction in an Amazon DynamoDB table.
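
The following is a minimal sketch of how one turn of that conversation history might be persisted per session. The table name and item attributes are hypothetical and don’t reflect the exact schema used by the solution.

import time
import boto3

# Hypothetical table name; the actual table is provisioned by the solution's Terraform template.
history_table = boto3.resource("dynamodb").Table("rag-conversation-history")

def save_turn(session_id: str, prompt: str, answer: str) -> None:
    # Store one question/answer turn keyed by session so follow-up prompts can reuse context.
    history_table.put_item(
        Item={
            "session_id": session_id,
            "timestamp": int(time.time()),
            "prompt": prompt,
            "answer": answer,
        }
    )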

End-users interact with the solution by submitting a natural language prompt either through a chatbot application or directly through the API Gateway interface. The chatbot application container is built using Streamlit and fronted by an AWS Application Load Balancer (ALB). When a user submits a natural language prompt to the chatbot UI using the ALB, the chatbot container interacts with the API Gateway interface that then invokes the RAG Retrieval Lambda function to fetch the response for the user. The user can also directly submit prompt requests to API Gateway and obtain a response. We demonstrate permissions-based access to the RAG documents by explicitly retrieving the SID of a user and then using that SID in the chatbot or API Gateway request, where the RAG Retrieval Lambda function then matches the SID to the Windows ACLs configured for the document. As an additional authentication step in a production environment, you may want to also authenticate the user against an identity provider and then match the user against the permissions configured for the documents.

The following diagram illustrates the end-to-end flow for our solution. We start by configuring data sharing and ACLs with FSx for ONTAP, and then these are periodically scanned by the embeddings container. The embeddings container splits the documents into chunks and uses the Amazon Titan Embeddings model to create vector embeddings from these chunks. It then stores these vector embeddings with associated metadata in our vector database by populating an index in a vector collection in OpenSearch Serverless.
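
As an illustration of this step, the following sketch chunks a document and calls the Amazon Titan Embeddings model through the Amazon Bedrock runtime API. The fixed-size chunking and the final indexing comment are simplified assumptions, not the exact logic of the embeddings container.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; the embeddings container may use a more sophisticated splitter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed_chunk(chunk: str) -> list[float]:
    # Invoke the Amazon Titan Embeddings G1 - Text model to get a vector for the chunk.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": chunk}),
    )
    return json.loads(response["body"].read())["embedding"]

# Each chunk's vector would then be written to the OpenSearch Serverless vector index
# together with the file's metadata and ACLs.
vectors = [embed_chunk(c) for c in chunk_text("sample document text")]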


The following architecture diagram illustrates the various components of our solution.

Prerequisites

Complete the following prerequisite steps:

  1. Make sure you have model access in Amazon Bedrock. In this solution, we use Anthropic Claude v3 Sonnet on Amazon Bedrock.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Install Docker.
  4. Install Terraform.

Deploy the solution

The solution is available for download on this GitHub repo. Cloning the repository and using the Terraform template will provision all the components with their required configurations.

  1. Clone the repository for this solution:
    sudo yum install -y unzip
    git clone https://github.com/aws-samples/genai-bedrock-fsxontap.git
    cd genai-bedrock-fsxontap/terraform

  2. From the terraform folder, deploy the entire solution using Terraform:
    terraform init
    terraform apply -auto-approve

This process can take 15–20 minutes to complete. When finished, the output of the terraform commands should look like the following:

api-invoke-url = "https://9ng1jjn8qi.execute-api.<region>.amazonaws.com/prod"
fsx-management-ip = toset([
"198.19.255.230",])
fsx-secret-id = "arn:aws:secretsmanager:<region>:<account-id>:secret:AmazonBedrock-FSx-NetAPP-ONTAP-a2fZEdIt-0fBcS9"
fsx-svm-smb-dns-name = "BRSVM.BEDROCK-01.COM"
lb-dns-name = "chat-load-balancer-2040177936.<region>.elb.amazonaws.com"

Load data and set permissions

To test the solution, we will use the EC2 Windows server (ad_host) mounted as an SMB/CIFS client to the FSx for ONTAP volume to share sample data and set user permissions that will then be used to populate the OpenSearch Serverless index by the solution’s embedding container component. Perform the following steps to mount your FSx for ONTAP SVM data volume as a network drive, upload data to this shared network drive, and set permissions based on Windows ACLs:

  1. Obtain the ad_host instance DNS from the output of your Terraform template.
  2. Navigate to AWS Systems Manager Fleet Manager on the AWS Management Console, locate the ad_host instance, and follow the instructions here to log in with Remote Desktop. Use the domain admin user bedrock-01\Admin and obtain the password from AWS Secrets Manager. You can find the password using the Secrets Manager fsx-secret-id secret ID from the output of your Terraform template.
  3. To mount an FSx for ONTAP data volume as a network drive, under This PC, choose (right-click) Network and then choose Map Network drive.
  4. Choose the drive letter and use the FSx for ONTAP share path for the mount (\\<svm>.<domain>\c$\<volume-name>).
  5. Upload the Amazon Bedrock User Guide to the shared network drive and set permissions to the admin user only (make sure that you disable inheritance under Advanced).
  6. Upload the Amazon FSx for ONTAP User Guide to the shared drive and make sure permissions are set to Everyone.
  7. On the ad_host server, open the command prompt and enter the following command to obtain the SID for the admin user:
    wmic useraccount where name='Admin' get sid

Test permissions using the chatbot

To test permissions using the chatbot, obtain the lb-dns-name URL from the output of your Terraform template and access it through your web browser:


For the prompt query, ask any general question about the FSx for ONTAP User Guide, which is available for access to everyone. In our scenario, we asked “How can I create an FSx for ONTAP file system,” and the model replied with detailed steps and source attribution in the chat window for creating an FSx for ONTAP file system using the AWS Management Console, AWS CLI, or FSx API.

Now, let’s ask a question about the Amazon Bedrock User Guide, which is available for admin access only. In our scenario, we asked “How do I use foundation models with Amazon Bedrock,” and the model replied that it doesn’t have enough information to provide a detailed answer to the question.

Use the admin SID on the user (SID) filter search in the chat UI and ask the same question in the prompt. This time, the model should reply with steps detailing how to use FMs with Amazon Bedrock and provide the source attribution used by the model for the response.

Test permissions using API Gateway

You can also query the model directly using API Gateway. Obtain the api-invoke-url parameter from the output of your Terraform template. First, invoke the API with Everyone access for a query related to the FSx for ONTAP User Guide by setting the value of the metadata parameter to NA:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "What is an FSxN ONTAP filesystem?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "NA", "memory_window": 10}'

Then invoke the API Gateway with admin access for a query related to the Amazon Bedrock User Guide by setting the value of the metadata parameter to the admin user’s SID:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1","prompt": "what is bedrock?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "S-1-5-21-4037439088-1296877785-2872080499-1112", "memory_window": 10}'

Cleanup

To avoid recurring charges, clean up your account after trying the solution. From the terraform folder, destroy the resources provisioned by the Terraform template:

terraform apply --destroy

Conclusion

In this post, we demonstrated a solution that uses FSx for ONTAP with Amazon Bedrock and uses FSx for ONTAP support for file ownership and ACLs to provide permissions-based access in a RAG scenario for generative AI applications. Our solution enables you to build generative AI applications with Amazon Bedrock where you can enrich the generative AI prompt in Amazon Bedrock with your company-specific, unstructured user file data from an FSx for ONTAP file system. This solution enables you to deliver more relevant, context-specific, and accurate responses while also making sure only authorized users have access to that data. Finally, the solution demonstrates the use of AWS serverless services with FSx for ONTAP and Amazon Bedrock that enable automatic scaling, event-driven compute, and API interfaces for your generative AI applications on AWS.

For more information about how to get started building with Amazon Bedrock and FSx for ONTAP, refer to the following resources:


About the authors

Kanishk Mahajan is Principal, Solutions Architecture at AWS. He leads cloud transformation and solution architecture for ISV customers and partners at AWS. Kanishk specializes in containers, cloud operations, migrations and modernization, AI/ML, resilience, and security and compliance. He is a Technical Field Community (TFC) member in each of those domains at AWS.

Michael Shaul is a Principal Architect at NetApp’s office of the CTO. He has over 20 years of experience building data management systems, applications, and infrastructure solutions. He has a unique, in-depth perspective on cloud technologies, builders, and AI solutions.

Sasha Korman is a tech visionary leader of dynamic development and QA teams across Israel and India. With 14 years at NetApp, beginning as a programmer, his hands-on experience and leadership have been pivotal in steering complex projects to success, with a focus on innovation, scalability, and reliability.

Read More

Support for AWS DeepComposer ending soon

Support for AWS DeepComposer ending soon

AWS DeepComposer was first introduced during AWS re:Invent 2019 as a fun way for developers to compose music by using generative AI. AWS DeepComposer was the world’s first machine learning (ML)-enabled keyboard for developers to get hands-on—literally—with a musical keyboard and the latest ML techniques to compose their own music.

After careful consideration, we have made the decision to end support for AWS DeepComposer, effective September 17, 2025. With your help and feedback, our portfolio of products and services has grown to include new tools for developers to get hands-on with AI and ML. Amazon PartyRock, for example, is a generative AI playground for intuitive, code-free help in building web applications.

If you have data stored on the AWS DeepComposer console, you will be able to use AWS DeepComposer as normal until September 17, 2025, when support for the service will end. After this date, you will no longer be able to use AWS DeepComposer through the AWS Management Console, manage AWS DeepComposer devices, or access any compositions or models you have created. Until then, you can continue to work on your compositions or models and export those you would like to keep by using the step-by-step guide in the AWS DeepComposer FAQs.

If you have additional questions, please read our FAQs or contact us.


About the author

Kanchan Jagannathan is a Sr. Program Manager in the AWS AI Devices team, where he helps launch AWS devices into sales channels and also oversees the Service Availability Change process for the team. He was a Program Manager for FC automation deployment and launches before joining AWS. Outside of work, he has begun to bravely endeavor camping with his 5-year-old and 1-year-old kids and enjoys the moments he gets to be with them.

Read More

Preserve access and explore alternatives for Amazon Lookout for Equipment

Preserve access and explore alternatives for Amazon Lookout for Equipment

Amazon Lookout for Equipment, the AWS machine learning (ML) service designed for industrial equipment predictive maintenance, will no longer be open to new customers effective October 17, 2024. Existing customers will be able to use the service as normal (through both the AWS Management Console and the API), and AWS will continue to invest in security, availability, and performance improvements for Lookout for Equipment, but we do not plan to introduce new features for this service.

This post discusses how you can maintain access to Lookout for Equipment after it is closed to new customers and some alternatives to Lookout for Equipment.

Maintaining access to Lookout for Equipment

You’re considered an existing customer if you use the service, either through cloud training or cloud inferencing, any time in the 30 days prior to October 17, 2024 (September 17, 2024, through October 16, 2024). To maintain access to the service after October 17, 2024, you should complete one of the following tasks from the account for which you intend to maintain access:

  • On the Lookout for Equipment console, start a new project and successfully complete a model training
  • On the Lookout for Equipment console, open an existing project, schedule an inference for a given model, and run at least one inference
  • Use the Lookout for Equipment API calls CreateInferenceScheduler and StartInferenceScheduler (and StopInferenceScheduler when done), as shown in the sketch after this list
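
The following is a minimal boto3 sketch of those API calls. The model name, scheduler name, S3 locations, and role ARN are hypothetical placeholders, and the data configuration shown is a simplified assumption rather than a complete setup.

import boto3

lookout = boto3.client("lookoutequipment")

# Create an inference scheduler for an existing trained model (placeholder names and ARNs).
lookout.create_inference_scheduler(
    ModelName="my-pump-model",
    InferenceSchedulerName="my-pump-scheduler",
    DataUploadFrequency="PT1H",
    DataInputConfiguration={
        "S3InputConfiguration": {"Bucket": "my-input-bucket", "Prefix": "sensor-data/"}
    },
    DataOutputConfiguration={
        "S3OutputConfiguration": {"Bucket": "my-output-bucket", "Prefix": "inference-results/"}
    },
    RoleArn="arn:aws:iam::123456789012:role/LookoutEquipmentRole",
)

# Start the scheduler so that at least one inference runs, then stop it when done.
lookout.start_inference_scheduler(InferenceSchedulerName="my-pump-scheduler")
lookout.stop_inference_scheduler(InferenceSchedulerName="my-pump-scheduler")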

For any questions or support needed, contact your assigned AWS Account Manager or Solutions Architect, or create a case from the AWS console.

Alternatives to Lookout for Equipment

If you’re interested in an alternative to Lookout for Equipment, AWS has options for both buyers and builders.

For an out-of-the-box solution, the AWS Partner Network offers solutions from multiple partners. You can browse solutions on the Asset Maintenance and Reliability page in the AWS Solutions Library. This approach provides a solution that addresses your use case without requiring you to have expertise in predictive maintenance, and typically provides the fastest time to value by using the specialized expertise of the AWS Partners.

If you prefer to build your own solution, AWS offers AI/ML tools and services to help you develop an AI-based predictive maintenance solution. Amazon SageMaker provides a set of tools to build, train, and deploy ML models and run inference for your use case with fully managed infrastructure, tools, and workflows.

Summary

Although new customers will no longer have access to Lookout for Equipment after October 17, 2024, AWS offers a powerful set of AI/ML services and solutions in the form of SageMaker tools to build custom models, and also offers a range of solutions from partners through the AWS Partner Network. You should explore these options to determine what works best for your specific needs.

For more details, refer to the following resources:


About the author

Stuart Gillen is a Sr. Product Manager, Lookout for Equipment, at AWS. Stuart has held a variety of roles in engineering management, business development, product management, and consulting. Most of his career has been focused on industrial applications specifically in reliability practices, maintenance systems, and manufacturing.
Stuart is the Product Manager for Lookout for Equipment at AWS where he utilizes his industrial and AI background in applications focusing on Predictive Maintenance and Condition Monitoring.

Read More

CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

The clustered regularly interspaced short palindromic repeat (CRISPR) technology holds the promise of revolutionizing gene editing, which is transformative to the way we understand and treat diseases. This technique is based on a natural mechanism found in bacteria that allows a protein coupled to a single guide RNA (gRNA) strand to locate and make cuts in specific sites in the targeted genome. Being able to computationally predict the efficiency and specificity of gRNA is central to the success of gene editing.

Transcribed from DNA sequences, RNA is an important type of biological sequence of ribonucleotides (A, U, G, C), which folds into a 3D structure. Benefiting from recent advances in large language models (LLMs), a variety of computational biology tasks can be solved by fine-tuning biological LLMs pre-trained on billions of known biological sequences. Downstream tasks on RNAs, however, are relatively understudied.

In this post, we adopt a pre-trained genomic LLM for gRNA efficiency prediction. The idea is to treat a computer-designed gRNA as a sentence and fine-tune the LLM to perform sentence-level regression tasks analogous to sentiment analysis. We used Parameter-Efficient Fine-Tuning methods to reduce the number of trainable parameters and GPU usage for this task.

Solution overview

Large language models (LLMs) have gained a lot of interest for their ability to encode the syntax and semantics of natural languages. The neural architecture behind LLMs is the transformer, which comprises attention-based encoder-decoder blocks that generate an internal representation of the data they are trained on (encoder) and are able to generate sequences in the same latent space that resemble the original data (decoder). Due to their success in natural language, recent works have explored the use of LLMs for molecular biology information, which is sequential in nature.

DNABERT is a transformer model pre-trained on non-overlapping human DNA sequence data. The backbone is a BERT architecture made up of 12 encoding layers. The authors of this model report that DNABERT is able to capture a good feature representation of the human genome that enables state-of-the-art performance on downstream tasks like promoter prediction and splice/binding site identification. We decided to use this model as the foundation for our experiments.

Despite the success and popular adoption of LLMs, fine-tuning these models can be difficult because of the number of parameters and the computation required. For this reason, Parameter-Efficient Fine-Tuning (PEFT) methods have been developed. In this post, we use one of these methods, called LoRA (Low-Rank Adaptation). We introduce the method in the following sections.

The following diagram is a representation of the Cas9 DNA target mechanism. The gRNA is the component that helps target the cleavage site.

The goal of this solution is to fine-tune a base DNABERT model to predict activity efficiency from different gRNA candidates. As such, our solution first takes gRNA data and processes it, as described later in this post. Then we use an Amazon SageMaker notebook and the Hugging Face PEFT library to fine-tune the DNABERT model with the processed RNA data. The label we want to predict is the efficiency score as it was calculated in experiments testing the actual RNA sequences in cell cultures. Those scores describe a balance between being able to edit the genome and not damaging DNA that wasn’t targeted.

The following diagram illustrates the workflow of the proposed solution.

Prerequisites

For this solution, you need access to the following:

  • A SageMaker notebook instance (we trained the model on an ml.g4dn.8xlarge instance with a single NVIDIA T4 GPU)
  • transformers-4.34.1
  • peft-0.5.0
  • DNABERT 6

Dataset

For this post, we use the gRNA data released by researchers in a paper about gRNA prediction using deep learning. This dataset contains efficiency scores calculated for different gRNAs. In this section, we describe the process we followed to create the training and evaluation datasets for this task.

To train the model, you need a 30-mer gRNA sequence and efficiency score. A k-mer is a contiguous sequence of k nucleotide bases extracted from a longer DNA or RNA sequence. For example, if you have the DNA sequence “ATCGATCG” and you choose k = 3, then the k-mers within this sequence would be “ATC,” “TCG,” “CGA,” “GAT,” “ATC,” and “TCG.”
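
To make this concrete, the following snippet extracts k-mers from a sequence. It is a small illustrative helper, not code from the released repository.

def get_kmers(sequence: str, k: int) -> list[str]:
    # Slide a window of length k across the sequence, one base at a time.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(get_kmers("ATCGATCG", 3))
# ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']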

Efficiency score

Start with the Excel file 41467_2021_23576_MOESM4_ESM.xlsx from the Supplementary Data 1 section of the CRISPRon paper. In this file, the authors released the gRNA (20-mer) sequences and the corresponding total_indel_eff scores. We specifically used the data from the sheet named spCas9_eff_D10+dox and the total_indel_eff column as the efficiency score.
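
For example, the relevant sheet and column might be loaded with pandas as in the following sketch. The gRNA column name is an assumption based on the description above and should be verified against the spreadsheet.

import pandas as pd

# Load the supplementary spreadsheet and select the sheet with spCas9 efficiency scores.
df = pd.read_excel("41467_2021_23576_MOESM4_ESM.xlsx", sheet_name="spCas9_eff_D10+dox")

# Keep the 20-mer gRNA sequences and their total_indel_eff efficiency scores
# ("gRNA" is a hypothetical column name; total_indel_eff is named in the paper).
efficiency = df[["gRNA", "total_indel_eff"]]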

Training and validation data

Given the 20-mers and the CRISPRon scores (the same as the total_indel_eff scores) from earlier, complete the following steps to put together the training and validation data:

  1. Convert the sequences in the sheet “TRAP12K microarray oligos” into an .fa (fasta) file.
  2. Run the script get_30mers_from_fa.py (from the CRISPRon GitHub repository) to obtain all possible 23-mers and 30-mers from the sequences obtained from Step 1.
  3. Use the CRISPRspec_CRISPRoff_pipeline.py script (from the CRISPRon GitHub repository) to obtain the binding energy for the 23-mers obtained from Step 2. For more details on how to run this script, check out the code released by the authors of the CRISPRon paper (see the script CRISPRon.sh).
  4. At this point, we have 23-mers along with the corresponding binding energy scores, and 20-mers along with the corresponding CRISPRon scores. Additionally, we have the 30-mers from Step 2.
  5. Use the script prepare_train_dev_data.py (from our released code) to create training and validation splits. Running this script will create two files: train.csv and dev.csv.

The data looks something like the following:

id,rna,crisproff_score,crispron_score
seq2875_p_129,GTCCAGCCACCGAGACCCTGTGTATGGCAC,24.74484099890205,85.96491228
seq2972_p_129,AAAGGCGAAGCAGTATGTTCTAAAAGGAGG,17.216228493196073,94.81132075
. . .
. . .

Model architecture for gRNA encoding

To encode the gRNA sequence, we used the DNABERT encoder. DNABERT was pre-trained on human genomic data, so it’s a good model to encode gRNA sequences. DNABERT tokenizes the nucleotide sequence into overlapping k-mers, and each k-mer serves as a word in the DNABERT model’s vocabulary. The gRNA sequence is broken into a sequence of k-mers, and then each k-mer is replaced by an embedding for the k-mer at the input layer. Otherwise, the architecture of DNABERT is similar to that of BERT. After we encode the gRNA, we use the representation of the [CLS] token as the final encoding of the gRNA sequence. To predict the efficiency score, we use an additional regression layer. The MSE loss will be the training objective. The following is a code snippet of the DNABertForSequenceClassification model:

from typing import Optional, Tuple, Union

import torch
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers import BertModel, BertPreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput


class DNABertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config
        
        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout
            if config.classifier_dropout is not None
            else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (
                    labels.dtype == torch.long or labels.dtype == torch.int
                ):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Fine-tuning and prompting genomic LLMs

Fine-tuning all the parameters of a model is expensive because pre-trained models have become so large. LoRA is an innovative technique developed to address the challenge of fine-tuning extremely large language models. LoRA keeps the pre-trained model’s weights fixed while introducing trainable layers (referred to as rank-decomposition matrices) within each transformer block. This approach significantly reduces the number of parameters that need to be trained and lowers the GPU memory requirements, because most model weights don’t require gradient computations.

Therefore, we adopted LoRA as the PEFT method for the DNABERT model. LoRA is implemented in the Hugging Face PEFT library. When using PEFT to train a model with LoRA, the hyperparameters of the low-rank adaptation process and the way to wrap the base transformer model can be defined as follows:

from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    data_training_args.model_path,
    do_lower_case=False,
)

# DNABertForSequenceClassification is a model class for the sequence-level regression task,
# built on top of the DNABERT architecture.
model = DNABertForSequenceClassification.from_pretrained(
    data_training_args.model_path,
    config=config,
)

# Define the LoRA config
LORA_R = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
peft_config = LoraConfig(
    r=LORA_R,                   # the dimension of the low-rank matrices
    lora_alpha=LORA_ALPHA,      # scaling factor for the weight matrices
    lora_dropout=LORA_DROPOUT,  # dropout probability of the LoRA layers
    bias="none",
    task_type="SEQ_CLS",
)
model = get_peft_model(model, peft_config)
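
From here, fine-tuning might be launched with the Hugging Face Trainer as in the following sketch. The hyperparameter values are illustrative assumptions, and train_dataset and eval_dataset stand for tokenized datasets built from train.csv and dev.csv.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./dnabert-grna-lora",   # hypothetical output directory
    num_train_epochs=10,                # illustrative value
    per_device_train_batch_size=32,     # illustrative value
    learning_rate=3e-4,                 # illustrative value
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=50,
)

trainer = Trainer(
    model=model,                  # the PEFT-wrapped DNABERT model from above
    args=training_args,
    train_dataset=train_dataset,  # tokenized dataset built from train.csv
    eval_dataset=eval_dataset,    # tokenized dataset built from dev.csv
)
trainer.train()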

Hold-out evaluation performances

We use RMSE, MSE, and MAE as evaluation metrics, and we tested LoRA with rank 8 and rank 16. Furthermore, we implemented a simple fine-tuning baseline, which adds several dense layers after the DNABERT embeddings. The following table summarizes the results.

Method RMSE MSE MAE
LoRA (rank = 8) 11.933 142.397 7.014
LoRA (rank = 16) 13.039 170.01 7.157
One dense layer 15.435 238.265 9.351
Three dense layers 15.435 238.241 9.505
CRISPRon 11.788 138.971 7.134

When rank = 8, we have 296,450 trainable parameters, which is about 33 percent of the whole. The performance metrics are “rmse”: 11.933, “mse”: 142.397, “mae”: 7.014.

When rank = 16, we have 591,362 trainable parameters, which is about 66 percent of the whole. The performance metrics are “rmse”: 13.039, “mse”: 170.010, “mae”: 7.157. There might be some overfitting under this setting.

We also compare what happens when adding a few dense layers:

  • After adding one dense layer, we have “rmse”: 15.435, “mse”: 238.265, “mae”: 9.351
  • After adding three dense layers, we have “rmse”: 15.435, “mse”: 238.241, “mae”: 9.505

Lastly, we compare with the existing CRISPRon method. CRISPRon is a CNN-based deep learning model. The performance metrics are “rmse”: 11.788, “mse”: 138.971, “mae”: 7.134.

As expected, LoRA does much better than simply adding a few dense layers. Although the performance of LoRA is a bit worse than CRISPRon, with a thorough hyperparameter search it is likely to outperform CRISPRon.

When using SageMaker notebooks, you have the flexibility to save the work and data produced during the training, turn off the instance, and turn it back on when you’re ready to continue the work, without losing any artifacts. Turning off the instance will keep you from incurring costs on compute you’re not using. We highly recommend only turning it on when you’re actively using it.

Conclusion

In this post, we showed how to use PEFT methods for fine-tuning DNA language models using SageMaker. We focused on predicting efficiency of CRISPR-Cas9 RNA sequences for their impact in current gene-editing technologies. We also provided code that can help you jumpstart your biology applications in AWS.

To learn more about the healthcare and life science space, refer to Run AlphaFold v2.0 on Amazon EC2 or Fine-tune and deploy the ProtBERT model for protein classification using Amazon SageMaker.


About the Authors

Siddharth Varia is an applied scientist in Amazon Bedrock. He is broadly interested in natural language processing and has contributed to AWS products such as Amazon Comprehend. Outside of work, he enjoys exploring new places and reading. He got interested in this project after reading the book The Code Breaker.

Yudi Zhang is an Applied Scientist at AWS marketing. Her research interests are in the area of graph neural networks, natural language processing, and statistics.

Erika Pelaez Coyotl is a Sr Applied Scientist in Amazon Bedrock, where she’s currently helping develop the Amazon Titan large language model. Her background is in biomedical science, and she has helped several customers develop ML models in this vertical.

Zichen Wang is a Sr Applied Scientist in AWS AI Research & Education. He is interested in researching graph neural networks and applying AI to accelerate scientific discovery, specifically on molecules and simulations.

Rishita Anubhai is a Principal Applied Scientist in Amazon Bedrock. She has deep expertise in natural language processing and has contributed to AWS projects like Amazon Comprehend, Machine Learning Solutions Lab, and development of Amazon Titan models. She’s keenly interested in using machine learning research, specifically deep learning, to create tangible impact.

Read More

Improve RAG performance using Cohere Rerank

Improve RAG performance using Cohere Rerank

This post is co-written with Pradeep Prabhakaran from Cohere.

Retrieval Augmented Generation (RAG) is a powerful technique that can help enterprises develop generative artificial intelligence (AI) apps that integrate real-time data and enable rich, interactive conversations using proprietary data.

RAG allows these AI applications to tap into external, reliable sources of domain-specific knowledge, enriching the context for the language model as it answers user queries. However, the reliability and accuracy of the responses hinges on finding the right source materials. Therefore, honing the search process in RAG is crucial to boosting the trustworthiness of the generated responses.

RAG systems are important tools for building search and retrieval systems, but they often fall short of expectations due to suboptimal retrieval steps. This can be enhanced using a rerank step to improve search quality.

RAG is an approach that combines information retrieval techniques with natural language processing (NLP) to enhance the performance of text generation or language modeling tasks. This method involves retrieving relevant information from a large corpus of text data and using it to augment the generation process. The key idea is to incorporate external knowledge or context into the model to improve the accuracy, diversity, and relevance of the generated responses.

Workflow of RAG Orchestration

The RAG orchestration generally consists of two steps:

  1. Retrieval – RAG fetches relevant documents from an external data source using the generated search queries. When presented with the search queries, the RAG-based application searches the data source for relevant documents or passages.
  2. Grounded generation – Using the retrieved documents or passages, the generation model creates educated answers with inline citations using the fetched documents.

The following diagram shows the RAG workflow.

Document retrieval in RAG orchestration

One technique for retrieving documents in a RAG orchestration is dense retrieval, which is an approach to information retrieval that aims to understand the semantic meaning and intent behind user queries. Dense retrieval finds the documents closest to a user query in the embedding space, as shown in the following screenshot.

The goal of dense retrieval is to map both the user queries and documents (or passages) into a dense vector space. In this space, the similarity between the query and document vectors can be computed using standard distance metrics like cosine similarity or Euclidean distance. The documents closest to the semantic meaning of the user query, based on the calculated distance metrics, are then presented back to the user.
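
As a simple illustration of this scoring step, the following sketch ranks a handful of pre-computed document vectors against a query vector by cosine similarity. The vectors are toy values rather than real embeddings.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vector = np.array([0.2, 0.7, 0.1])    # toy query embedding
document_vectors = {
    "doc_a": np.array([0.1, 0.8, 0.0]),     # toy document embeddings
    "doc_b": np.array([0.9, 0.1, 0.3]),
    "doc_c": np.array([0.3, 0.6, 0.2]),
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(
    document_vectors.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
print(ranked)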

The quality of the final responses to search queries is significantly influenced by the relevance of the retrieved documents. While dense retrieval models are very efficient and can scale to large datasets, they struggle with more complex data and questions due to the simplicity of the method. Document vectors contain the meaning of text in a compressed representation, typically a vector of 768 to 1,536 dimensions. This often results in loss of information because the content is compressed into a single vector, and when documents are retrieved during a vector search, the most relevant information is not always presented at the top of the results.

Boost search accuracy with Cohere Rerank

To address the challenges with accuracy, search engineers have used two-stage retrieval as a means of increasing search quality. In these two-stage systems, a first-stage model (an embedding model or retriever) retrieves a set of candidate documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.

A reranking model, such as Cohere Rerank, is a type of model that will output a similarity score when given a query and document pair. This score can be used to reorder the documents that are most relevant to the search query. Among the reranking methodologies, the Cohere Rerank model stands out for its ability to significantly enhance search accuracy. The model diverges from traditional embedding models by employing deep learning to evaluate the alignment between each document and the query directly. Cohere Rerank outputs a relevance score by processing the query and document in tandem, which results in a more nuanced document selection process.

In the following example, the application was presented with the query: “When was the transformer paper coauthored by Aidan Gomez published?” A top-k retrieval with k = 6 returned the results shown in the image; the retrieved result set did contain the most accurate result, although it was at the bottom of the list. With k = 3, the most relevant document would not have been included in the retrieved results.

Cohere Rerank aims to reassess and reorder the relevance of the retrieved documents based on additional criteria, such as semantic content, user intent, and contextual relevance, to output a similarity score. This score is then used to reorder the documents by relevance of the query. The following image shows reorder results using Rerank.

By applying Cohere Rerank after the first-stage retrieval, the RAG orchestration gains the benefits of both approaches. While first-stage retrieval helps capture relevant items based on proximity matches within the vector space, reranking optimizes the final ordering so that the most contextually relevant results are surfaced to the top. The following diagram demonstrates this improved efficiency.

The latest version of Cohere Rerank, Rerank 3, is purpose-built to enhance enterprise search and RAG systems. Rerank 3 offers state-of-the-art capabilities for enterprise search, including:

  • 4k context length to significantly improve search quality for longer documents
  • Ability to search over multi-aspect and semi-structured data (such as emails, invoices, JSON documents, code, and tables)
  • Multilingual coverage of more than 100 languages
  • Improved latency and lower total cost of ownership (TCO)

The endpoint takes in a query and a list of documents, and it produces an ordered array with each document assigned a relevance score. This provides a powerful semantic boost to the search quality of any keyword or vector search system without requiring any overhaul or replacement.

Developers and businesses can access Rerank on Cohere’s hosted API and on Amazon SageMaker. This post offers a step-by-step walkthrough of consuming Cohere Rerank on Amazon SageMaker.

Solution overview

This solution follows these high-level steps:

  1. Subscribe to the model package
  2. Create an endpoint and perform real-time inference

Prerequisites

For this walkthrough, you must have the following prerequisites:

  1. The cohere-aws notebook.

This is a reference notebook, and it cannot run unless you make changes suggested in the notebook. It contains elements that render correctly in the Jupyter interface, so you need to open it from an Amazon SageMaker notebook instance or in Amazon SageMaker Studio.

  2. An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess policy attached. To deploy this machine learning (ML) model successfully, choose one of the following options:
    1. If your AWS account does not have a subscription to Cohere Rerank 3 Model – Multilingual, your IAM role needs to have the following three permissions, and you need to have the authority to make AWS Marketplace subscriptions in the AWS account used:
      • aws-marketplace:ViewSubscriptions
      • aws-marketplace:Unsubscribe
      • aws-marketplace:Subscribe
    2. If your AWS account has a subscription to Cohere Rerank 3 Model – Multilingual, you can skip the instructions for subscribing to the model package.

Refrain from using full access in production environments. Security best practice is to opt for the principle of least privilege.

Implement Rerank 3 on Amazon SageMaker

To improve RAG performance using Cohere Rerank, use the instructions in the following sections.

Subscribe to the model package

To subscribe to the model package, follow these steps:

  1. In AWS Marketplace, open the model package listing page Cohere Rerank 3 Model – Multilingual
  2. Choose Continue to Subscribe.
  3. On the Subscribe to this software page, review the End User License Agreement (EULA), pricing, and support terms and choose Accept Offer.
  4. Choose Continue to configuration and then choose a Region. You will see a Product ARN displayed, as shown in the following screenshot. This is the model package Amazon Resource Name (ARN) that you need to specify while creating a deployable model using Boto3. Copy the ARN corresponding to your Region and enter it in the following cell.

The code snippets included in this post are sourced from the cohere-aws notebook. If you encounter any issues with this code, refer to the notebook for the most up-to-date version.

!pip install --upgrade cohere-aws
# if you upgrade the package, you need to restart the kernel

from cohere_aws import Client
import boto3

On the Configure for AWS CloudFormation page shown in the following screenshot, under Product Arn, make a note of the last part of the product ARN to use as the value in the variable cohere_package in the following code.

cohere_package = "cohere-rerank-multilingual-v3--13dba038aab73b11b3f0b17fbdb48ea0"

model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name

if region not in model_package_map.keys():
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, refer to the Amazon SageMaker Developer Guide.

Create an endpoint

To create an endpoint, use the following code.

co = Client(region_name=region)

co.create_endpoint(
    arn=model_package_arn,
    endpoint_name="cohere-rerank-multilingual-v3-0",
    instance_type="ml.g5.2xlarge",
    n_instances=1,
)

# If the endpoint is already created, you just need to connect to it
# co.connect_to_endpoint(endpoint_name="cohere-rerank-multilingual-v3-0")

After the endpoint is created, you can perform real-time inference.

Create the input payload

To create the input payload, use the following code.

documents = [
    {"Title":"Contraseña incorrecta","Content":"Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"Title":"Confirmation Email Missed","Content":"Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"Title":"أسئلة حول سياسة الإرجاع","Content":"مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"Title":"Customer Support is Busy","Content":"Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"Title":"Falschen Artikel erhalten","Content":"Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"Title":"Customer Service is Unavailable","Content":"Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"Title":"Return Policy for Defective Product","Content":"Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"Title":"收到错误物品","Content":"早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"Title":"Return Defective Product","Content":"Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

 

Perform real-time inference

To perform real-time inference, use the following code.

 

response = co.rerank(documents=documents, query='What emails have been about returning items?', rank_fields=["Title","Content"], top_n=5)

Visualize output

To visualize output, use the following code.

print(f'Documents: {response}')

The following screenshot shows the output response.
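If you prefer to read the top documents alongside their relevance scores instead of the raw response, the following minimal sketch shows one way to do it. It assumes the response object exposes a results list whose items carry index and relevance_score attributes, as in the Cohere SDK; attribute names may differ across package versions.

# Sketch: list the reranked documents with their relevance scores.
# Assumes `response.results` items expose `index` and `relevance_score`.
for hit in response.results:
    doc = documents[hit.index]
    print(f"{hit.relevance_score:.3f}  {doc['Title']}")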

Cleanup

To avoid any recurring charges, use the following steps to clean up the resources created in this walkthrough.

Delete the endpoint

Now that you have successfully performed a real-time inference, you do not need the endpoint anymore. You can terminate the endpoint to avoid being charged.

co.delete_endpoint()
co.close()

Unsubscribe from the listing (optional)

If you want to unsubscribe from the model package, follow these steps. Before you cancel the subscription, make sure that you don’t have a deployable model created from the model package or using the algorithm. You can find this information by looking at the container name associated with the model.

To unsubscribe from the product in AWS Marketplace, follow these steps:

  1. On the Your Software subscriptions page, choose the Machine Learning tab.
  2. Locate the listing that you want to cancel the subscription for, and then choose Cancel Subscription.

Summary

RAG is a powerful technique for developing AI applications that integrate real-time data and enable interactive conversations using proprietary information. RAG enhances AI responses by tapping into external, domain-specific knowledge sources, but its effectiveness depends on finding the right source materials. This post focused on improving search efficiency and accuracy in RAG systems using Cohere Rerank. RAG orchestration typically involves two steps: retrieval of relevant documents and generation of answers. While dense retrieval is efficient for large datasets, it can struggle with complex data and questions due to information compression. Cohere Rerank uses deep learning to evaluate the alignment between documents and queries, outputting a relevance score that enables more nuanced document selection.

Customers can find Cohere Rerank 3 and Cohere Rerank 3 Nimble on Amazon SageMaker JumpStart.


About the Authors

Shashi Raina is a Senior Partner Solutions Architect at Amazon Web Services (AWS), where he specializes in supporting generative AI (GenAI) startups. With close to 6 years of experience at AWS, Shashi has developed deep expertise across a range of domains, including DevOps, analytics, and generative AI.

Pradeep Prabhakaran is a Senior Manager – Solutions Architecture at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform.

Read More

Unlock AWS Cost and Usage insights with generative AI powered by Amazon Bedrock

Unlock AWS Cost and Usage insights with generative AI powered by Amazon Bedrock

Managing cloud costs and understanding resource usage can be a daunting task, especially for organizations with complex AWS deployments. AWS Cost and Usage Reports (AWS CUR) provides valuable data insights, but interpreting and querying the raw data can be challenging.

In this post, we explore a solution that uses generative artificial intelligence (AI) to generate a SQL query from a user’s question in natural language. This solution can simplify the process of querying CUR data stored in an Amazon Athena database using SQL query generation, running the query on Athena, and representing it on a web portal for ease of understanding.

The solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Challenges addressed

The following challenges can hinder organizations from effectively analyzing their CUR data, leading to potential inefficiencies, overspending, and missed opportunities for cost optimization. We aim to address and simplify them using generative AI with Amazon Bedrock.

  • Complexity of SQL queries – Writing SQL queries to extract insights from CUR data can be complex, especially for non-technical users or those unfamiliar with the CUR data structure (unless you’re a seasoned database administrator)
  • Data accessibility – To gain insights from structured data in databases, users need to get access to databases, which can be a potential threat to overall data protection
  • User-friendliness – Traditional methods of analyzing CUR data often lack a user-friendly interface, making it challenging for non-technical users to take advantage of the valuable insights hidden within the data

Solution overview

The solution that we discuss is a web application (chatbot) that allows you to ask questions related to your AWS costs and usage in natural language. The application generates SQL queries based on the user’s input, runs them against an Athena database containing CUR data, and presents the results in a user-friendly format. The solution combines the power of generative AI, SQL generation, database querying, and an intuitive web interface to provide a seamless experience for analyzing CUR data.

The solution uses the following AWS services:

The following diagram illustrates the solution architecture.

Figure 1. Architecture of the solution

The data flow consists of the following steps:

  1. The CUR data is stored in Amazon S3.
  2. Athena is configured to access and query the CUR data stored in Amazon S3.
  3. The user interacts with the Streamlit web application and submits a natural language question related to AWS costs and usage.

Figure 2. Chatbot dashboard for asking a question

  4. The Streamlit application sends the user’s input to Amazon Bedrock, and the LangChain application facilitates the overall orchestration.
  5. The LangChain code uses the BedrockChat class from LangChain to invoke the FM on Amazon Bedrock and generate a SQL query based on the user’s input.

Figure 3. Initialization of the SQL chain

  6. The generated SQL query is run against the Athena database, which queries the CUR data stored in Amazon S3.
  7. The query results are returned to the LangChain application.

Figure 4. Generated query in the application output logs

  8. LangChain sends the SQL query and query results back to the Streamlit application.
  9. The Streamlit application displays the SQL query and query results to the user in a formatted and user-friendly manner.

Figure 5. Final output presented on the chatbot web app, including the SQL query and the query results

Prerequisites

To set up this solution, you should have the following prerequisites:

Configure the solution

Complete the following steps to set up the solution:

  1. Create an Athena database and table to store your CUR data. Make sure the necessary permissions and configurations are in place for Athena to access the CUR data stored in Amazon S3.
  2. Set up your compute environment to call Amazon Bedrock APIs. Make sure you associate an IAM role with this environment that has IAM policies that grant access to Amazon Bedrock.
  3. When your instance is up and running, install the following libraries that are used for working within the environment:
pip install langchain==0.2.0 langchain-experimental==0.0.59 langchain-community==0.2.0 langchain-aws==0.1.4 pyathena==3.8.2 sqlalchemy==2.0.30 streamlit==1.34.0
  4. Use the following code to establish a connection to the Athena database using the langchain library and the pyathena library, and configure the language model to generate SQL queries based on user input using Amazon Bedrock. You can save this file as cur_lib.py.
from langchain_experimental.sql import SQLDatabaseChain
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine, URL
from langchain_aws import ChatBedrock as BedrockChat
from pyathena.sqlalchemy.rest import AthenaRestDialect

class CustomAthenaRestDialect(AthenaRestDialect):
    def import_dbapi(self):
        import pyathena
        return pyathena

# DB variables: update these to match your Region, database, and S3 staging location
connathena = "athena.us-west-2.amazonaws.com"
portathena = '443'
schemaathena = 'mycur'
s3stagingathena = 's3://cur-data-test01/athena-query-result/'
wkgrpathena = 'primary'
connection_string = f"awsathena+rest://@{connathena}:{portathena}/{schemaathena}?s3_staging_dir={s3stagingathena}/&work_group={wkgrpathena}"
url = URL.create(
    "awsathena+rest",
    host=connathena,
    port=int(portathena),
    database=schemaathena,
    query={"s3_staging_dir": s3stagingathena, "work_group": wkgrpathena},
)
engine_athena = create_engine(url, dialect=CustomAthenaRestDialect(), echo=False)
db = SQLDatabase(engine_athena)

# Setup LLM
model_kwargs = {"temperature": 0, "top_k": 250, "top_p": 1, "stop_sequences": ["\n\nHuman:"]}
llm = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0", model_kwargs=model_kwargs)

# Create the prompt
QUERY = """
Create a syntactically correct athena query for AWS Cost and Usage report to run on the my_c_u_r table in mycur database based on the question, then look at the results of the query and return the answer as SQLResult like a human
{question}
"""
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

def get_response(user_input):
    question = QUERY.format(question=user_input)
    result = db_chain.invoke(question)
    query = result["result"].split("SQLQuery:")[1].strip()
    rows = db.run(query)
    return f"SQLQuery: {query}\nSQLResult: {rows}"
  5. Create a Streamlit web application to provide a UI for interacting with the LangChain application. Include the input fields for users to enter their natural language questions and display the generated SQL queries and query results. You can name this file cur_app.py.
import streamlit as st
from cur_lib import get_response
import os

st.set_page_config(page_title="AWS Cost and Usage Chatbot", page_icon="chart_with_upwards_trend", layout="centered", initial_sidebar_state="auto",
menu_items={
        'Get Help': 'https://docs.aws.amazon.com/cur/latest/userguide/cur-create.html',
        #'Report a bug':,
        'About': "# The purpose of this app is to help you get better understanding of your AWS Cost and Usage report!"
    })#HTML title
st.title("_:orange[Simplify] CUR data_ :sunglasses:")

def format_result(result):
    parts = result.split("\nSQLResult: ")
    if len(parts) > 1:
        sql_query = parts[0].replace("SQLQuery: ", "")
        sql_result = parts[1].strip("[]").split("), (")
        formatted_result = []
        for row in sql_result:
            formatted_result.append(tuple(item.strip("(),'") for item in row.split(", ")))
        return sql_query, formatted_result
    else:
        return result, []

def main():
    # Get the current directory
    current_dir = os.path.dirname(os.path.abspath(__file__))
    st.markdown("<div class='main'>", unsafe_allow_html=True)
    st.title("AWS Cost and Usage chatbot")
    st.write("Ask a question about your AWS Cost and Usage Report:")
  6. Connect the LangChain application and Streamlit web application by calling the get_response function, then format and display the SQL query and result in the Streamlit web application. Append the following code to the preceding application code:
# Create a session state variable to store the chat history
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = []

    user_input = st.text_input("You:", key="user_input")

    if user_input:
        try:
            result = get_response(user_input)
            sql_query, sql_result = format_result(result)
            st.code(sql_query, language="sql")
            if sql_result:
                st.write("SQLResult:")
                st.table(sql_result)
            else:
                st.write(result)
            st.session_state.chat_history.append({"user": user_input, "bot": result})
            st.text_area("Conversation:", value="\n".join([f"You: {chat['user']}\nBot: {chat['bot']}" for chat in st.session_state.chat_history]), height=300)
        except Exception as e:
            st.error(str(e))

    st.markdown("</div>", unsafe_allow_html=True)

if __name__ == "__main__":
    main()
  7. Deploy the Streamlit application and LangChain application to your hosting environment, such as Amazon EC2 or an AWS Lambda function.
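Before or after deployment, you can exercise the LangChain layer on its own to confirm the pieces work together. The following is a minimal sketch, assuming cur_lib.py from the earlier step is on your Python path and that your AWS credentials, Region, and Athena settings match the variables defined there; the sample question is illustrative only.

# Quick smoke test of the LangChain layer, independent of the Streamlit UI.
# Assumes cur_lib.py (defined earlier) is in the same directory.
from cur_lib import get_response

if __name__ == "__main__":
    answer = get_response("What were my top 5 services by cost last month?")
    print(answer)  # prints the generated SQL query and the Athena result rows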

Clean up

You incur Amazon Bedrock charges only when you invoke the model through this solution. To avoid ongoing charges for Amazon S3 storage for saving the CUR reports, you can remove the CUR data and the S3 bucket. If you set up the solution using Amazon EC2, make sure you stop or delete the instance when you’re done.

Benefits

This solution offers the following benefits:

  • Simplified data analysis – You can analyze CUR data in natural language using generative AI, eliminating the need for advanced SQL knowledge
  • Increased accessibility – The web-based interface makes it straightforward for non-technical users to access and gain insights from CUR data without needing database credentials
  • Time-saving – You can quickly get answers to your cost and usage questions without manually writing complex SQL queries
  • Enhanced visibility – The solution provides visibility into AWS costs and usage, enabling better cost-optimization and resource management decisions

Summary

The AWS CUR chatbot solution combines SQL query generation with Anthropic Claude on Amazon Bedrock, query execution on Athena, and a user-friendly web interface to simplify the analysis of CUR data. By allowing you to ask natural language questions, the solution removes barriers and empowers both technical and non-technical users to gain valuable insights into AWS costs and resource usage. With this solution, organizations can make more informed decisions, optimize their cloud spending, and improve overall resource utilization. We recommend that you do due diligence while setting this up, especially for production; you can also choose other programming languages and frameworks according to your preferences and needs.

Amazon Bedrock enables you to build powerful generative AI applications with ease. Accelerate your journey by following the quick start guide on GitHub and using Amazon Bedrock Knowledge Bases to rapidly develop cutting-edge Retrieval Augmented Generation (RAG) solutions or enable generative AI applications to run multistep tasks across company systems and data sources using Amazon Bedrock Agents.


About the Author

Anutosh is a Solutions Architect at AWS India. He loves to dive deep into his customers’ use cases to help them navigate through their journey on AWS. He enjoys building solutions in the cloud to help customers. He is passionate about migration and modernization, data analytics, resilience, cybersecurity, and machine learning.

Read More

Streamline workflow orchestration of a system of enterprise APIs using chaining with Amazon Bedrock Agents

Streamline workflow orchestration of a system of enterprise APIs using chaining with Amazon Bedrock Agents

Intricate workflows that require dynamic and complex API orchestration can often be complex to manage. In industries like insurance, where unpredictable scenarios are the norm, traditional automation falls short, leading to inefficiencies and missed opportunities. With the power of intelligent agents, you can simplify these challenges. In this post, we explore how chaining domain-specific agents using Amazon Bedrock Agents can transform a system of complex API interactions into streamlined, adaptive workflows, empowering your business to operate with agility and precision.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Benefits of chaining Amazon Bedrock Agents

Designing agents is like designing other software components—they tend to work best when they have a focused purpose. When you have focused, single-purpose agents, combining them into chains can allow them to solve significantly complex problems together. Using natural language processing (NLP) and OpenAPI specs, Amazon Bedrock Agents dynamically manages API sequences, minimizing dependency management complexities. Additionally, agents enable conversational context management in real-time scenarios, using session IDs and, if necessary, backend databases like Amazon DynamoDB for extended context storage. By using prompt instructions and API descriptions, agents collect essential information from API schemas to solve specific problems efficiently. This approach not only enhances agility and flexibility, but also demonstrates the value of chaining agents to simplify complex workflows and solve larger problems effectively.
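To make the session-based context management concrete, the following is a minimal sketch of invoking an agent with boto3 while reusing a session ID across turns. The agent ID, alias ID, and prompts are placeholders rather than values from this solution; your deployed agents will differ.

import uuid
import boto3

# Placeholder IDs: substitute the agent and alias IDs of your deployed orchestrator agent.
AGENT_ID = "<your-agent-id>"
AGENT_ALIAS_ID = "<your-agent-alias-id>"

client = boto3.client("bedrock-agent-runtime")
session_id = str(uuid.uuid4())  # reuse the same session ID to keep conversational context

def ask_agent(prompt: str) -> str:
    """Send one turn to the agent and collect the text of its streamed reply."""
    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=prompt,
    )
    parts = []
    for event in response["completion"]:  # the completion is an event stream
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)

print(ask_agent("I want to file a claim for my car."))
print(ask_agent("Yes, I can upload photos of the damage."))  # same session keeps the context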

In this post, we explore an insurance claims use case, where we demonstrate the concept of chaining with Amazon Bedrock Agents. This involves an orchestrator agent calling and interacting with other agents to collaboratively perform a series of tasks, enabling efficient workflow management.

Solution overview

For our use case, we develop a workflow for an insurance digital assistant focused on streamlining tasks such as filing claims, assessing damages, and handling policy inquiries. The workflow simulates API sequencing dependencies, such as conducting fraud checks during claim creation and analyzing uploaded images for damage assessment if the user provides images. Guided by natural language prompts and OpenAPI specifications, the orchestration dynamically adapts to user scenarios, such as users opting in or out of image uploads for damage assessment, failing fraud checks, or asking a variety of questions related to their insurance policies and coverages. This flexibility is achieved by chaining domain-specific agents: an insurance orchestrator agent, a policy information agent, and a damage analysis notification agent.

Traditionally, insurance processes are rigid, with fixed steps for tasks like fraud detection. However, agent chaining allows for greater flexibility and adaptability, enabling the system to respond to real-time user inputs and variations in scenarios. For instance, instead of strictly adhering to predefined thresholds for fraud checks, the agents can dynamically adjust the workflow based on user interactions and context. Similarly, when users choose to upload images while filing a claim, the workflow can perform real-time damage analysis and immediately send a summary to claims adjusters for further review, enabling a quicker response and more accurate decision-making. This approach not only streamlines the claims process but also allows for a more nuanced and efficient handling of tasks, providing the necessary balance between automation and human intervention. By chaining Amazon Bedrock Agents, we create an adaptable system that caters to diverse user needs while maintaining the integrity of business processes.

The following diagram illustrates the end-to-end insurance claims workflow using chaining with Amazon Bedrock Agents.

End to end architecture of insurance claims workflow

The diagram shows how specialized agents use various tools to streamline the entire claims process—from filing claims and assessing damages to answering customer questions about insurance policies.

Prerequisites

Before proceeding, make sure you have the following resources set up:

Deploy the solution with AWS CloudFormation

Complete the following steps to set up the solution resources:

  1. Sign in to the AWS Management Console as an IAM administrator or appropriate IAM user.
  2. Choose Launch Stack to deploy the CloudFormation template.
  3. Provide the necessary parameters and create the stack.

For this setup, we use us-east-1 as our AWS Region, the Anthropic Claude 3 Haiku model for orchestrating the flow between the different agents, the Anthropic Claude 3 Sonnet model for damage analysis of the uploaded images, and the Cohere Embed English V3 model as an embedding model to translate text from the insurance policy documents into numerical vectors, which allows for efficient search, comparison, and categorization of the documents.

If you want to choose other models on Amazon Bedrock, you can do so by making appropriate changes in the CloudFormation template. Check for appropriate model support in the Region and the features that are supported by the models.

The solution takes about 15 minutes to deploy. After the stack is deployed, you can view its various outputs on the Outputs tab of the CloudFormation stack, as shown in the following screenshot.

Cloudformation output from deployed stack

The following screenshot shows the three Amazon Bedrock agents that were deployed in your account.

All deployed Bedrock agents

Test the claims creation, damage detection, and notification workflows

The first part of the deployed solution mimics filing a new insurance claim, fraud detection, optional damage analysis of uploaded images, and subsequent notification to claims adjusters. This is a smaller-scale example of task automation achieved by chaining agents, each performing a set of specific tasks to address a particular business problem. The agents work in harmony to solve the larger function of insurance claims handling.

Let’s explore the architecture of the claim creation workflow, where the insurance orchestrator agent and the damage analysis notification agent work together to simulate filing new claims, assessing damages, and sending a summary of damages to the claim adjusters for human oversight. The following diagram illustrates this workflow.

Workflow to simulate filing new claims, assessing damages, and sending a summary of damages to the claim adjusters

In this workflow, the insurance orchestrator agent mimics fraud detection and claims creation as well as orchestrates handing off the responsibility to other task-specific agents. The image damage analysis notification agent is responsible for doing a preliminary analysis of the images uploaded for a damage. This agent invokes a Lambda function that internally calls the Anthropic Claude Sonnet large language model (LLM) on Amazon Bedrock to perform preliminary analysis on the images. The LLM generates a summary of the damage, which is sent to an SQS queue, and is subsequently reviewed by the claim adjusters.
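The exact Lambda implementation is part of the deployed CloudFormation stack, but its core calls look roughly like the following sketch: read the uploaded image, ask Claude 3 Sonnet on Amazon Bedrock for a damage summary, and push that summary to the SQS queue for the claims adjusters. The bucket name, queue URL, and event shape below are placeholders; an actual Amazon Bedrock Agents action group Lambda receives and returns a more structured payload.

import base64
import json
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")
sqs = boto3.client("sqs")

# Placeholders: in the deployed stack these come from environment variables or the event payload.
IMAGE_BUCKET = "<image-bucket-from-stack-outputs>"
QUEUE_URL = "<sqs-queue-url-from-stack-outputs>"

def summarize_damage(image_key: str) -> str:
    """Ask Claude 3 Sonnet for a preliminary damage summary of one uploaded image."""
    image_bytes = s3.get_object(Bucket=IMAGE_BUCKET, Key=image_key)["Body"].read()
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text",
                 "text": "Summarize the damage visible in this photo for a claims adjuster."},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

def lambda_handler(event, context):
    # Simplified: the real action group event carries parameters in a structured format.
    summary = summarize_damage(event["image_key"])
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=summary)
    return {"summary": summary}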

The NLP instruction prompts combined with the OpenAPI specifications for each action group guide the agents in their decision-making process, determining which action group to invoke, the sequence of invocation, and the required parameters for calling specific APIs.

Use the UI to invoke the claims processing workflow

Complete the following steps to invoke the claims processing workflow:

  1. From the outputs of the CloudFormation stack, choose the URL for HttpApiEndpoint.

HttpAPI endpoint for accessing the UI

  2. You can ask the chatbot sample questions to start exploring the functionality of filing a new claim.

UI Flow for create claims process

In the following example, we file a new claim and upload images as evidence for the claim.

  3. On the Amazon SQS console, you can view the SQS queue that has been created by the CloudFormation stack and check the message that shows the damage analysis from the image performed by our LLM.

Damage analysis message sent to claims adjuster

Test the policy information workflow

The following diagram shows the architecture of just the policy information agent. The policy agent accesses the Policy Information API to extract answers to insurance-related questions from unstructured policy documents such as PDF files.

End to end workflow of policy information retrieval

The policy information agent is responsible for doing a lookup against the insurance policy documents stored in the knowledge base. The agent invokes a Lambda function that internally invokes the knowledge base to find answers to policy-related questions.
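Under the hood, the Lambda function’s knowledge base lookup can be as simple as a single RetrieveAndGenerate call. The following is a minimal sketch rather than the deployed function; the knowledge base ID and model ARN are placeholders for values from your own account and Region.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholders: use the knowledge base created by the CloudFormation stack
# and a model you have access to in your Region.
KNOWLEDGE_BASE_ID = "<knowledge-base-id>"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"

def answer_policy_question(question: str) -> str:
    """Retrieve relevant policy passages and generate an answer in one call."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    return response["output"]["text"]

print(answer_policy_question("What is the deductible on my auto policy?"))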

Set up the policy documents and metadata in the data source for the knowledge base

We use Amazon Bedrock Knowledge Bases to manage our documents and metadata. As part of deploying the solution, the CloudFormation stack created a knowledge base. Complete the following steps to set up its data source:

  1. On the Amazon Bedrock console, navigate to the deployed knowledge base and then to the S3 bucket that is mentioned as its data source.

Knowledge Base

  2. Upload a few insurance policy documents and metadata documents to the S3 bucket, using the naming conventions shown in the following screenshot.

The naming conventions are <Type of Policy>_PolicyNumber.pdf for the insurance policy PDF documents and <Type of Policy>_PolicyNumber.pdf.metadata.json for the metadata documents.

Insurance policy documents and their respective metadata files

The following screenshot shows an example of what a sample metadata.json file looks like.

metadata.json file format
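For reference, the following sketch writes a metadata file of this shape; the attribute names and values are illustrative placeholders. The metadata file must sit next to its PDF in the data source bucket and use the .metadata.json suffix described earlier.

import json

# Illustrative metadata for a hypothetical Auto_Policy123.pdf; attribute names and values are placeholders.
metadata = {
    "metadataAttributes": {
        "policy_type": "Auto",
        "policy_number": "Policy123",
    }
}

with open("Auto_Policy123.pdf.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)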

  3. After the documents are uploaded to Amazon S3, navigate to the deployed knowledge base, select the data source, and choose Sync.

To understand more about how metadata support in Knowledge Bases on Amazon Bedrock helps you get accurate results, refer to Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.

  4. Now you can go back to the UI and start asking questions related to the policy documents.

The following screenshot shows the set of questions we asked to find answers related to policy coverage.

Policy Q&A

Clean up

To avoid unexpected charges, complete the following steps to clean up your resources:

  1. Delete the contents from the S3 buckets corresponding to the ImageBucketName and PolicyDocumentsBucketName keys from the outputs of the CloudFormation stack.
  2. Delete the deployed stack using the AWS CloudFormation console.

Best practices

The following are some additional best practices that you can follow for your agents:

  • Automated testing – Implement automated tests to regularly exercise the orchestration workflows. You can use mock APIs to simulate various scenarios and validate the agent’s decision-making process.
  • Version control – Maintain version control for your agent configurations and prompts in a repository. This provides traceability and quick rollback if needed.
  • Monitoring and logging – Use Amazon CloudWatch to monitor agent interactions and API calls. Set up alarms for unexpected behaviors or failures.
  • Continuous integration – Set up a continuous integration and delivery (CI/CD) pipeline that integrates automated testing, prompt validation, and deployment to maintain smooth updates without disrupting ongoing workflows.

Conclusion

In this post, we demonstrated the power of chaining Amazon Bedrock agents, offering a fresh perspective on integrating back-office automation workflows and enterprise APIs. This solution offers several benefits: as new enterprise APIs emerge, dependencies in existing ones can be minimized, reducing coupling. Moreover, Amazon Bedrock Agents can maintain conversational context, enabling follow-up queries to use conversation history. For extended contextual memory, a more persistent backend implementation can be considered.

To learn more, refer to Amazon Bedrock Agents.


About the Author


Piyali Kamra is a seasoned enterprise architect and a hands-on technologist with over two decades of experience building and executing large-scale enterprise IT projects across geographies. She believes that building large-scale enterprise systems is not an exact science but more like an art: you can’t always choose the best technology that comes to mind; rather, tools and technologies must be carefully selected based on the team’s culture, strengths, weaknesses, and risks, in tandem with a futuristic vision of how you want to shape your product a few years down the road.

Read More