Derive meaningful and actionable operational insights from AWS using Amazon Q Business

As a customer, you rely on Amazon Web Services (AWS) expertise to be available and to understand your specific environment and operations. Today, you might implement manual processes to summarize lessons learned, obtain recommendations, or expedite the resolution of an incident. This can be time-consuming, inconsistent, and not readily accessible.

This post shows how to use AWS generative artificial intelligence (AI) services, like Amazon Q Business, with AWS Support cases, AWS Trusted Advisor, and AWS Health data to derive actionable insights based on common patterns, issues, and resolutions, while benefiting from the AWS recommendations and best practices surfaced by support data. This post also demonstrates how you can integrate these insights with your IT service management (ITSM) system (such as ServiceNow, Jira, or Zendesk) so you can implement recommendations and keep your AWS operations healthy.

Amazon Q Business is a fully managed, secure, generative AI-powered enterprise chat assistant that enables natural language interactions with your organization’s data. Ingesting data from support cases, Trusted Advisor checks, and AWS Health notifications into Amazon Q Business enables interactions through natural language conversations, sentiment analysis, and root cause analysis without needing to fully understand the underlying data models or schemas. The AI assistant provides answers along with links that point directly to the data sources. This allows you to easily identify and reference the underlying information sources that informed the AI’s response, providing more context and enabling further exploration of the topic if needed. Amazon Q Business integrates with ITSM solutions, allowing recommendations to be tracked and actioned within your existing workflows.

AWS Support offers a range of capabilities powered by technology and subject matter experts that support the success and operational health of your AWS environments. AWS Support provides you with proactive planning and communications, advisory, automation, and cloud expertise to help you achieve business outcomes with increased speed and scale in the cloud. These capabilities enable proactive planning for upcoming changes, expedited recovery from operational disruptions, and recommendations to optimize the performance and reliability of your AWS IT infrastructure.

This solution will demonstrate how to deploy Amazon Q Business and ingest data from AWS Support cases, AWS Trusted Advisor, and AWS Health using the provided code sample to generate insights based on your support data.

Overview of solution

Today, Amazon Q Business provides 43 connectors available to natively integrate with multiple data sources. In this post, we’re using the APIs for AWS Support, AWS Trusted Advisor, and AWS Health to programmatically access the support datasets and use the Amazon Q Business native Amazon Simple Storage Service (Amazon S3) connector to index support data and provide a prebuilt chatbot web experience. The AWS Support, AWS Trusted Advisor, and AWS Health APIs are available for customers with Enterprise Support, Enterprise On-Ramp, or Business support plans.
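
As a rough illustration, the same datasets can be pulled with the AWS CLI (the QSI collectors make the equivalent API calls from AWS Lambda); the Region and options shown here are assumptions you would adapt to your environment:

# AWS Support cases, including resolved cases and case communications
aws support describe-cases --include-resolved-cases --include-communications --region us-east-1

# Trusted Advisor check metadata, retrieved through the Support API
aws support describe-trusted-advisor-checks --language en --region us-east-1

# AWS Health events for the account (the AWS Health API uses a global endpoint in us-east-1)
aws health describe-events --region us-east-1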

Q Support Insights (QSI) is the name of the solution provided in the code sample repository. QSI enables insights on your AWS Support datasets across your AWS accounts. The following diagram describes at a high level the QSI solution and components.

Figure 1: Overview of the QSI solution

There are two major components in the QSI solution. First, as illustrated in the Linked Accounts group in Figure 1, this solution supports datasets from linked accounts and aggregates your data using the various APIs, AWS Lambda, and Amazon EventBridge. Second, the support datasets from linked accounts are stored in a central S3 bucket that you own, as shown in the Data Collection Account group in Figure 1. These datasets are then indexed using the Amazon Q Business S3 connector.

Under the hood, the Amazon Q Business S3 connector creates a searchable index of your AWS Support datasets and gathers relevant details such as case titles, descriptions, best practices, and dates. The generative AI capabilities of Amazon Q Business enable it to synthesize insights and generate natural language responses available for users in the Amazon Q Business web chat experience. Amazon Q Business also supports plugins and actions so users can directly create tickets in the ITSM system without leaving the chat experience.

By default, Amazon Q Business will only produce responses using the data you’re indexing. This behavior is aligned with the use cases related to our solution. If needed, this response setting can be changed to allow Amazon Q to fall back to large language model (LLM) knowledge.

Walkthrough

The high-level steps to deploy the solution are the following:

  1. Create the necessary buckets to contain the support cases exports and deployment resources.
  2. Upload the support datasets (AWS Support cases, AWS Trusted Advisor, and AWS Health) to the S3 data source bucket.
  3. Create the Amazon Q Business application, the data source, and required components using deployment scripts.
  4. Optionally, configure ITSM integration by using one of the available Amazon Q Business built-in plugins.
  5. Synchronize the data source to index the data.
  6. Test the solution through chat.

The full guidance and deployment options are available in the aws-samples GitHub repository. The solution can be deployed in a single account or across an AWS Organizations organization. In addition to the data security and protection Amazon Q Business supports, this solution integrates with your identity provider and respects access control lists (ACLs), so users get answers based on their unique permissions. This solution also provides additional controls to include or exclude specific accounts.

Prerequisites

For this solution to work, the following prerequisites are needed:

  • An AWS Support plan that includes API access to AWS Support, AWS Trusted Advisor, and AWS Health (Business, Enterprise On-Ramp, or Enterprise Support)
  • AWS IAM Identity Center configured in the same Region as the Amazon Q Business application (us-east-1 or us-west-2)
  • An S3 bucket in the data collection central account to store the collected support datasets

Create the Amazon Q Business application using the deployment scripts

Using the Amazon Q Business application creation module, you can set up and configure an Amazon Q Business application, along with its crucial components, in an automated manner. These components include an Amazon S3 data source connector, the required IAM roles, and the Amazon Q Business web experience.

Deploy the Amazon Q Business application

As stated in the preceding prerequisites section, IAM Identity Center must be configured in the same Region (us-east-1 or us-west-2) as your Amazon Q Business application.

To deploy and use the Amazon Q Business application, follow the steps described in the Amazon Q Business application creation module. The steps can be summarized as:

  1. Launch an AWS CloudShell in either the us-east-1 or us-west-2 Region in your data collection central account and clone the repository from GitHub.
  2. Navigate to the repository directory and run the deployment script, providing the required inputs when prompted (see the sketch after this list). As stated in the prerequisites, an S3 bucket name is required in the data collection central account.
  3. After deployment, synchronize the data source, assign access to users and groups, and use the deployed web experience URL to interact with the Amazon Q Business application.
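
In AWS CloudShell, the sequence looks roughly like the following. The repository URL and script name are placeholders; use the ones documented in the aws-samples repository:

# Clone the QSI code sample repository (placeholder URL)
git clone https://github.com/aws-samples/<qsi-repository>.git
cd <qsi-repository>

# Run the Amazon Q Business application deployment script (placeholder name),
# supplying the S3 data source bucket in the data collection account when prompted
bash deploy_qbusiness_app.sh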

[Optional] Integrate your ITSM system

To integrate with your ITSM system, follow these steps:

  1. Within the Amazon Q Business application page, choose Plugins in the navigation pane and choose Add plugin.
  2. From the list of available plugins, select the one that matches your system. For example, Jira, ServiceNow, or Zendesk.
  3. Enter the details on the next screen (see Figure 2) so the Amazon Q Business application can make the connection. With this integration, users can log tickets to your IT teams directly from Amazon Q Business, based on data within the Amazon Q Business application.

Figure 2: The Amazon Q Business plugin creation page

Support Collector

You can use the Support Collector module to set up and configure Amazon EventBridge to collect support-related data. This data includes information from AWS Support cases, AWS Trusted Advisor, and AWS Health. The collected data is then uploaded to a designated S3 bucket in the data collection account. The solution will retrieve up to 6 months of data by default, though you can change the timeframe to a maximum of 12 months.

Additionally, the Support Collector can synchronize with the latest updates on a daily basis, ensuring that your support data is always up to date. The Support Collector is configured through an AWS Lambda function and EventBridge, offering flexibility in terms of the data sources (AWS Support cases, AWS Trusted Advisor, and AWS Health) you want to include or exclude. You can choose data from one, two, or all three of these sources by configuring the appropriate scheduler.
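
For example, a daily collection schedule can be expressed with EventBridge Scheduler roughly as follows. The schedule name, function name, and ARNs are placeholders; the solution’s deployment scripts create the equivalent resources for you:

# Create a daily schedule that invokes the support collector Lambda function (names and ARNs are placeholders)
aws scheduler create-schedule \
  --name qsi-support-collector-daily \
  --schedule-expression "rate(1 day)" \
  --flexible-time-window Mode=OFF \
  --target '{"Arn":"arn:aws:lambda:us-east-1:111122223333:function:qsi-support-collector","RoleArn":"arn:aws:iam::111122223333:role/qsi-scheduler-role"}'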

Deploy the Support Collector

To deploy and use the Support Collector, follow the steps described in the Support Collector module.

The repository contains scripts and resources to automate the deployment of Lambda functions in designated member accounts. The deployed Lambda functions collect and upload AWS Support data (Support Cases, Health Events, and Trusted Advisor Checks) to an S3 bucket in the data collection central account. The collected data can be analyzed using Amazon Q Business.

There are two deployment options:

  1. AWS Organizations (StackSet): Use this option if you have AWS Organizations set up and want to deploy in accounts under organizational units. It creates a CloudFormation StackSet in the central account to deploy resources (IAM roles, Lambda functions, and EventBridge) across member accounts.
  2. Manual deployment of individual accounts (CloudFormation): Use this option if you don’t want to use AWS Organizations and want to target a few accounts. It creates a CloudFormation stack in a member account to deploy resources (IAM roles, Lambda functions, and EventBridge).

After deployment, an EventBridge scheduler periodically invokes the Lambda function to collect support data and store it in the data collection S3 bucket. Testing the Lambda function is possible with a custom payload. The deployment steps are fully automated using a shell script. The Q Support Insights (QSI) – AWS Support Collection Deployment guide, located in the src/support_collector subdirectory, outlines the steps to deploy the resources.
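
Testing with a custom payload can be done from the AWS CLI along these lines. The function name and payload fields are placeholders; use the payload format described in the deployment guide:

# Invoke the collector Lambda function with a custom test payload (name and fields are placeholders)
aws lambda invoke \
  --function-name qsi-support-collector \
  --payload '{"data_type": "support-cases", "lookback_months": 6}' \
  --cli-binary-format raw-in-base64-out \
  response.json

cat response.json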

Amazon Q Business web experience

You can ask support-related questions using the Amazon Q Business web experience after you have the relevant support data collected in the S3 bucket and successfully indexed. For steps to configure and collect the data, see the preceding Support Collector section. Using the web experience, you can then ask questions as shown in the following demonstration.

Figure 3: Using the Amazon Q Business web experience to get troubleshooting, performance, and operational recommendations

Sample prompts

Try some of the following sample prompts:

  • I am having trouble with EKS add-on installation failures. It is giving ConfigurationConflict errors. Based on past support cases, please provide a resolution.
  • List AWS Account IDs with insufficient IPs
  • List health events with increased error rates
  • List services being deprecated this year
  • My Lambda function is running slow. How can I speed it up?

Clean up

After you’re done testing the solution, you can delete the resources to avoid incurring additional charges. See the Amazon Q Business pricing page for more information. Follow the instructions in the GitHub repository to delete the resources and corresponding CloudFormation templates.

Conclusion

In this post, you deployed a solution that indexes data from your AWS Support datasets stored in Amazon S3 and other AWS data sources like AWS Trusted Advisor and AWS Health. This demonstrates how to use generative AI services like Amazon Q Business to find patterns across your most frequent issues and author new content such as internal documentation or an FAQ. Using support data presents a valuable opportunity to proactively address and prevent recurring issues in your AWS environment by using insights gained from past experiences. Embracing these insights enables a more resilient and optimized AWS experience tailored to your specific needs.

This solution can be expanded to cover other internal data sources your company might have, letting you use natural language to uncover optimization opportunities that your teams can implement.


About the authors

Chitresh Saxena is a Sr. Technical Account Manager specializing in generative AI solutions and dedicated to helping customers successfully adopt AI/ML on AWS. He excels at understanding customer needs and provides technical guidance to build, launch, and scale AI solutions that solve complex business problems.

Jonathan Delfour is a Principal Technical Account Manager supporting Energy customers, providing top-notch support as part of the AWS Enterprise Support team. His technical guidance and unwavering commitment to excellence ensure that customers can leverage the full potential of AWS, optimizing their operations and driving success.

Krishna Atluru is an Enterprise Support Lead at AWS. He provides customers with in-depth guidance on improving security posture and operational excellence for their workloads. Outside of work, Krishna enjoys cooking, swimming, and travelling.

Arish Labroo is a Principal Specialist Technical Account Manager – Builder supporting large AWS customers. He is focused on building strategic tools that help customers get the most value out of Enterprise Support.

Manik Chopra is a Principal Technical Account Manager at AWS. He helps customers adopt AWS services and provides guidance in various areas around data analytics and optimization. His areas of expertise include delivering solutions using Amazon QuickSight, Amazon Athena, and various other automation techniques. Outside of work, he enjoys spending time outdoors and traveling.

Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

NVIDIA NeMo Framework

NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.

The following are benefits of using NVIDIA NeMo for distributed training:

  • End-to-end pipelines for different stages such as data preparation, training, and more, which allow for a plug-and-play approach for your custom data
  • Parallelism techniques, including the following:
    • Data parallelism
    • Tensor parallelism
    • Pipeline parallelism
    • Sequence parallelism
    • Expert parallelism
    • Context parallelism
  • Memory saving techniques, including the following:
    • Selective activation recompute
    • CPU offloading (activation, weights)
    • Attention, including Flash Attention (FA 1/2, FA-cuDNN), Grouped Query Attention, Multi-Query Attention, and Sliding Window Attention
    • Distributed optimizers, including Torch FSDP, Distributed Optimizer (zero-1)
  • Data loaders for different architectures
  • Distributed checkpointing

Solution overview

You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.

Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.

The following diagram shows the solution architecture.

In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:

  1. Set up an EFA-enabled, two-node p4de.24xlarge cluster.
  2. Set up an FSx for Lustre file system so you can have a shared data repository for storing the training dataset and model checkpoints.
  3. Set up an environment for NVIDIA NeMo.
  4. Modify the NVIDIA NeMo Kubernetes manifests to prepare a dataset and train a model.

Prerequisites

You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs, which are used throughout this walkthrough:

  • AWS CLI
  • eksctl
  • kubectl
  • Helm
  • Docker

These steps may change if you are on a non-Linux platform; consult each tool’s documentation for installing the CLIs on other platforms. You also need a capacity reservation for p4de.24xlarge instances and its capacityReservationID.

Launch an EKS cluster

Amazon EC2 p4de.24xlarge instances provide NVIDIA A100 80 GB GPUs, which are highly popular for distributed training of generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.

  1. We provide the cluster creation config in p4de-cluster-config.yaml. See the following code:
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/2.nemo-launcher/EKS

eksctl create cluster -f p4de-cluster-config.yaml

The following are key points to note when creating this cluster:

  • Make sure the kubectl version and the specified Region are correct.
  • Update the capacityReservationID field and make sure to specify the availabilityZones within the managedNodeGroups section, which should be the same Availability Zone ID in which your capacity lives.
  • This configuration will create two managed node groups: one for the system nodes using c5.2xlarge instances and another for running distributed training on p4de.24xlarge instances. Managed node groups will use Amazon EKS optimized AMIs. If you want to provide a custom AMI, you can create a self-managed node group and specify a custom AMI. To find the AMI ID, refer to Retrieving Amazon EKS optimized Amazon Linux AMI IDs. For more details about the Amazon EKS optimized AMI, see the GitHub repo.
  • Make sure efaEnabled is set to true. You can use the same config for creating a cluster with other node groups. For a list of EFA supported instance types, see Supported instance types.
  • Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
  1. After the cluster is created, you can enable kubectl to communicate with your cluster by adding a new context to the kubectl config file:
    aws eks update-kubeconfig --region region-code --name my-cluster

  2. You can confirm communication with your cluster by running the following command:
    kubectl get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   28h

Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.

  1. Install the plugin with the following code:
helm repo add eks https://aws.github.io/eks-charts

helm install efa eks/aws-efa-k8s-device-plugin -n kube-system

The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.

  1. Install the plugin with the following code:
    wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
    
    kubectl apply -f nvidia-device-plugin.yml

  2. Run the following command to verify all the pods:
    kubectl get pods --all-namespaces

  3. You can run kubectl get nodes to verify the nodes.

Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.

The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.

  1. We can choose one of these private IP DNS names to further examine and describe the node as follows:
kubectl describe node ip-192-168-165-37.us-west-2.compute.internal

The preceding command returns extensive detail about the node. To make sure EFA is installed correctly, confirm that the node’s allocatable resources include EFA devices, as shown in the following screenshot.

For p4 nodes, you will see vpc.amazonaws.com/efa:4 and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.

If EFA is enabled in the node group, make sure that a security group is attached to the nodes with rules that allow all traffic originating from the same security group. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.
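
A minimal sketch of the required self-referencing rules with the AWS CLI (the security group ID is a placeholder; eksctl typically configures this for EFA-enabled node groups):

# Allow all traffic between nodes that share the security group (required for EFA)
export NODE_SG=sg-0123456789abcdef0   # placeholder: the node group security group ID

aws ec2 authorize-security-group-ingress \
  --group-id ${NODE_SG} --protocol -1 --source-group ${NODE_SG}

aws ec2 authorize-security-group-egress \
  --group-id ${NODE_SG} \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=${NODE_SG}}]"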

Create an FSx for Lustre file system

For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.

To have an FSx for Lustre file system mounted on your EKS cluster, complete the following steps:

  1. Use the following scripts to create an AWS Identity and Access Management (IAM) role and attach the FSx policy:
    export FSX_POLICY_NAME=fsx-csi
    
    wget https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/csi/fsx/fsx-policy.json
    export FSX_POLICY_DOC=file://fsx-policy.json
    
    # From EC2 Auto Scaling Group
    export EKS_INSTANCE_PROFILE_NAME=eks-1ec6fc6b-1a19-d65d-66ac-293ff0a20eb9
    
    POLICY_ARN=$(aws iam create-policy --policy-name ${FSX_POLICY_NAME} --policy-document $FSX_POLICY_DOC --query "Policy.Arn" --output text)
    
    INSTANCE_PROFILE=$(aws iam list-instance-profiles --query InstanceProfiles[?InstanceProfileName=="'${EKS_INSTANCE_PROFILE_NAME}'"].{InstanceProfileName:InstanceProfileName} --output text)
    
    ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name ${INSTANCE_PROFILE} --query InstanceProfile.Roles[0].RoleName --output text)
    
    # Attach FSx Policy to role ${ROLE_NAME} ..."
    aws iam attach-role-policy --policy-arn ${POLICY_ARN} --role-name ${ROLE_NAME}
    

  2. Use the following script to create a security group that allows EKS nodes to access the file system:
    # From EC2 console
    export MY_REGION=us-west-2
    # FSX_SUBNET_ID should be same ID the compute nodes are present in. You can get this from the EKS console 
    export FSX_SUBNET_ID=subnet-0edecd850cff2cfad
    # From EC2 Auto Scaling Group
    export FSX_SECURITY_GROUP_NAME=eks-fsx-sg
    
    # Get VPC_ID from EKS console
    export VPC_ID=vpc-04411d49af198a6ea
    
    # Create security group
    export SECURITY_GROUP_ID=$(aws ec2 create-security-group --vpc-id ${VPC_ID} --region ${MY_REGION} --group-name ${FSX_SECURITY_GROUP_NAME} --description "FSx for Lustre Security Group" --query "GroupId" --output text)
    
    export SUBNET_CIDR=$(aws ec2 describe-subnets --region ${MY_REGION} --query Subnets[?SubnetId=="'${FSX_SUBNET_ID}'"].{CIDR:CidrBlock} --output text)
    
    # Ingress rule
    aws ec2 authorize-security-group-ingress --region ${MY_REGION} --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr ${SUBNET_CIDR}
    

  3. Create a 1.2 TB Persistent_2 FSx for Lustre file system from the FSx for Lustre console in the same Availability Zone as your compute instances (FSX_SUBNET_ID), the VPC of Amazon EKS (VPC_ID), and the security group you created (SECURITY_GROUP_ID). A CLI alternative is sketched after this list.
  4. After the file system is created, note the file system ID, DNS name, and mount name from the file system details page.
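
If you prefer the CLI to the console, a 1.2 TiB Persistent_2 file system can be created roughly as follows. The throughput value is an assumption; the subnet and security group variables come from the previous scripts:

# Create a 1.2 TiB Persistent_2 FSx for Lustre file system in the compute nodes' subnet
aws fsx create-file-system \
  --region ${MY_REGION} \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids ${FSX_SUBNET_ID} \
  --security-group-ids ${SECURITY_GROUP_ID} \
  --lustre-configuration DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250

# Note the FileSystemId, DNSName, and MountName from the output (or from describe-file-systems)
aws fsx describe-file-systems --region ${MY_REGION}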

Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.

  1. Install the FSx CSI driver as follows:
    echo "Installing FSx CSI driver ..."
    kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
    
    echo "FSx pods in kube-system namespace ..."
    kubectl -n kube-system get pods | grep fsx
    

  2. Next, to mount the file system, apply the fsx-storage-class.yaml, fsx-pv.yaml, and fsx-pvc.yaml manifests:
    # Storage Class
    kubectl apply -f fsx-storage-class.yaml
    kubectl get sc
    
    # Persistent Volume
    kubectl apply -f fsx-pv.yaml
    
    # Persistent Volume Claim
    kubectl apply -f fsx-pvc.yaml
    

You can check to make sure that the volumes are in Bound state.
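
For example (the persistent volume and claim names come from the manifests you applied):

kubectl get pv,pvc
# STATUS should show Bound for the FSx for Lustre persistent volume and claim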

Set up the environment for NVIDIA NeMo

For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:

# Deploy Kubeflow training operator

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/training-operator/deploy.sh

# Configure RBAC resources

kubectl apply -f ./clusterrole-hpa-access.yaml

kubectl apply -f ./clusterrolebinding-training-operator-hpa-access.yaml

Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
# Add lease permissions for mpi-operator cluster role
kubectl apply -f ./clusterrole-mpi-operator.yaml

The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:

ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.4-1

You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:

## AWS
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

## Docker Image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=nemo-aws
export TAG=":24.01.framework"

docker build -t ${REGISTRY}${IMAGE}${TAG} -f 0.Dockerfile .

echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Create registry if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${IMAGE} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
        echo ""
        echo "Creating repository ${IMAGE} ..."
        aws ecr create-repository --repository-name ${IMAGE}
fi

# Push image
docker image push ${REGISTRY}${IMAGE}${TAG}

The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:

# Run container
docker run -it ${REGISTRY}${IMAGE}${TAG} bash

# Copy files
docker cp -a <container-id>:/opt/NeMo-Megatron-Launcher/ <Path-to-save-launcher-scripts>

In a Slurm cluster implementation, the launcher scripts, data, and results folders could reside in a file system that both the head node (the node from which jobs are submitted) and the compute nodes access. But in this Amazon EKS implementation, the node that you used to create the EKS cluster doesn’t have access to the cluster’s shared file system. To get around this, you can put the launcher scripts on the head node and the results and data folders in the file system that the compute nodes have access to.

Run NVIDIA NeMo on an EKS cluster

We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.

  1. Modify the launcher_scripts/conf/cluster/k8s.yaml file as follows. The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case.
    shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
    volumes:
      persistentVolumeClaim:
        # This claim should be created before running
        claimName: fsx-pvc
        subPath: fsx-shared  # path is mirrored into pod (no leading slash b/c relative to root)
    
    # NOTE: These args will soon be deprecated
    nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
    nfs_path: null  # Path to store data in the NFS server.
    ib_resource_name: null  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters. Can also be a list, but must be same length as ib_count
    ib_count: null  # Specify the number of IB devices to include per node in each pod. Can also be a list, but must be same length as ib_resource_name
    ib_network_annotation: ""  # Specify the networks as comma separated values
    dns_policy: null  # Specify a dnsPolicy to use in all pods, if necessary
    

  2. Install the required Python packages; this is required so that NeMo Launcher can submit jobs to the Kubernetes cluster:
sudo apt install python3-pip

pip install -r <Path-to-NeMo-Megatron-Launcher>/requirements.txt

Next, we copy the following folders from the container to the /fsx-shared/data folder:

  • NeMo-Megatron-Launcher/launcher_scripts/data/bpe
  • NeMo-Megatron-Launcher/launcher_scripts/data/nsfw
  1. To copy files into the shared file system, you can start a pod just for this purpose. Create a file fsx-share-test.yaml as follows:
    apiVersion: v1
    kind: Pod
    metadata:
      name: fsx-share-test
    spec:
      containers:
      - name: fsx-share-test
        image: ubuntu
        command: ["/bin/bash"]
        args: ["-c", "while true; do echo \"hello from FSx - $(date -u)\" >> /fsx-shared/test.txt; sleep 120; done"]
        volumeMounts:
        - name: fsx-pv
          mountPath: /fsx-shared
      volumes:
      - name: fsx-pv
        persistentVolumeClaim:
          claimName: fsx-pvc
    

  2. Run this pod and copy the files:
    kubectl apply -f fsx-share-test.yaml
    
    kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/bpe fsx-share-test:/fsx-shared/data/
    
    kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/nsfw fsx-share-test:/fsx-shared/data/
    

A few files need to be updated for data preparation to work with the EKS cluster.

  1. Modify the launcher_scripts/conf/config.yaml file:
    • For cluster, use k8s.
    • For training, use gpt3/126m.
    • For stages, this should be just data_preparation and no other stages.
    • For launcher_scripts_path, use the path to the NeMo Megatron launch scripts, which should end with /launcher_scripts.
    • For data_dir, use /fsx-shared/data (the location to store and read the data).
    • For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
    • For container, use ${REGISTRY}${IMAGE}${TAG}.
  2. Modify the conf/data_preparation/gpt3/download_gpt3_pile.yaml file:
    • Set node_array_size to 2.
    • Set file_numbers to “0-5”. With files 0-5, it should be around 350 GB of data.
  3. Modify the nemo_launcher/core/k8s_templates/data_preparation/data-prep.yaml file:
    • If you get the error that mpirun is not found, add the full path to the executable /opt/amazon/openmpi/bin/mpirun.
    • Add /fsx-shared in the container volume mount path.
    • Add the volume:
      volumes:
        - name: fsx-pv
          persistentVolumeClaim:
            claimName: fsx-pvc
  4. Launch the data preparation job:
    python3 main.py

This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.

  1. You can monitor your job status and logs using three commands: helm list, kubectl get pods, and kubectl logs --follow.
  2. When the job is finished, you can remove the Helm chart:
    helm uninstall download-gpt3-pile

You can see the downloaded data in the /fsx-shared folder by opening a shell in one of the pods with kubectl exec -it nlp-worker-0 bash.

Training

Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:

  1. Modify a parameter in the conf/config.yaml file:
    • Set stages to training and no other stages.
  2. Modify parameters in conf/training/gpt3/126m.yaml:
    • Set num_nodes to 2.
    • Set devices to 1.
    • On line 18, change use_distributed_sampler: False to replace_sampler_ddp: False.

Optionally, if you want to use a mock dataset instead of a real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.

data:
  data_impl: mock
  splits_string: "99990,8,2"
  seq_length: 2048
  skip_warmup: True
  num_workers: 2
  dataloader_type: single # cyclic
  reset_position_ids: False # Reset position ids after end-of-document token
  reset_attention_mask: False # Reset attention mask after end-of-document token
  eod_mask_loss: False # Mask loss for the end of document tokens
  index_mapping_dir: null
  data_prefix: [] # Should be weight path weight path... for a blended dataset

# You can comment out the default "data_prefix" values, as shown below.
# - ${data_dir}/my-gpt3_00_text_document
# - .0333
  1. Modify the parameters in the nemo_launcher/core/k8s_templates/training/training.yaml file:
  2. Run python3 main.py to start training and you should see the training pods by running kubectl get pods as follows:
    NAME                    READY   STATUS    RESTARTS   AGE
    nlp-training-worker-0   1/1     Running   0          168m
    nlp-training-worker-1   1/1     Running   0          168m
    

In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs --follow, you can also open a shell in your pod with kubectl exec and use nvidia-smi to check GPU status.

  1. When the job is finished, you can delete the helm chart:
    helm uninstall gpt3-126m

Model checkpoints are saved at /fsx-shared/results/checkpoints along with other training logs and TensorBoard events. By default, checkpoints are saved every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.

Troubleshooting deployment failures

If deployment fails due to incorrect setup or configuration, complete the following debug steps:

  1. Find the error message by running kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
  2. Stop any running jobs by removing the Helm chart. This can be done by running helm uninstall CHARTNAME.

Pods should be spun down after removing the Helm chart.

  3. You can double-check by running kubectl get pods.
  4. If pods are not spun down, you can manually stop them by running kubectl delete PODNAME.

Based on the error message, you may find errors from:

  • Unready nodes.
  • Missing Operators or CRDs. In this case, make sure your kubectl get pods -A output looks like that shown earlier. If errors exist, try reinstalling Operators and CRDs.
  • NeMo Framework scripts or Kubernetes manifests. This is more likely a bug or wrong setup on the NeMo side. Errors can vary.

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.

  1. To delete the file system integration with the EKS cluster, run the following command:
    kubectl delete -f ./fsx-storage-class.yaml

Not only will this delete the persistent volume, it will also delete the FSx for Lustre file system, and all the data on the file system will be lost.

  2. When the file system deletion is complete, delete the cluster by using the following script:
    eksctl delete cluster -f p4de-cluster-config.yaml

This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.

Conclusion

In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.

To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in Mechanical Engineering at Rice University and post-doctoral research at the Massachusetts Institute of Technology.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring platform. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Wenhan Tan is a Solutions Architect at NVIDIA helping customers adopt NVIDIA AI solutions at large scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.

Governing the ML lifecycle at scale, Part 2: Multi-account foundations

Your multi-account strategy is the core of your foundational environment on AWS. Design decisions around your multi-account environment are critical for operating securely at scale. Grouping your workloads strategically into multiple AWS accounts enables you to apply different controls across workloads, track cost and usage, reduce the impact of account limits, and mitigate the complexity of managing multiple virtual private clouds (VPCs) and identities by allowing different teams to access different accounts that are tailored to their purpose.

In Part 1 of this series, Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker, you learned about best practices for operating and governing machine learning (ML) and analytics workloads at scale on AWS. In this post, we provide guidance for implementing a multi-account foundation architecture that can help you organize, build, and govern the following modules: data lake foundations, ML platform services, ML use case development, ML operations, centralized feature stores, logging and observability, and cost and reporting.

We cover the following key areas of the multi-account strategy for governing the ML lifecycle at scale:

  •  Implementing the recommended account and organizational unit structure to provide isolation of AWS resources (compute, network, data) and cost visibility for ML and analytics teams
  • Using AWS Control Tower to implement a baseline landing zone to support scaling and governing data and ML workloads
  • Securing your data and ML workloads across your multi-account environment at scale using the AWS Security Reference Architecture
  • Using AWS Service Catalog to scale, share, and reuse ML resources across your multi-account environment and to implement baseline configurations for networking
  • Creating a network architecture to support your multi-account environment and facilitate network isolation and communication across your multi-tenant environment

Your multi-account foundation is the first step towards creating an environment that enables innovation and governance for data and ML workloads on AWS. By integrating automated controls and configurations into your account deployments, your teams will be able to move quickly and access the resources they need, knowing that they are secure and comply with your organization’s best practices and governance policies. In addition, this foundational environment will enable your cloud operations team to centrally manage and distribute shared resources such as networking components, AWS Identity and Access Management (IAM) roles, Amazon SageMaker project templates, and more.

In the following sections, we present the multi-account foundation reference architectures, discuss the motivation behind the architectural decisions made, and provide guidance for implementing these architectures in your own environment.

Organizational units and account design

You can use AWS Organizations to centrally manage accounts across your AWS environment. When you create an organization, you can create hierarchical groupings of accounts within organizational units (OUs). Each OU is typically designed to hold a set of accounts that have common operational needs or require a similar set of controls.

The recommended OU structure and account structure you should consider for your data and ML foundational environment is based on the AWS whitepaper Organizing Your AWS Environment Using Multiple Accounts. The following diagram illustrates the solution architecture.

Only the OUs that are relevant to the ML and data platform are shown. You can add other OUs alongside the recommended ones. The following sections discuss how these recommended OUs serve your ML and data workloads and the specific accounts you should consider creating within them.

The following images illustrate the account structure for setting up a multi-account foundation and how it appears in AWS Organizations once implemented.

Recommended OUs

The recommended OUs include Security, Infrastructure, Workloads, Deployments, and Sandbox. If you deploy AWS Control Tower, which is strongly recommended, it creates two default OUs: Security and Sandbox. You should use these default OUs and create the other three. For instructions, refer to Create a new OU.
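
With AWS Control Tower, you would typically create and register the additional OUs from the AWS Control Tower console. For illustration, the underlying AWS Organizations calls look roughly like the following (the root ID is discovered at runtime; the OU names follow this post’s recommendations):

# Find the organization root ID
ROOT_ID=$(aws organizations list-roots --query "Roots[0].Id" --output text)

# Create the three additional OUs recommended for the ML platform
for OU in Infrastructure Workloads Deployments; do
  aws organizations create-organizational-unit --parent-id ${ROOT_ID} --name ${OU}
done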

Security OU

The Security OU stores the various accounts related to securing your AWS environment. This OU and the accounts therein are typically owned by your security team.

You should consider the following initial accounts for this OU:

  •  Security Tooling account – This account houses general security tools as well as those security tools related to your data and ML workloads. For instance, you can use Amazon Macie within this account to help protect your data across all of your organization’s member accounts.
  • Log Archive account – If you deploy AWS Control Tower, this account is created by default and placed within your Security OU. This account is designed to centrally ingest and archive logs across your organization.

Infrastructure OU

Similar to other types of workloads that you can run on AWS, your data and ML workloads require infrastructure to operate correctly. The Infrastructure OU houses the accounts that maintain and distribute shared infrastructure services across your AWS environment. The accounts within this OU will be owned by the infrastructure, networking, or Cloud Center of Excellence (CCOE) teams.

The following are the initial accounts to consider for this OU:

  • Network account – To facilitate a scalable network architecture for data and ML workloads, it’s recommended to create a transit gateway within this account and share this transit gateway across your organization. This will allow for a hub and spoke network architecture that privately connects your VPCs in your multi-account environment and facilitates communication with on-premises resources if needed.
  • Shared Services account – This account hosts enterprise-level shared services such as AWS Managed Microsoft AD and AWS Service Catalog that you can use to facilitate the distribution of these shared services.

Workloads OU

The Workloads OU is intended to house the accounts that different teams within your platform use to create ML and data applications. In the case of an ML and data platform, you’ll use the following accounts:

  • ML team dev/test/prod accounts – Each ML team may have their own set of three accounts for the development, testing, and production stages of the MLOps lifecycle.
  • (Optional) ML central deployments – It’s also possible to have ML model deployments fully managed by an MLOps central team or ML CCOE. This team can handle the deployments for the entire organization or just for certain teams; either way, they get their own account for deployments.
  • Data lake account – This account is managed by data engineering or platform teams. There can be several data lake accounts organized by business domains. This is hosted in the Workloads OU.
  • Data governance account – This account is managed by data engineering or platform teams. This acts as the central governance layer for data access. This is hosted in the Workloads OU.

Deployments OU

The Deployments OU contains resources and workloads that support how you build, validate, promote, and release changes to your workloads. In the case of ML and data applications, this will be the OU where the accounts that host the pipelines and deployment mechanisms for your products will reside. These will include accounts like the following:

  • DevOps account – This hosts the pipelines to deploy extract, transform, and load (ETL) jobs and other applications for your enterprise cloud platform
  • ML shared services account – This is the main account for your platform ML engineers and the place where the portfolio of products related to model development and deployment are housed and maintained

If the same team managing the ML engineering resources is the one taking care of pipelines and deployments, then these two accounts may be combined into one. However, one team should be responsible for the resources in one account; the moment you have different independent teams taking care of these processes, the accounts should be different. This makes sure that a single team is accountable for the resources in its account, making it possible to have the right levels of billing, security, and compliance for each team.

Sandbox OU

The Sandbox OU typically contains accounts that map to an individual or teams within your organization and are used for proofs of concept. In the case of our ML platform, this can be cases of the platform and data scientist teams wanting to create proofs of concept with ML or data services. We recommend using synthetic data for proofs of concept and avoid using production data in Sandbox environments.

AWS Control Tower

AWS Control Tower enables you to quickly get started with the best practices for your ML platform. When you deploy AWS Control Tower, your multi-account AWS environment is initialized according to prescriptive best practices. AWS Control Tower configures and orchestrates additional AWS services, including Organizations, AWS Service Catalog, and AWS IAM Identity Center. AWS Control Tower helps you create a baseline landing zone, which is a well-architected multi-account environment based on security and compliance best practices. As a first step towards initializing your multi-account foundation, you should set up AWS Control Tower.

In the case of our ML platform, AWS Control Tower helps us with four basic tasks and configurations:

  • Organization structure – From the accounts and OUs that we discussed in the previous section, AWS Control Tower provides you with the Security and Sandbox OUs and the Security Tooling and Logging accounts.
  • Account vending – This enables you to effortlessly create new accounts that comply with your organization’s best practices at scale. It allows you to provide your own bootstrapping templates with AWS Service Catalog (as we discuss in the next sections).
  • Access management – AWS Control Tower integrates with IAM Identity Center, providing initial permissions sets and groups for the basic actions in your landing zone.
  • Controls – AWS Control Tower implements preventive, detective, and proactive controls that help you govern your resources and monitor compliance across groups of AWS accounts.

Access and identity with IAM Identity Center

After you establish your landing zone with AWS Control Tower and create the necessary additional accounts and OUs, the next step is to grant access to various users of your ML and data platform. Proactively determining which users will require access to specific accounts and outlining the reasons behind these decisions is recommended. Within IAM Identity Center, the concepts of groups, roles, and permission sets allows you to create fine-grained access for different personas within the platform.

Users can be organized into two primary groups: platform-wide and team-specific user groups. Platform-wide user groups encompass central teams such as ML engineering and landing zone security, and they are allocated access to the platform’s foundational accounts. Team-specific groups operate at the team level, denoted by roles such as team admins and data scientists. These groups are dynamic, and are established for new teams and subsequently assigned to their respective accounts upon provisioning.

The following table presents some example platform-wide groups.

| User group | Description | Permission set | Accounts |
| --- | --- | --- | --- |
| AWSControlTowerAdmins | Responsible for managing AWS Control Tower in the landing zone | AWSControlTowerAdmins and AWSSecurityAuditors | Management account |
| AWSNetworkAdmins | Manages the networking resources of the landing zone | NetworkAdministrator | Network account |
| AWSMLEngineers | Responsible for managing the ML central resources | PowerUserAccess | ML shared services account |
| AWSDataEngineers | Responsible for managing the data lake, ETL jobs, and data processes of the platform | PowerUserAccess | Data lake account |

The following table presents examples of team-specific groups.

| User group | Description | Permission set | Accounts |
| --- | --- | --- | --- |
| TeamLead | Group for the administrators of the team | AdministratorAccess | Team account |
| DataScientists | Group for data scientists; granted access to the team’s SageMaker domain | DataScientist | Team account |
| MLEngineers | The team may have other roles dedicated to certain specific tasks that relate to the matching platform-wide teams | MLEngineering | Team account |
| DataEngineers | (same description as MLEngineers) | DataEngineering | Team account |

AWS Control Tower automatically generates IAM Identity Center groups with permission set relationships for the various landing zone accounts it creates. You can use these preconfigured groups for your platform’s central teams or create new custom ones. For further insights into these groups, refer to IAM Identity Center Groups for AWS Control Tower. The following screenshot shows an example of the AWS Control Tower console, where you can view the accounts and determine which groups have permission on each account.

IAM Identity Center also provides a login page where landing zone users can get access to the different resources, such as accounts or SageMaker domains, with the different levels of permissions that you have granted them.
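
For illustration, the following is a minimal sketch of how a permission set could be created and assigned to a group for a team account using the IAM Identity Center (sso-admin) API. The instance ARN, account ID, group ID, and managed policy are placeholders; in practice, AWS Control Tower and your account vending process typically manage these assignments for you.

## Sketch: create a permission set and assign it to a group (hypothetical IDs)
import boto3

sso_admin = boto3.client("sso-admin")

INSTANCE_ARN = "arn:aws:sso:::instance/ssoins-EXAMPLE"  # IAM Identity Center instance
TEAM_ACCOUNT_ID = "111122223333"                        # team account
DATA_SCIENTISTS_GROUP_ID = "a1b2c3d4-example-group-id"  # identity store group ID

# Create a permission set for data scientists
permission_set = sso_admin.create_permission_set(
    Name="DataScientist",
    Description="Access for data scientists in team accounts",
    InstanceArn=INSTANCE_ARN,
    SessionDuration="PT8H",
)["PermissionSet"]

# Attach an AWS managed policy to the permission set
sso_admin.attach_managed_policy_to_permission_set(
    InstanceArn=INSTANCE_ARN,
    PermissionSetArn=permission_set["PermissionSetArn"],
    ManagedPolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)

# Assign the DataScientists group to the team account with this permission set
sso_admin.create_account_assignment(
    InstanceArn=INSTANCE_ARN,
    TargetId=TEAM_ACCOUNT_ID,
    TargetType="AWS_ACCOUNT",
    PermissionSetArn=permission_set["PermissionSetArn"],
    PrincipalType="GROUP",
    PrincipalId=DATA_SCIENTISTS_GROUP_ID,
)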

AWS Security Reference Architecture

The AWS SRA is a holistic set of guidelines for deploying the full complement of AWS security services in a multi-account environment. It can help you design, implement, and manage AWS security services so they align with AWS recommended practices.

To help scale security operations and apply security tools holistically across the organization, it’s recommended to use the AWS SRA to configure your desired security services and tools. You can use the AWS SRA to set up key security tooling services, such as Amazon GuardDuty, Macie, and AWS Security Hub. The AWS SRA allows you to apply these services across your entire multi-account environment and centralize the visibility these tools provide. In addition, when accounts get created in the future, you can use the AWS SRA to configure the automation required to scope your security tools to these new accounts.

The following diagram depicts the centralized deployment of the AWS SRA.

Scale your ML workloads with AWS Service Catalog

Within your organization, there will likely be different teams corresponding to different business units. These teams will have similar infrastructure and service needs, which may change over time. With AWS Service Catalog, you can scale your ML workloads by allowing IT administrators to create, manage, and distribute portfolios of approved products to end-users, who then have access to the products they need in a personalized portal. AWS Service Catalog has direct integrations with AWS Control Tower and SageMaker.

It’s recommended that you use AWS Service Catalog portfolios and products to enhance and scale the following capabilities within your AWS environment:

  • Account vending – The cloud infrastructure team should maintain a portfolio of account bootstrapping products within the shared infrastructure account. These products are templates that contain the basic infrastructure that should be deployed when an account is created, such as VPC configuration, standard IAM roles, and controls. This portfolio can be natively shared with AWS Control Tower and the management account, so that the products are directly used when creating a new account. For more details, refer to Provision accounts through AWS Service Catalog.
  • Analytics infrastructure self-service – This portfolio should be created and maintained by a central analytics team or the ML shared services team. This portfolio is intended to host templates to deploy different sets of analytics products to be used by the platform ML and analytics teams. It is shared with the entire Workloads OU (for more information, see Sharing a Portfolio). Examples of the products include a SageMaker domain configured according to the organization’s best practices or an Amazon Redshift cluster for the team to perform advanced analytics.
  • ML model building and deploying – This capability maps to two different portfolios, which are maintained by the platform ML shared services team:
    • Model building portfolio – This contains the products to build, train, evaluate, and register your ML models across all ML teams. This portfolio is shared with the Workloads OU and is integrated with SageMaker project templates.
    • Model deployment portfolio – This contains the products to deploy your ML models at scale in a reliable and consistent way. It will have products for different deployment types such as real-time inference, batch inference, and multi-model endpoints. This portfolio can be isolated within the ML shared services account by the central ML engineering team for a more centralized ML strategy, or shared with the Workloads OU accounts and integrated with SageMaker project templates to federate responsibility to the individual ML teams.

Let’s explore how we manage AWS Service Catalog products and portfolios in our platform. Both of the following architectures show an implementation that governs the AWS Service Catalog products using the AWS Cloud Development Kit (AWS CDK) and AWS CodePipeline. Each of the aforementioned portfolios has its own independent pipeline and code repository. The pipeline synthesizes the AWS CDK Service Catalog product constructs into actual AWS Service Catalog products and deploys them to the portfolios, which are then made available for consumption. For more details about the implementation, refer to Govern CI/CD best practices via AWS Service Catalog.
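
As a minimal sketch of this pattern (not the exact code from the referenced repository), the following AWS CDK (Python) snippet shows how a portfolio and one product could be defined; the construct names and template path are illustrative.

## Sketch: define an AWS Service Catalog portfolio and product with the AWS CDK (Python)
from aws_cdk import Stack, aws_servicecatalog as servicecatalog
from constructs import Construct

class AnalyticsPortfolioStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Portfolio maintained by the ML shared services team
        portfolio = servicecatalog.Portfolio(
            self, "AnalyticsPortfolio",
            display_name="analytics-self-service",
            provider_name="ML shared services team",
        )

        # Product that deploys a SageMaker domain from a CloudFormation template
        product = servicecatalog.CloudFormationProduct(
            self, "SageMakerDomainProduct",
            product_name="sagemaker-domain",
            owner="ML shared services team",
            product_versions=[
                servicecatalog.CloudFormationProductVersion(
                    product_version_name="v1",
                    cloud_formation_template=servicecatalog.CloudFormationTemplate.from_asset(
                        "templates/sagemaker_domain.yaml"  # illustrative path
                    ),
                )
            ],
        )
        portfolio.add_product(product)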

The following diagram illustrates the architecture for the account vending portfolio.

The workflow includes the following steps:

  1. The shared infrastructure account is set up with the pipeline to create the AWS Service Catalog portfolio.
  2. The CCOE or central infrastructure team can work on these products and customize them so that company networking and security requirements are met.
  3. You can use the AWS Control Tower Account Factory Customization (AFC) to integrate the portfolio within the account vending process. For more details, see Customize accounts with Account Factory Customization (AFC).
  4. To create a new account from the AFC, we use a blueprint. A blueprint is an AWS CloudFormation template that will be deployed in the newly created AWS account. For more information, see Create a customized account from a blueprint.

The following screenshot shows an example of what account creation with a blueprint looks like.

For the analytics and ML portfolios, the architecture changes the way these portfolios are used downstream, as shown in the following diagram.

The following are the key steps involved in building this architecture:

  1. The ML shared services account is set up and bootstrapped with the pipelines to create the two AWS Service Catalog portfolios.
  2. The ML CCOE or ML engineering team can work on these products and customize them so they’re up to date and cover the main use cases from the different business units.
  3. These portfolios are shared with the OU where the ML dev accounts will be located. For more information about the different options to share AWS Service Catalog portfolios, see Sharing a Portfolio.
  4. Sharing these portfolios with the entire Workloads OU will result in these two portfolios being available for use by the account team as soon as the account is provisioned.

After the architecture has been set up, account admins will see the AWS Service Catalog portfolios in their ML workload account after they log in. The portfolios are ready to use and can get the team up to speed quickly.

Network architecture

In our ML platform, we are considering two different major logical environments for our workloads: production and pre-production environments with corporate connectivity, and sandbox or development iteration accounts without corporate connectivity. These two environments will have different permissions and requirements when it comes to connectivity.

As your environment in AWS scales up, inter-VPC connectivity and on-premises VPC connectivity will need to scale in parallel. By using services such as Amazon Virtual Private Cloud (Amazon VPC) and AWS Transit Gateway, you can create a scalable network architecture that is highly available, secure, and compliant with your company’s best practices. You can attach each account to its corresponding network segment.

For simplicity, we create a transit gateway within the central network account for our production workloads; this will resemble a production network segment. This will create a hub and spoke VPC architecture that will allow our production accounts to do the following:

  • Enable inter-VPC communication between the different accounts.
  • Inspect traffic with centralized egress or ingress to the network segment.
  • Provide the environments with connectivity to on-premises data stores.
  • Create a centralized VPC endpoints architecture to reduce networking costs while maintaining private network compliance. For more details, see Centralized access to VPC private endpoints.

For more information about these types of architectures, refer to Building a Scalable and Secure Multi-VPC AWS Network Infrastructure.

The following diagram illustrates the recommended architecture for deploying your transit gateways and creating attachments to the VPCs within your accounts. Anything considered a production environment, whether it’s a workload or shared services account, is connected to the corporate network, while dev accounts have direct internet connectivity to speed up development and the exploration of new features.

At a high level, this architecture allows you to create different transit gateways within your network account for your desired AWS Regions or environments. Scalability is provided through the account vending functionality of AWS Control Tower, which deploys a CloudFormation stack to the accounts containing a VPC and the required infrastructure to connect to the environment’s corresponding network segment. For more information about this approach, see the AWS Control Tower Guide for Extending Your Landing Zone.

With this approach, whenever a team needs a new account, the platform team just needs to know whether it will be an account with corporate network connectivity or not. The corresponding blueprint is then selected to bootstrap the account, and the account is created. If it’s a corporate network account, the VPC will come with an attachment to the production transit gateway.
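
In the solution, the VPC and its transit gateway attachment are declared in the blueprint’s CloudFormation template; the following boto3 sketch, using hypothetical resource IDs, only illustrates the attachment that such a blueprint creates for a corporate network account.

## Sketch: attach a workload VPC to the production transit gateway (hypothetical IDs)
import boto3

ec2 = boto3.client("ec2")

ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId="tgw-0123456789abcdef0",  # production transit gateway shared from the network account
    VpcId="vpc-0123456789abcdef0",             # VPC vended into the new workload account
    SubnetIds=["subnet-0123456789abcdef0"],    # one subnet per Availability Zone is recommended
    TagSpecifications=[{
        "ResourceType": "transit-gateway-attachment",
        "Tags": [{"Key": "environment", "Value": "production"}],
    }],
)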

Conclusion

In this post, we discussed best practices for creating a multi-account foundation to support your analytics and ML workloads and configuring controls to help you implement governance early in your ML lifecycle. We provided a baseline recommendation for OUs and accounts you should consider creating using AWS Control Tower and blueprints. In addition, we showed how you can deploy security tools at scale using the AWS SRA, how to configure IAM Identity Center for centralized and federated access management, how to use AWS Service Catalog to package and scale your analytics and ML resources, and a best practice approach for creating a hub and spoke network architecture.

Use this guidance to get started in the creation of your own multi-account environment for governing your analytics and ML workloads at scale, and make sure you subscribe to the AWS Machine Learning Blog to receive updates regarding additional blog posts within this series.


About the authors

Alberto Menendez is a DevOps Consultant in Professional Services at AWS. He helps accelerate customers’ journeys to the cloud and achieve their digital transformation goals. In his free time, he enjoys playing sports, especially basketball and padel, spending time with family and friends, and learning about technology.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!

Liam Izar is a Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Liam has led multiple projects with customers migrating, transforming, and integrating data to solve business challenges. His core areas of expertise include technology strategy, data migrations, and machine learning. In his spare time, he enjoys boxing, hiking, and vacations with the family.


Video auto-dubbing using Amazon Translate, Amazon Bedrock, and Amazon Polly


This post is co-written with MagellanTV and Mission Cloud. 

Video dubbing, or content localization, is the process of replacing the original spoken language in a video with another language while synchronizing audio and video. Video dubbing has emerged as a key tool in breaking down linguistic barriers, enhancing viewer engagement, and expanding market reach. However, traditional dubbing methods are costly (about $20 per minute with human review effort) and time consuming, making them a common challenge for companies in the Media & Entertainment (M&E) industry. Video auto-dubbing that uses the power of generative artificial intelligence (generative AI) offers creators an affordable and efficient solution.

This post shows you a cost-saving solution for video auto-dubbing. We use Amazon Translate for initial translation of video captions and use Amazon Bedrock for post-editing to further improve the translation quality. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to help you build generative AI applications with security, privacy, and responsible AI.

MagellanTV, a leading streaming platform for documentaries, wants to broaden its global presence through content internationalization. Faced with manual dubbing challenges and prohibitive costs, MagellanTV sought out AWS Premier Tier Partner Mission Cloud for an innovative solution.

Mission Cloud’s solution distinguishes itself with idiomatic detection and automatic replacement, seamless automatic time scaling, and flexible batch processing capabilities with increased efficiency and scalability.

Solution overview

The following diagram illustrates the solution architecture. The inputs of the solution are specified by the user, including the folder path containing the original video and caption file, target language, and toggles for idiom detector and formality tone. You can specify these inputs in an Excel template and upload the Excel file to a designated Amazon Simple Storage Service (Amazon S3) bucket. This will launch the whole pipeline. The final outputs are a dubbed video file and a translated caption file.


We use Amazon Translate to translate the video caption, and Amazon Bedrock to enhance the translation quality and enable automatic time scaling to synchronize audio and video. We use Amazon Augmented AI for editors to review the content, which is then sent to Amazon Polly to generate synthetic voices for the video. To assign a gender expression that matches the speaker, we developed a model to predict the gender expression of the speaker.
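
As an illustrative sketch rather than the production pipeline code, the following shows how a translated caption segment could be synthesized into speech with Amazon Polly; the voice, language, and file names are assumptions.

## Sketch: synthesize a dubbed audio segment with Amazon Polly (illustrative settings)
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Engine="neural",
    LanguageCode="de-DE",
    VoiceId="Vicki",   # a German neural voice; in the pipeline, the voice is chosen per predicted gender expression
    OutputFormat="mp3",
    Text="[Sprecher 1] Lassen Sie mich Ihnen etwas zeigen.",
)

# Write the returned audio stream to a local file for this segment
with open("segment_001.mp3", "wb") as f:
    f.write(response["AudioStream"].read())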

In the backend, AWS Step Functions orchestrates the preceding steps as a pipeline. Each step is run on AWS Lambda or AWS Batch. By using the infrastructure as code (IaC) tool, AWS CloudFormation, the pipeline becomes reusable for dubbing new foreign languages.

In the following sections, you will learn how to use the unique features of Amazon Translate for setting formality tone and for custom terminology. You will also learn how to use Amazon Bedrock to further improve the quality of video dubbing.

Why choose Amazon Translate?

We chose Amazon Translate to translate video captions based on three factors.

  • Amazon Translate supports over 75 languages. While the landscape of large language models (LLMs) has continuously evolved in the past year and continues to change, many of the trending LLMs support a smaller set of languages.
  • Our translation professional rigorously evaluated Amazon Translate in our review process and affirmed its commendable translation accuracy. Welocalize benchmarks the performance of using LLMs and machine translations and recommends using LLMs as a post-editing tool.
  • Amazon Translate has various unique benefits. For example, you can add custom terminology glossaries, while for LLMs, you might need fine-tuning that can be labor-intensive and costly.

Use Amazon Translate for custom terminology

Amazon Translate allows you to input a custom terminology dictionary, ensuring translations reflect the organization’s vocabulary or specialized terminology. We use the custom terminology dictionary to compile frequently used terms within video transcription scripts.

Here’s an example. In a documentary video, the caption file would typically display “(speaking in foreign language)” as the on-screen caption when the interviewee speaks in a foreign language. The phrase “(speaking in foreign language)” isn’t a grammatically complete English sentence: it lacks a subject, yet it’s commonly accepted as an English caption. When translating the caption into German, the translation also lacks the subject, which can be confusing to German audiences, as shown in the code block that follows.

## Translate - without custom terminology (default)
import boto3
# Initialize a session of Amazon Translate
translate=boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
def translate_text(text, source_lang, target_lang):
    result=translate.translate_text(
        Text=text, 
        SourceLanguageCode=source_lang, 
        TargetLanguageCode=target_lang)
    return result.get('TranslatedText')
text="(speaking in a foreign language)"
output=translate_text(text, "en", "de")
print(output)
# Output: (in einer Fremdsprache sprechen)

Because this phrase “(speaking in foreign language)” is commonly seen in video transcripts, we added this term to the custom terminology CSV file translation_custom_terminology_de.csv with the vetted translation and provided it in the Amazon Translate job. The translation output is as intended as shown in the following code.

## Translate - with custom terminology
import boto3
import json
# Initialize a session of Amazon Translate
translate=boto3.client('translate')
with open('translation_custom_terminology_de.csv', 'rb') as ct_file:
    translate.import_terminology(
        Name='CustomTerminology_boto3',
        MergeStrategy='OVERWRITE',
        Description='Terminology for Demo through boto3',
        TerminologyData={
            'File':ct_file.read(),
            'Format':'CSV',
            'Directionality':'MULTI'
        }
    )
text="(speaking in foreign language)"
result=translate.translate_text(
    Text=text,
    TerminologyNames=['CustomTerminology_boto3'], 
    SourceLanguageCode="en",
    TargetLanguageCode="de"
)
print(result['TranslatedText'])
# Output: (Person spricht in einer Fremdsprache)

Set formality tone in Amazon Translate

Some documentary genres tend to be more formal than others. Amazon Translate allows you to define the desired level of formality for translations to supported target languages. By using the default setting (Informal) of Amazon Translate, the translation output in German for the phrase, “[Speaker 1] Let me show you something,” is informal, according to a professional translator.

## Translate - with informal tone (default) 
import boto3
# Initialize a session of Amazon Translate
translate=boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
def translate_text(text, source_lang,target_lang):
    result=translate.translate_text(
        Text=text, 
        SourceLanguageCode=source_lang, 
        TargetLanguageCode=target_lang)
    return result.get('TranslatedText')
text="[Speaker 1] Let me show you something."
output=translate_text(text, "en", "de")
print(output)
# Output: [Sprecher 1] Lass mich dir etwas zeigen.

By adding the Formal setting, the output translation has a formal tone, which fits the documentary’s genre as intended.

## Translate - with formal tone 
import boto3
# Initialize a session of Amazon Translate
translate=boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
def translate_text(text, source_lang, target_lang):
    result=translate.translate_text(
        Text=text, 
        SourceLanguageCode=source_lang, 
        TargetLanguageCode=target_lang,
        Settings={'Formality':'FORMAL'})
    return result.get('TranslatedText')
text="[Speaker 1] Let me show you something."
output=translate_text(text, "en", "de")
print(output)
# Output: [Sprecher 1] Lassen Sie mich Ihnen etwas zeigen.

Use Amazon Bedrock for post-editing

In this section, we use Amazon Bedrock to improve the quality of video captions after we obtain the initial translation from Amazon Translate.

Idiom detection and replacement

Idiom detection and replacement is vital in dubbing English videos to accurately convey cultural nuances. Adapting idioms prevents misunderstandings, enhances engagement, preserves humor and emotion, and ultimately improves the global viewing experience. Hence, we developed an idiom detection function using Amazon Bedrock to resolve this issue.

You can turn the idiom detector on or off by specifying the inputs to the pipeline. For example, for science genres that have fewer idioms, you can turn the idiom detector off, while for genres with more casual conversations, you can turn it on. For a 25-minute video, the total processing time is about 1.5 hours, of which about 1 hour is spent on video preprocessing and video composing. Turning the idiom detector on adds only about 5 minutes to the total processing time.

We developed a function bedrock_api_idiom to detect and replace idioms using Amazon Bedrock. The function first uses Amazon Bedrock LLMs to detect idioms in the text and then replaces them. In the example that follows, Amazon Bedrock successfully detects and replaces the input text “well, I hustle” with “I work hard,” which can then be translated correctly into Spanish by using Amazon Translate.

## A rare idiom is well-detected and rephrased by Amazon Bedrock 
text_rephrased=bedrock_api_idiom(text)
print(text_rephrased)
# Output: I work hard
response=translate_text(text_rephrased, "en", "es-MX")
print(response)
# Output: yo trabajo duro
response=translate_text(response, "es-MX", "en")
print(response)
# Output: I work hard
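
The following is a possible sketch of what bedrock_api_idiom could look like, assuming an Anthropic Claude model on Amazon Bedrock; the model ID and prompt wording are illustrative, not the exact implementation used in the pipeline.

## Sketch: idiom detection and replacement with Amazon Bedrock (illustrative model and prompt)
import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def bedrock_api_idiom(text):
    prompt = (
        "Rewrite the following caption so that any idioms or slang are replaced "
        "with plain, literal English. Return only the rewritten caption.\n\n"
        f"Caption: {text}"
    )
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"].strip()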

Sentence shortening

Third-party video dubbing tools can be used for time-scaling during video dubbing, which can be costly if done manually. In our pipeline, we used Amazon Bedrock to develop a sentence shortening algorithm for automatic time scaling.

For example, a typical caption file consists of a section number, timestamp, and the sentence. The following is an example of an English sentence before shortening.

Original sentence:

A large portion of the solar energy that reaches our planet is reflected back into space or absorbed by dust and clouds.


Here’s the shortened sentence produced by the sentence shortening algorithm. Using Amazon Bedrock, we can significantly improve the video-dubbing performance and reduce the human review effort, resulting in cost savings.

Shortened sentence:

A large part of solar energy is reflected into space or absorbed by dust and clouds.


Conclusion

This new and constantly developing pipeline has been a revolutionary step for MagellanTV because it efficiently resolved some challenges they were facing that are common within Media & Entertainment companies in general. The unique localization pipeline developed by Mission Cloud creates a new frontier of opportunities to distribute content across the world while saving on costs. Using generative AI in tandem with brilliant solutions for idiom detection and resolution, sentence length shortening, and custom terminology and tone results in a truly special pipeline bespoke to MagellanTV’s growing needs and ambitions.

If you want to learn more about this use case or have a consultative session with the Mission team to review your specific generative AI use case, feel free to request one through AWS Marketplace.


About the Authors

Na Yu is a Lead GenAI Solutions Architect at Mission Cloud, specializing in developing ML, MLOps, and GenAI solutions in AWS Cloud and working closely with customers. She received her Ph.D. in Mechanical Engineering from the University of Notre Dame.

Max Goff is a data scientist/data engineer with over 30 years of software development experience. A published author, blogger, and music producer, he sometimes dreams in A.I.

Marco Mercado is a Sr. Cloud Engineer specializing in developing cloud native solutions and automation. He holds multiple AWS Certifications and has extensive experience working with high-tier AWS partners. Marco excels at leveraging cloud technologies to drive innovation and efficiency in various projects.

Yaoqi Zhang is a Senior Big Data Engineer at Mission Cloud. She specializes in leveraging AI and ML to drive innovation and develop solutions on AWS. Before Mission Cloud, she worked as an ML and software engineer at Amazon for six years, specializing in recommender systems for Amazon fashion shopping and NLP for Alexa. She received her Master of Science Degree in Electrical Engineering from Boston University.

Adrian Martin is a Big Data/Machine Learning Lead Engineer at Mission Cloud. He has extensive experience in English/Spanish interpretation and translation.

Ryan Ries holds over 15 years of leadership experience in data and engineering, over 20 years of experience working with AI and 5+ years helping customers build their AWS data infrastructure and AI models. After earning his Ph.D. in Biophysical Chemistry at UCLA and Caltech, Dr. Ries has helped develop cutting-edge data solutions for the U.S. Department of Defense and a myriad of Fortune 500 companies.

Andrew Federowicz is the IT and Product Lead Director for Magellan VoiceWorks at MagellanTV. With a decade of experience working in cloud systems and IT in addition to a degree in mechanical engineering, Andrew designs, builds, deploys, and scales inventive solutions to unique problems. Before Magellan VoiceWorks, Andrew architected and built the AWS infrastructure for MagellanTV’s 24/7 globally available streaming app. In his free time, Andrew enjoys sim racing and horology.

Qiong Zhang, PhD, is a Sr. Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI. She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.

Cristian Torres is a Sr. Partner Solutions Architect at AWS. He has 10 years of experience working in technology performing several roles such as: Support Engineer, Presales Engineer, Sales Specialist and Solutions Architect. He works as a generalist with AWS services focusing on Migrations to help strategic AWS Partners develop successfully from a technical and business perspective.


How Mixbook used generative AI to offer personalized photo book experiences


This post is co-written with Vlad Lebedev and DJ Charles from Mixbook.

Mixbook is an award-winning design platform that gives users unrivaled creative freedom to design and share one-of-a-kind stories, transforming the lives of more than six million people. Today, Mixbook is the #1 rated photo book service in the US with 26 thousand five-star reviews.

Mixbook is empowering users to share their stories with creativity and confidence. Their mission is to assist users in celebrating the beautiful moments of their lives. Mixbook aims to foster the profound connections between users and their loved ones through sharing of their stories in both physical and digital mediums.

Years ago, Mixbook undertook a strategic initiative to transition their operational workloads to Amazon Web Services (AWS), a move that has continually yielded significant advantages. This pivotal decision has been instrumental in propelling them towards fulfilling their mission, ensuring their system operations are characterized by reliability, superior performance, and operational efficiency.

In this post we show you how Mixbook used generative artificial intelligence (AI) capabilities in AWS to personalize their photo book experiences—a step towards their mission.

Business Challenge

In today’s digital world, we have a lot of pictures that we take and share with our friends and family. Let’s consider a scenario where we have hundreds of photos from a recent family vacation, and we want to create a coffee-table photo-book to make it memorable. However, choosing the best pictures from the lot and describing them with captions can take a lot of time and effort. As we all know, a picture’s worth a thousand words, which is why trying to sum up a moment with a caption of just six to ten words can be so challenging. Mixbook really gets the problem, and they’re here to fix it.

Solution

Mixbook Smart Captions is the magical solution to the caption conundrum. It doesn’t only interpret user photos; it also adds a sprinkle of creativity, making the stories pop.

Most importantly, Smart Captions doesn’t fully automate the creative process. Instead, it provides a creative partner to enable the user’s own storytelling to imbue a book with personal flourishes. Whether it’s a selfie or a scenic shot, the goal is to make sure users’ photos speak volumes, effortlessly.

Architecture overview

The implementation of the system involves three primary components:

  • Data intake
  • Information inference
  • Creative synthesis

Caption generation is heavily reliant on the inference process, because the quality and meaningfulness of the comprehension process output directly influence the specificity and personalization of the caption generation. The following is the data flow diagram of the caption generation process, which is described in the text that follows.

Data intake

A user uploads photos into Mixbook. The raw photos are stored in Amazon Simple Storage Service (Amazon S3).

The data intake process involves three macro components: Amazon Aurora MySQL-Compatible Edition, Amazon S3, and AWS Fargate for Amazon ECS. Aurora MySQL serves as the primary relational data storage solution for tracking and recording media file upload sessions and their accompanying metadata. It offers flexible capacity options, ranging from serverless on one end to reserved provisioned instances for predictable long-term use on the other. Amazon S3, in turn, provides efficient, scalable, and secure storage for the media file objects themselves. Its storage classes enable the maintenance of recent uploads in a warm state for low-latency access, while older objects can be transitioned to Amazon S3 Glacier tiers, thus minimizing storage expenses over time. Amazon Elastic Container Service (Amazon ECS), when used in conjunction with the low-maintenance compute environment of AWS Fargate, forms a convenient orchestrator for containerized workloads, bringing all components together seamlessly.

Inference

The comprehension phase extracts essential contextual and semantic elements from the input, including image descriptions, temporal and spatial data, facial recognition, emotional sentiment, and labels. Among these, the image descriptions generated by a computer vision model offer the most fundamental understanding of the captured moments. Amazon Rekognition delivers precise detection of face bounding boxes and emotional expressions. The detected face bounding boxes are primarily used for optimal automatic photo placement and cropping, while the detected emotions help select a better tone for the story, to make it funnier or more nostalgic (for example). Furthermore, Amazon Rekognition enhances safety by identifying potentially objectionable content.
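
The following is a minimal sketch, with an assumed S3 bucket and key, of how face bounding boxes and emotions could be retrieved with Amazon Rekognition.

## Sketch: detect face bounding boxes and emotions with Amazon Rekognition (assumed S3 location)
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_faces(
    Image={"S3Object": {"Bucket": "user-uploads-bucket", "Name": "vacation/photo-001.jpg"}},
    Attributes=["ALL"],  # include emotions, not just bounding boxes
)

for face in response["FaceDetails"]:
    box = face["BoundingBox"]  # used for automatic photo placement and cropping
    top_emotion = max(face["Emotions"], key=lambda e: e["Confidence"])  # used for tone selection
    print(box, top_emotion["Type"], round(top_emotion["Confidence"], 1))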

The inference pipeline is powered by an AWS Lambda-based multi-step architecture, which maximizes cost-efficiency and elasticity by running independent image analysis steps in parallel. AWS Step Functions enables the synchronization and ordering of interdependent steps.
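
As a simplified sketch with illustrative state and function names, a Parallel state in the Step Functions definition could fan out the independent analysis steps as follows.

## Sketch: Step Functions definition (Amazon States Language as a Python dict) with a Parallel fan-out
analysis_state_machine = {
    "StartAt": "AnalyzeImage",
    "States": {
        "AnalyzeImage": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "DetectFaces",
                 "States": {"DetectFaces": {"Type": "Task",
                                            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:detect-faces",
                                            "End": True}}},
                {"StartAt": "DetectLabels",
                 "States": {"DetectLabels": {"Type": "Task",
                                             "Resource": "arn:aws:lambda:us-east-1:111122223333:function:detect-labels",
                                             "End": True}}},
            ],
            "Next": "GenerateCaptions",
        },
        "GenerateCaptions": {"Type": "Task",
                             "Resource": "arn:aws:lambda:us-east-1:111122223333:function:generate-captions",
                             "End": True},
    },
}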

The image captions are generated by an Amazon SageMaker inference endpoint, which is enhanced by an Amazon ElastiCache for Redis-powered buffer. The buffer was implemented after benchmarking the captioning model’s performance. The benchmarking revealed that the model performed optimally when processing batches of images, but underperformed when analyzing individual images.

Generation

The caption-generating mechanism behind the writing assistant feature is what turns Mixbook Studio into a natural language story-crafting tool. Powered by a Llama language model, the assistant initially used carefully engineered prompts created by AI experts. However, the Mixbook Storyarts team sought more granular control over the style and tone of the captions, leading to a diverse team that included an Emmy-nominated scriptwriter reviewing, adjusting, and adding unique handcrafted examples. This resulted in a process of fine-tuning the model, moderating modified responses, and deploying approved models for experimental and public releases. After inference, three captions are created and stored in Amazon Relational Database Service (Amazon RDS).

The following image shows the Mixbook Smart Captions feature in Mixbook Studio.

Benefits

Mixbook implemented this solution to provide new features to their customers. It provided an improved user experience with operational efficiency.

User experience

  • Enhanced storytelling: Captures the users’ emotions and experiences, now beautifully expressed through captions that are heartfelt.
  • User delight: Adds an element of surprise with captions that aren’t just accurate, but also delightful and imaginative. A delighted user, Hanie U, says, “I hope there are more captions experiences released in the future.” Another user, Megan P., says, “It worked great!” Users can also edit the generated captions.
  • Time efficiency: Nobody has the time to struggle with captions. The feature saves precious time while making user stories shine bright.
  • Safety and correctness: The captions were generated responsibly, using guardrails to ensure content moderation and relevancy.

System

  • Elasticity and scalability of Lambda
  • Comprehensible workflow orchestration with Step Functions
  • Variety of base models from SageMaker and tuning capabilities for maximum control

As a result of their improved user delight, Mixbook has been named as an official honoree of the Webby Awards in 2024 for Apps & Software Best Use of AI & Machine Learning.

“AWS enables us to scale the innovations our customers love most. And now, with the new AWS generative AI capabilities, we are able to blow our customers’ minds with creative power they never thought possible. Innovations like this are why we’ve been partnered with AWS since the beta in 2006.”

– Andrew Laffoon, CEO, Mixbook

Conclusion

Mixbook started experimenting with AWS generative AI solutions to augment their existing application in early 2023. They started with a quick proof of concept to show the art of the possible. Continuous development, testing, and integration using the breadth of AWS services in compute, storage, analytics, and machine learning allowed them to iterate quickly. After they released the Smart Captions feature in beta, they were able to quickly adjust according to real-world usage patterns and protect the product’s value.

Try out Mixbook Studio to experience the storytelling. To learn more about AWS generative AI solutions, start with Transform your business with generative AI. To hear more from Mixbook leaders, listen to the AWS re:Think Podcast available from Art19, Apple Podcasts, and Spotify.


About the authors

Vlad Lebedev is a Senior Technology Leader at Mixbook. He leads a product-engineering team responsible for transforming Mixbook into a place for heartfelt storytelling. He draws on over a decade of hands-on experience in web development, system design, and data engineering to drive elegant solutions for complex problems. Vlad enjoys learning about both contemporary and ancient cultures, their histories, and languages.

DJ Charles is the CTO at Mixbook. He has enjoyed a 30-year career architecting interactive and e-commerce designs for top brands. Innovating broadband tech for the cable industry in the ’90s, revolutionizing supply-chain processes in the 2000s, and advancing environmental tech at Perillon led to global real-time bidding platforms for brands like Sotheby’s & eBay. Beyond tech, DJ loves learning new musical instruments, the art of songwriting, and deeply engages in music production & engineering in his spare time.

Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies. She brings a breadth of expertise in Data Analytics and Machine Learning. Prior to joining AWS, she was architecting data solutions in financial industries. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.

Jessica Oliveira is an Account Manager at AWS who provides guidance and support to Commercial Sales in Northern California. She is passionate about building strategic collaborations to help ensure her customers’ success. Outside of work, she enjoys traveling, learning about different languages and cultures, and spending time with her family.


Using Agents for Amazon Bedrock to interactively generate infrastructure as code


In the diverse toolkit available for deploying cloud infrastructure, Agents for Amazon Bedrock offers a practical and innovative option for teams looking to enhance their infrastructure as code (IaC) processes. Agents for Amazon Bedrock automates the prompt engineering and orchestration of user-requested tasks. After being configured, an agent builds the prompt and augments it with your company-specific information to provide responses back to the user in natural language.

This solution shows how Amazon Bedrock agents can be configured to accept cloud architecture diagrams, automatically analyze them, and generate Terraform or AWS CloudFormation templates. This solution uses Retrieval Augmented Generation (RAG) to ensure the generated scripts adhere to organizational needs and industry standards. A key feature is the agent’s ability to dynamically interact with users. During the IaC generation process, Amazon Bedrock agents actively probe for additional information by analyzing the provided diagrams and querying the user to fill any gaps. This interaction allows for a more tailored and precise IaC configuration.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this blog post, we explore how Agents for Amazon Bedrock can be used to generate customized, organization standards-compliant IaC scripts directly from uploaded architecture diagrams. This will help accelerate deployments, reduce errors, and ensure adherence to security guidelines.

Solution overview

Before we explore the deployment process, let’s walk through the key steps of the architecture as illustrated in Figure 1.

Figure 1 : High level overview of creating Infrastructure as Code from architecture diagram

  1. Initial Input through the Amazon Bedrock chat console: The user begins by entering the name of their Amazon Simple Storage Service (Amazon S3) bucket and the object (key) name where the architecture diagram is stored into the Amazon Bedrock chat console. For instance, if an architecture diagram is saved as s3://testbucket/architecturediagram.png, the user will enter testbucket as the S3 bucket name and architecturediagram.png as the object name.
  2. Diagram analysis and query generation: The Amazon Bedrock agent forwards the architecture diagram location to an action group that invokes an AWS Lambda function. This function retrieves the architecture diagram from the specified S3 bucket, analyzes it using an Amazon Bedrock model, and produces a summary of the diagram. It also generates questions regarding any missing components, dependencies, or parameter values that are needed to create IaC for AWS services. This detailed response is then sent back to the agent.
  3. Interaction and user confirmation: The agent displays the generated questions to the user and records their responses. Next, the agent provides a comprehensive summary of the architecture diagram along with additional inputs provided by the user. Users then have the opportunity to approve this configuration or suggest any necessary adjustments. On receiving confirmation from the user, the agent passes this information to the second action group to generate IaC.
  4. IaC generation and deployment: The second action group invokes a Lambda function that processes the user’s input data along with organization-specific coding guidelines from Knowledge Bases for Amazon Bedrock to create the IaC. After being generated, the IaC is automatically pushed to a designated GitHub repository.

Prerequisites

You should have the following:

Deployment steps

The solution can be used to create IaC (using Terraform or CloudFormation) by inputting the architecture diagram. For the purpose of this blog post, we focus on creating Terraform IaC. There are four steps to deploy the solution.

Step 1: Configure an Amazon Bedrock knowledge base: Configuring a knowledge base (KB) enables you to access information about organization standard Terraform modules. Follow these steps to set up your KB:

  1. Sign in and go to the AWS Management Console for Amazon Bedrock. Go directly to the Knowledge Base section. This is your starting point for creating a new KB.
  2. Enter a clear and descriptive name that reflects the purpose of your KB, such as Terraform KB.
  3. Assign a pre-configured IAM role with the necessary permissions. It’s typically best to let Amazon Bedrock create this role for you to ensure it has the correct permissions.
  4. Define the data sources by uploading a JSON file to an S3 bucket with encryption enabled for security. This file should contain a structured list of AWS services and Terraform modules. For the JSON structure, use the example provided in the repository.
  5. Choose the default embeddings model. For most use cases, the Amazon Titan Embeddings G1 – Text model will suffice. It’s pre-configured and ready to use, simplifying the process.
  6. Use the managed vector store to allow Amazon Bedrock to create and manage the vector store for you in Amazon OpenSearch Service.
  7. Select the KB and, in the Data source section, choose Sync to begin data ingestion. When data ingestion completes successfully, a green success banner appears.
  8. Double-check all entered information for accuracy. Pay special attention to the S3 bucket URI and IAM role details.

Step 2: Configure the Bedrock agent:

  1. Open the Amazon Bedrock console, select Agents in the left navigation panel, then choose Create Agent.
  2. Enter agent details including agent name and description (optional).
  3. Next, grant the agent permissions to AWS services through the IAM service role. This gives your agent access to required services, such as Lambda.
  4. Select a foundation model from Amazon Bedrock (for example, Anthropic Claude 3 Sonnet).
  5. To create Terraform code using Agents for Amazon Bedrock, attach the following instruction to the agent:

“Assist users in creating IaC for provided architecture diagram. Ask user for S3 bucket name and object name where the diagram is stored. Upon receiving the information, run analysis-query action group. Give structured summary and ask user only the questions that are received from action group response. Take the answers from the user and give detailed summary to the user. Take approval from user. When approved, give all that information to final draft along with S3 bucket name, object name as input for the iac-deployment action group and run the action group.”

Step 3: Configure agent action groups: After the initial agent configuration and adding the preceding instruction to the agent, two action groups need to be added to the agent to create Terraform IaC from an architecture diagram.

  1. Create an action group linked to a Lambda function (for creating a Lambda function, see Getting started with Lambda) that is designed to analyze the architecture diagram and generate questions related to any missing components, dependencies, or parameter values necessary for IaC creation of AWS services. This group is invoked by the agent following the user’s input of S3 bucket and object details. The responses are then relayed back to the agent, which conducts an interactive session to collect any missing information from the user. See the Lambda code and OpenAPI schema in the repository, and the minimal handler sketch after this list.
  2. Establish a second action group tied to a different Lambda function responsible for creating the Terraform code and uploading it to a GitHub repository. This group is invoked only after the user has reviewed and approved the infrastructure configuration. See Lambda code and OpenAPI-schema in the repository.
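
For orientation, the following is a minimal sketch of the request handling and response shape that an action group Lambda function returns to the agent when using an OpenAPI schema; the parameter names and placeholder logic are illustrative, and the actual implementations are in the repository referenced above.

## Sketch: action group Lambda handler response format for Agents for Amazon Bedrock (illustrative logic)
import json

def lambda_handler(event, context):
    # The agent passes the API path and the parameters defined in the OpenAPI schema
    api_path = event.get("apiPath")
    parameters = {p["name"]: p["value"] for p in event.get("parameters", [])}

    # Placeholder business logic: analyze the diagram in S3 and build follow-up questions
    result = {
        "summary": f"Analyzed diagram s3://{parameters.get('bucket')}/{parameters.get('key')}",
        "questions": ["Which AWS Region should the resources be deployed to?"],
    }

    # Response structure expected by Agents for Amazon Bedrock for OpenAPI-based action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": api_path,
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(result)}},
        },
    }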

Step 4: Add the action groups to the agent:

  1. Assign a descriptive name to each action group and detail their functions in the description fields. This helps clarify the purpose of each group within the workflow.
  2. For each action group, select the appropriate Lambda functions that you set up previously. These functions run the business logic required when an action is invoked. Make sure to choose the correct version of each Lambda function. For additional details, see the section on Action Group Lambda Functions.
  3. Provide the Amazon S3 URI that links to the API schema for each action group. This schema should include the API’s description, structure, and parameters. The API is crucial for managing the workflow, such as receiving user inputs, invoking Lambda functions to run the process, validating inputs, initiating Terraform module creation, and monitoring the provisioning status. For further guidance, see the section on Action Group OpenAPI Schemas.

The following screenshot shows an example of the user interaction with Agents for Amazon Bedrock.

The following screenshot shows an example of the Terraform output.

Clean up

The services used in this demonstration can incur costs. Complete the following steps to clean up your resources:

  1. Delete the Lambda functions if they’re no longer required.
  2. Delete action groups and Amazon Bedrock agent that were created.
  3. Empty and delete the S3 bucket used for storing the architecture diagram.
  4. Remove the generated Terraform scripts from the GitHub repo.
  5. Delete the Amazon Bedrock knowledge base if it’s no longer needed.

Conclusion

Agents for Amazon Bedrock uses generative AI to transform architecture diagrams into compliant infrastructure as code (IaC) scripts for AWS deployments, such as Terraform and AWS CloudFormation. This capability is a crucial tool for engineers transitioning to the cloud, speeding up the cloud adoption process while ensuring that deployments adhere to established best practices from the start.

Through the interactive features of Agents for Amazon Bedrock, the automation of IaC generation not only streamlines the initial setup but also significantly improves ongoing operations like infrastructure management. Although this post concentrates on IaC creation, the interactive capabilities of Agents for Amazon Bedrock can be used across various AWS services, providing a dynamic and comprehensive solution for managing and optimizing cloud infrastructure.

Are you ready to streamline your cloud deployment process with the generative AI of Amazon Bedrock? Start by delving into the Amazon Bedrock User Guide to see how it can facilitate your organization’s transition to the cloud. For specialized assistance, consider engaging with AWS Professional Services to maximize the efficiency and benefits of using Amazon Bedrock. Embrace the potential for a swift, secure, and efficient cloud transformation with Amazon Bedrock. Take the first step today and discover how using generative AI can revolutionize your approach to cloud infrastructure.


About the Author

Akhil Raj Yallamelli is a Cloud Infrastructure Architect at AWS, specializing in optimizing cloud infrastructures for enhanced data security and cost efficiency. He skillfully integrates technical solutions with business strategies to create scalable, reliable, and secure cloud environments. Akhil builds technical solutions focusing on customer business outcomes, incorporating generative AI (Gen AI) technologies to drive innovation. With deep expertise in AWS and a strong background in DevOps methodologies throughout the software development life cycle (SDLC), Akhil leads critical implementation and migration projects. He holds an MS degree in Computer Science. Outside of his professional work, Akhil enjoys watching and playing sports.

Ebbey Thomas specializes in strategizing and developing custom AWS Landing Zones with a focus on leveraging Generative AI to enhance cloud infrastructure automation. In his role at AWS Professional Services, Ebbey’s expertise is central to architecting solutions that streamline cloud adoption, ensuring a secure and efficient operational framework for AWS users. He is known for his innovative approach to cloud challenges and his commitment to driving forward the capabilities of cloud services.


Improve RAG accuracy with fine-tuned embedding models on Amazon SageMaker


Retrieval Augmented Generation (RAG) is a popular paradigm that provides additional knowledge to large language models (LLMs) from an external source of data that wasn’t present in their training corpus.

RAG provides additional knowledge to the LLM through its input prompt space, and its architecture typically consists of the following components:

  • Indexing: Prepare a corpus of unstructured text, parse and chunk it, and then, embed each chunk and store it in a vector database.
  • Retrieval: Retrieve context relevant to answering a question from the vector database using vector similarity. Use prompt engineering to provide this additional context to the LLM along with the original question. The LLM will then use the original question and the context from the vector database to generate an answer based on data that wasn’t part of its training corpus.

Challenges in RAG accuracy

Pre-trained embedding models are typically trained on large, general-purpose datasets like Wikipedia or web-crawl data. While these models capture a broad range of semantic relationships and can generalize well across various tasks, they might struggle to accurately represent domain-specific concepts and nuances. This limitation can lead to suboptimal performance when using these pre-trained embeddings for specialized tasks or domains, such as legal, medical, or technical domains. Furthermore, pre-trained embeddings might not effectively capture the contextual relationships and nuances that are specific to a particular task or domain. For example, in the legal domain, the same term can have different meanings or implications depending on the context, and these nuances might not be adequately represented in a general-purpose embedding model.

To address the limitations of pre-trained embeddings and improve the accuracy of RAG systems for specific domains or tasks, it’s essential to fine tune the embedding model on domain-specific data. By fine tuning the model on data that is representative of the target domain or task, the model can learn to capture the relevant semantics, jargon, and contextual relationships that are crucial for that domain.

Domain-specific embeddings can significantly improve the quality of vector representations, leading to more accurate retrieval of relevant context from the vector database. This, in turn, enhances the performance of the RAG system in terms of generating more accurate and relevant responses.

This post demonstrates how to use Amazon SageMaker to fine tune a Sentence Transformer embedding model and deploy it with an Amazon SageMaker Endpoint. The code from this post and more examples are available in the GitHub repo. For more information about fine tuning Sentence Transformer, see Sentence Transformer training overview.

Fine tuning embedding models using SageMaker

SageMaker is a fully managed machine learning service that simplifies the entire machine learning workflow, from data preparation and model training to deployment and monitoring. It provides a seamless and integrated environment that abstracts away the complexities of infrastructure management, allowing developers and data scientists to focus solely on building and iterating their machine learning models.

One of the key strengths of SageMaker is its native support for popular open source frameworks such as TensorFlow, PyTorch, and Hugging Face Transformers. This integration enables seamless model training and deployment using these frameworks, along with their powerful capabilities and extensive ecosystems of libraries and tools.

SageMaker also offers a range of built-in algorithms for common use cases like computer vision, natural language processing, and tabular data, making it easy to get started with pre-built models for various tasks. It also supports distributed training and hyperparameter tuning, allowing for efficient and scalable model training.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Steps to fine tune embedding models on Amazon SageMaker

In the following sections, we use a SageMaker JupyterLab to walk through the steps of data preparation, creating a training script, training the model, and deploying it as a SageMaker endpoint.

We will fine tune the sentence-transformers/all-MiniLM-L6-v2 embedding model, an open source Sentence Transformers model fine tuned on a dataset of 1B sentence pairs. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. To fine tune it, we will use the Amazon Bedrock FAQs, a dataset of question and answer pairs, with the MultipleNegativesRankingLoss function.

In Losses, you can find the different loss functions that can be used to fine-tune embedding models on training data. The choice of loss function plays a critical role when fine tuning the model. It determines how well our embedding model will work for the specific downstream task.

The MultipleNegativesRankingLoss function is recommended when you only have positive pairs in your training data, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of query and response, or pairs of (source_language and target_language).

In our case, considering that we’re using Amazon Bedrock FAQs as training data, which consists of pairs of questions and answers, the MultipleNegativesRankingLoss function could be a good fit.

The following code snippet demonstrates how to load a training dataset from a JSON file, prepare the data for training, and then fine tune the pre-trained model. After fine tuning, the updated model is saved.

The EPOCHS variable determines the number of times the model will iterate over the entire training dataset during the fine-tuning process. A higher number of epochs typically leads to better convergence and potentially improved performance but might also increase the risk of overfitting if not properly regularized.

In this example, we have a small training set consisting of only 100 records. As a result, we’re using a high value for the EPOCHS parameter. Typically, in real-world scenarios, you would have a much larger training set. In such cases, the EPOCHS value should be a single- or two-digit number to avoid overfitting the model to the training data.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import json

def load_data(path):
    """Load the dataset from a JSON file."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

dataset = load_data("training.json")

# Load the pre-trained model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Convert the dataset to the required format: each record becomes a positive pair
train_examples = [InputExample(texts=[data["sentence1"], data["sentence2"]]) for data in dataset]

# Create a DataLoader object
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# Define the loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

EPOCHS = 100

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=EPOCHS,
    show_progress_bar=True,
)

# Save the fine-tuned model
model.save("opt/ml/model/", safe_serialization=False)

To deploy and serve the fine-tuned embedding model for inference, we create an inference.py Python script that serves as the entry point. This script implements two essential functions: model_fn and predict_fn, as required by SageMaker for deploying and using machine learning models.

The model_fn function is responsible for loading the fine-tuned embedding model and the associated tokenizer. The predict_fn function takes input sentences, tokenizes them using the loaded tokenizer, and computes their sentence embeddings using the fine-tuned model. To obtain a single vector representation for each sentence, it performs mean pooling over the token embeddings followed by normalization of the resulting embedding. Finally, predict_fn returns the normalized embeddings as a list, which can be further processed or stored as required.

%%writefile opt/ml/model/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir, context=None):
    # Load the fine-tuned model and tokenizer from the model directory
    tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/model")
    model = AutoModel.from_pretrained(f"{model_dir}/model")
    return model, tokenizer

def predict_fn(data, model_and_tokenizer, context=None):
    # Unpack the model and tokenizer
    model, tokenizer = model_and_tokenizer

    # Tokenize sentences
    sentences = data.pop("inputs", data)
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    # Return a dictionary, which is JSON serializable
    return {"vectors": sentence_embeddings[0].tolist()}

After creating the inference.py script, we package it together with the fine-tuned embedding model into a single model.tar.gz file. This compressed file can then be uploaded to an S3 bucket, making it accessible for deployment as a SageMaker endpoint.

import boto3
import tarfile
import os

model_dir = "opt/ml/model"
model_tar_path = "model.tar.gz"

# Package the model directory (including inference.py) into model.tar.gz
with tarfile.open(model_tar_path, "w:gz") as tar:
    tar.add(model_dir, arcname=os.path.basename(model_dir))

s3 = boto3.client('s3')

# Get the Region name
session = boto3.Session()
region_name = session.region_name

# Get the account ID from AWS STS (Security Token Service)
sts_client = session.client("sts")
account_id = sts_client.get_caller_identity()["Account"]

# Default SageMaker bucket and object key for the packaged model
bucket_name = f"sagemaker-{region_name}-{account_id}"
s3_key = "model_trained_embedding/model.tar.gz"
model_path = f"s3://{bucket_name}/{s3_key}"

# Upload the packaged model to Amazon S3
with open(model_tar_path, "rb") as f:
    s3.upload_fileobj(f, bucket_name, s3_key)

Finally, we can deploy our fine-tuned model to a SageMaker endpoint.

from sagemaker.huggingface.model import HuggingFaceModel
import sagemaker

# Create the Hugging Face Model class
huggingface_model = HuggingFaceModel(
    model_data=model_path,                    # S3 path to your trained model artifact
    role=sagemaker.get_execution_role(),      # IAM role with permissions to create an endpoint
    transformers_version="4.26",              # Transformers version used
    pytorch_version="1.13",                   # PyTorch version used
    py_version="py39",                        # Python version used
    entry_point="opt/ml/model/inference.py",  # Inference script
)

# Deploy the model to a SageMaker real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

After the deployment is complete, you can find the deployed SageMaker endpoint in the AWS Management Console for SageMaker by choosing Inference in the navigation pane, and then choosing Endpoints.
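
You can also check the endpoint status programmatically from the notebook; the following is a minimal sketch using boto3 (not specific to this walkthrough):

import boto3

sm_client = boto3.client("sagemaker")

# List the most recently created endpoints and their status
response = sm_client.list_endpoints(SortBy="CreationTime", SortOrder="Descending")
for endpoint in response["Endpoints"]:
    print(endpoint["EndpointName"], endpoint["EndpointStatus"])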

You have multiple options to invoke your endpoint. For example, in your SageMaker JupyterLab, you can invoke it with the following code snippet:

# example request: you always need to define "inputs"
data = {
   "inputs": "Are Agents fully managed?."
}

# request
predictor.predict(data)

It returns the vector containing the embedding of the inputs key:

{'vectors': [0.04694557189941406,
-0.07266131788492203,
-0.058242443948984146,
....,
]}

To illustrate the impact of fine tuning, we can compare the cosine similarity scores between two semantically related sentences using both the original pre-trained model and the fine-tuned model. A higher cosine similarity score indicates that the two sentences are more semantically similar, because their embeddings are closer in the vector space.

Let’s consider the following pair of sentences:

  • What are agents, and how can they be used?
  • Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, create an orchestration plan, securely connect to company data through APIs, and generate accurate responses for complex tasks like automating inventory management or processing insurance claims.

These sentences are related to the concept of agents in the context of Amazon Bedrock, although with different levels of detail. By generating embeddings for these sentences using both models and calculating their cosine similarity, we can evaluate how well each model captures the semantic relationship between them.
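
The comparison can be reproduced with a short sketch like the following. It assumes the fine-tuned model is still available locally under opt/ml/model/ from the training step and uses the Sentence Transformers util.cos_sim helper; the printed values will vary with your training run.

from sentence_transformers import SentenceTransformer, util

sentence_pair = [
    "What are agents, and how can they be used?",
    "Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, "
    "create an orchestration plan, securely connect to company data through APIs, and generate "
    "accurate responses for complex tasks like automating inventory management or processing insurance claims.",
]

# Embed the pair with the original pre-trained model
base_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
base_embeddings = base_model.encode(sentence_pair, normalize_embeddings=True)

# Embed the pair with the fine-tuned model saved earlier in opt/ml/model/
tuned_model = SentenceTransformer("opt/ml/model/")
tuned_embeddings = tuned_model.encode(sentence_pair, normalize_embeddings=True)

print("Pre-trained similarity:", util.cos_sim(base_embeddings[0], base_embeddings[1]).item())
print("Fine-tuned similarity:", util.cos_sim(tuned_embeddings[0], tuned_embeddings[1]).item())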

The original pre-trained model returns a similarity score of only 0.54.

The fine-tuned model returns a similarity score of 0.87.

We can observe how the fine-tuned model was able to identify a much higher semantic similarity between the concepts of agents and Agents for Amazon Bedrock when compared to the pre-trained model. This improvement is attributed to the fine-tuning process, which exposed the model to the domain-specific language and concepts present in the Amazon Bedrock FAQs data, enabling it to better capture the relationship between these terms.

Clean up

To avoid future charges in your account, delete the resources you created in this walkthrough. The SageMaker endpoint and the SageMaker JupyterLab instance will incur charges as long as they are active, so when you're done, delete the endpoint and the other resources that you created while running the walkthrough.
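
For example, if the predictor object from the deployment step is still available in your notebook, a minimal cleanup sketch looks like the following; the JupyterLab space itself can be stopped or deleted from the SageMaker console:

# Delete the model and endpoint created by huggingface_model.deploy()
predictor.delete_model()
predictor.delete_endpoint()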

Conclusion

In this blog post, we have explored the importance of fine tuning embedding models to improve the accuracy of RAG systems in specific domains or tasks. We discussed the limitations of pre-trained embeddings, which are trained on general-purpose datasets and might not capture the nuances and domain-specific semantics required for specialized domains or tasks.

We highlighted the need for domain-specific embeddings, which can be obtained by fine tuning the embedding model on data representative of the target domain or task. This process allows the model to capture the relevant semantics, jargon, and contextual relationships that are crucial for accurate vector representations and, consequently, better retrieval performance in RAG systems.

We then demonstrated how to fine tune embedding models on Amazon SageMaker using the popular Sentence Transformers library.

By fine tuning embeddings on domain-specific data using SageMaker, you can unlock the full potential of RAG systems, enabling more accurate and relevant responses tailored to your specific domain or task. This approach can be particularly valuable in domains like legal, medical, or technical fields where capturing domain-specific nuances is crucial for generating high-quality and trustworthy outputs.

This and more examples are available in the GitHub repo. Try it out today using the Set up for single users (Quick setup) option on Amazon SageMaker, and let us know what you think in the comments.


About the Authors

Ennio Emanuele Pastore is a Senior Architect on the AWS GenAI Labs team. He is an enthusiast of everything related to new technologies that have a positive impact on businesses and general livelihood. He helps organizations in achieving specific business outcomes by using data and AI and accelerating their AWS Cloud adoption journey.


How BRIA AI used distributed training in Amazon SageMaker to train latent diffusion foundation models for commercial use

This post is co-written with Bar Fingerman from BRIA AI.

This post explains how BRIA AI trained BRIA AI 2.0, a high-resolution (1024×1024) text-to-image diffusion model, on a dataset comprising petabytes of licensed images quickly and economically. Amazon SageMaker training jobs and Amazon SageMaker distributed training libraries took on the undifferentiated heavy lifting associated with infrastructure management. SageMaker helps you build, train, and deploy machine learning (ML) models for your use cases with fully managed infrastructure, tools, and workflows.

BRIA AI is a pioneering platform specializing in responsible and open generative artificial intelligence (AI) for developers, offering advanced models exclusively trained on licensed data from partners such as Getty Images, DepositPhotos, and Alamy. BRIA AI caters to major brands, animation and gaming studios, and marketing agencies with its multimodal suite of generative models. Emphasizing ethical sourcing and commercial readiness, BRIA AI’s models are source-available, secure, and optimized for integration with various tech stacks. By addressing foundational challenges in data procurement, continuous model training, and seamless technology integration, BRIA AI aims to be the go-to platform for creative AI application developers.

You can also find the BRIA AI 2.0 model for image generation on AWS Marketplace.

This blog post discusses how BRIA AI worked with AWS to address the following key challenges:

  • Achieving out-of-the-box operational excellence for large model training
  • Reducing time-to-train by using data parallelism
  • Maximizing GPU utilization with efficient data loading
  • Reducing model training cost (by paying only for net training time)

Importantly, BRIA AI was able to use SageMaker while keeping the initially used HuggingFace Accelerate (Accelerate) software stack intact. Thus, transitioning to SageMaker training didn’t require changes to BRIA AI’s model implementation or training code. Later, BRIA AI was able to seamlessly evolve their software stack on SageMaker along with their model training.

Training pipeline architecture

BRIA AI’s training pipeline consists of two main components:

Data preprocessing:

  • Data contributors upload licensed raw image files to BRIA AI’s Amazon Simple Storage Service (Amazon S3) bucket.
  • An image pre-processing pipeline using Amazon Simple Queue Service (Amazon SQS) and AWS Lambda functions generates missing image metadata and packages the training data into large WebDataset files, enabling efficient data streaming directly from an S3 bucket and data sharding across GPUs (see the Challenge 1 section). WebDataset is a PyTorch implementation, so it fits well with Accelerate; a hypothetical packaging sketch follows this list.
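
The following is a minimal, hypothetical sketch of that packaging step using the webdataset library's ShardWriter; the input folder, shard size, and metadata layout are illustrative placeholders, not BRIA AI's actual pipeline code.

import json
from pathlib import Path
import webdataset as wds

IMAGE_DIR = Path("raw_images")   # placeholder: local folder of JPEGs with sidecar .json metadata
Path("shards").mkdir(exist_ok=True)

# Pack many small image and metadata files into large TAR shards (about 2 GB each)
with wds.ShardWriter("shards/train-%06d.tar", maxsize=2 * 1024**3) as sink:
    for idx, image_path in enumerate(sorted(IMAGE_DIR.glob("*.jpg"))):
        metadata = json.loads(image_path.with_suffix(".json").read_text())
        sink.write({
            "__key__": f"sample{idx:09d}",
            "jpg": image_path.read_bytes(),           # raw JPEG bytes
            "json": json.dumps(metadata).encode(),    # caption and other metadata
        })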

Model training:

  • SageMaker distributed training jobs manage the training cluster and run the training itself.
  • Data is streamed from Amazon S3 to the training instances using SageMaker fast file mode.

Pre-training challenges and solutions

Pre-training foundation models is a challenging task. Challenges include cost, performance, orchestration, monitoring, and the engineering expertise needed throughout the weeks-long training process.

The four challenges we faced were:

Challenge 1: Achieving out-of-the-box operational excellence for large model training

To orchestrate the training cluster and recover from failures, BRIA AI relies on SageMaker Training Jobs’ resiliency features. These include cluster health checks, built-in retries, and job resiliency. Before your job starts, SageMaker runs GPU health checks and verifies NVIDIA Collective Communications Library (NCCL) communication on GPU instances, replacing faulty instances (if necessary) to make sure your training script starts running on a healthy cluster of instances. You can also configure SageMaker to automatically retry training jobs that fail with a SageMaker internal server error (ISE). As part of retrying a job, SageMaker will replace instances that encountered unrecoverable GPU errors with fresh instances, reboot the healthy instances, and start the job again. This results in faster restarts and workload completion. By using AWS Deep Learning Containers, the BRIA AI workload benefited from the SageMaker SDK automatically setting the necessary environment variables to tune NVIDIA NCCL AWS Elastic Fabric Adapter (EFA) networking based on well-known best practices. This helps maximize the workload throughput.
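
As an illustration, these resiliency settings are requested when the training job is defined. The following is a hypothetical sketch using the SageMaker Python SDK; the entry point, framework versions, and instance settings are placeholders rather than BRIA AI's actual configuration.

from sagemaker.pytorch import PyTorch
import sagemaker

estimator = PyTorch(
    entry_point="train.py",               # placeholder training script
    role=sagemaker.get_execution_role(),
    framework_version="1.13.1",
    py_version="py39",
    instance_type="ml.p4de.24xlarge",
    instance_count=16,
    max_run=14 * 24 * 3600,               # allow up to two weeks of wall-clock training time
    max_retry_attempts=3,                 # automatically retry on SageMaker internal server errors
)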

To monitor the training cluster, BRIA AI used the built-in SageMaker integration to Amazon CloudWatch logs (applicative logs), and CloudWatch metrics (CPU, GPU, and networking metrics).

Challenge 2: Reducing time-to-train by using data parallelism

BRIA AI needed to train a stable-diffusion 2.0 model from scratch on a petabyte-scale licensed image dataset. Training on a single GPU could take a few months to complete. To meet deadline requirements, BRIA AI used data parallelism with SageMaker training on 16 p4de.24xlarge instances, reducing the total training time to under two weeks. Distributed data parallel training allows for much faster training of large models by splitting data across many devices that train in parallel, while syncing gradients regularly to keep a consistent shared model. It uses the combined computing power of many devices. BRIA AI used a cluster of four p4de.24xlarge instances (each with 8x NVIDIA A100 80 GB GPUs) to achieve a throughput of 1.8 iterations per second for an effective batch size of 2048 (batch=8, bf16, accumulate=2).

p4de.24xlarge instances include 600 GB per second peer-to-peer GPU communication with NVIDIA NVSwitch, and 400 gigabits per second (Gbps) of instance networking with support for EFA and NVIDIA GPUDirect RDMA (remote direct memory access).

Note: Currently you can use p5.48xlarge instances (8XH100 80GB GPUs) with 3200 Gbps networking between instances using EFA 2.0 (not used in this pre-training by BRIA AI).

Accelerate is a library that enables the same PyTorch code to be run across a distributed configuration with minimal code adjustments.

BRIA AI used Accelerate for small-scale training off the cloud. When it was time to scale out training in the cloud, BRIA AI was able to continue using Accelerate, thanks to its built-in integration with SageMaker and the Amazon SageMaker distributed data parallel library (SMDDP). SMDDP is purpose-built for the AWS infrastructure and reduces communication overhead in two ways:

  • The library performs AllReduce, a key operation during distributed training that’s responsible for a large portion of communication overhead (optimal GPU usage with efficient AllReduce overlapping with a backward pass).
  • The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and Amazon Elastic Compute Cloud (Amazon EC2) instance topology (optimal bandwidth use with balanced fusion buffer).

Note that SageMaker training supports many open source distributed training libraries, for example Fully Sharded Data Parallel (FSDP), and DeepSpeed. BRIA AI used FSDP in SageMaker in other training workloads. In this case, by using the ShardingStrategy.SHARD_GRAD_OP feature, BRIA AI was able to achieve an optimal batch size and accelerate their training process.
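
Returning to the SMDDP path used for this pre-training, the library is typically enabled through the estimator's distribution argument. The following hypothetical sketch uses the Hugging Face estimator from the SageMaker Python SDK; the entry point, versions, and instance settings are placeholders, not BRIA AI's actual training configuration.

from sagemaker.huggingface import HuggingFace
import sagemaker

estimator = HuggingFace(
    entry_point="train_with_accelerate.py",   # placeholder Accelerate training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4de.24xlarge",
    instance_count=4,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    # Enable the SageMaker distributed data parallel (SMDDP) library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)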

Challenge 3: Achieving efficient data loading

The BRIA AI dataset included hundreds of millions of images that needed to be delivered from storage onto GPUs for processing. Efficiently accessing this large amount of data across a training cluster presents several challenges:

  • The data might not fit into the storage of a single instance.
  • Downloading the multi-terabyte dataset to each training instance is time consuming while the GPUs sit idle.
  • Copying millions of small image files from Amazon S3 can become a bottleneck because of accumulated roundtrip time of fetching objects from S3.
  • The data needs to be split correctly between instances.

BRIA AI addressed these challenges by using SageMaker fast file input mode, which provided the following out-of-the-box features:

  • Streaming: Instead of copying data when training starts, or using an additional distributed file system, we chose to stream data directly from Amazon S3 to the training instances using SageMaker fast file mode. This allows training to start immediately without waiting for downloads. Streaming also reduces the need to fit datasets into instance storage.
  • Data distribution: Fast file mode was configured to shard the dataset files between multiple instances using S3DataDistributionType=ShardedByS3Key.
  • Local file access: Fast file mode provides a local POSIX filesystem interface to data in Amazon S3. This allowed BRIA AI’s data loader to access remote data as if it was local.
  • Packaging files to large containers: Using millions of small image and metadata files is an overhead when streaming data from object storage like Amazon S3. To reduce this overhead, BRIA AI compacted multiple files into large TAR file containers (2–5 GB), which can be efficiently streamed from S3 using fast file mode to the instances. Specifically, BRIA AI used WebDataset for efficient local data loading and used a policy wherein there is no data loading synchronization between instances and each GPU loads random batches through a fixed seed. This policy helps eliminate bottlenecks and maintains fast and deterministic data loading performance.

For more on data loading considerations, see Choose the best data source for your Amazon SageMaker training job blog post.
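
A hypothetical sketch of these fast file mode settings with the SageMaker Python SDK follows; it assumes an estimator object such as the one sketched in Challenge 2, and the bucket and prefix are placeholders.

from sagemaker.inputs import TrainingInput

# Stream the packaged TAR shards directly from Amazon S3 with fast file mode,
# sharding the objects across training instances by S3 key
train_input = TrainingInput(
    s3_data="s3://example-training-bucket/webdataset-shards/",   # placeholder S3 prefix
    input_mode="FastFile",
    distribution="ShardedByS3Key",
)

estimator.fit({"training": train_input})   # estimator defined as in the Challenge 2 sketch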

Challenge 4: Paying only for net training time

Pre-training large language models is not continuous. The model training often requires intermittent stops for evaluation and adjustments. For instance, the model might stop converging and need adjustments, or you might want to pause training to test the model, refine data, or troubleshoot issues. These pauses result in extended periods where the GPU cluster is idle. With SageMaker training jobs, BRIA AI was able to only pay for the duration of their active training time. This allowed BRIA AI to train models at a lower cost and with greater efficiency.

BRIA AI's training strategy is composed of three steps of increasing resolution for optimal model convergence:

  1. Initial training at 256×256 resolution on a 32-GPU cluster
  2. Progressive refinement at 512×512 resolution on a 64-GPU cluster
  3. Final training at 1024×1024 resolution on a 128-GPU cluster

In each step, the computing required was different because of the tradeoffs applied, such as the batch size per resolution, the upper limit of the GPU, and gradient accumulation. The tradeoff is between cost-saving and model coverage.

BRIA AI’s cost calculations were facilitated by maintaining a consistent iteration per second rate, which allowed for accurate estimation of training time. This enabled precise determination of the required number of iterations and calculation of the training compute cost per hour.

BRIA AI training GPU utilization and average batch size time:

  • GPU utilization: The average is over 98 percent, signifying that the GPUs were fully used for the whole training cycle and that the data loader streamed data efficiently at a high rate.
  • Iterations per second: The rate varies across the three training steps (256×256, 512×512, and 1024×1024 resolution) because each step uses a different batch size per resolution, bounded by the upper limit of the GPU and gradient accumulation, where the tension is cost-saving against model coverage.

Result examples

Prompts used for generating the images
Prompt 1, upper left image: A stylish man sitting casually on outdoor steps, wearing a green hoodie, matching green pants, black shoes, and sunglasses. He is smiling and has neatly groomed hair and a short beard. A brown leather bag is placed beside him. The background features a brick wall and a window with white frames.

Prompt 2, upper right image: A vibrant Indian wedding ceremony. The smiling bride in a magenta saree with gold embroidery and henna-adorned hands sits adorned in traditional gold jewelry. The groom, sitting in front of her, in a golden sherwani and white dhoti, pours water into a ceremonial vessel. They are surrounded by flowers, candles, and leaves in a colorful, festive atmosphere filled with traditional objects.

Prompt 3, lower left image: A wooden tray filled with a variety of delicious pastries. The tray includes a croissant dusted with powdered sugar, a chocolate-filled croissant, a partially eaten croissant, a Danish pastry and a muffin next to a small jar of chocolate sauce, and a bowl of coffee beans, all arranged on a beige cloth.

Prompt 4, lower right image: A panda pouring milk into a white cup on a table with coffee beans, flowers, and a coffee press. The background features a black-and-white picture and a decorative wall piece.

Conclusion

In this post, we saw how Amazon SageMaker enabled BRIA AI to train a diffusion model efficiently, without needing to manually provision and configure infrastructure. By using SageMaker training, BRIA AI was able to reduce costs and accelerate iteration speed, reducing training time with distributed training while maintaining 98 percent GPU utilization, and maximize value per cost. By taking on the undifferentiated heavy lifting, SageMaker empowered BRIA AI’s team to be more productive and deliver innovations faster. The ease of use and automation offered by SageMaker training jobs makes it an attractive option for any team looking to efficiently train large, state-of-the-art models.

To learn more about how SageMaker can help you train large AI models efficiently and cost-effectively, explore the Amazon SageMaker page. You can also reach out to your AWS account team to discover how to unlock the full potential of your large-scale AI initiatives.


About the Authors

Bar Fingerman, Head of Engineering AI/ML at BRIA AI.

Doron Bleiberg, Senior Startup Solutions Architect.

Gili Nachum, Principal Gen AI/ML Specialist Solutions Architect.

Erez Zarum, Startup Solutions Architect.
