Get started quickly with AWS Trainium and AWS Inferentia using AWS Neuron DLAMI and AWS Neuron DLC

Starting with the AWS Neuron 2.18 release, you can now launch Neuron DLAMIs (AWS Deep Learning AMIs) and Neuron DLCs (AWS Deep Learning Containers) with the latest released Neuron packages on the same day as the Neuron SDK release. When a Neuron SDK is released, you’ll now be notified of the support for Neuron DLAMIs and Neuron DLCs in the Neuron SDK release notes, with a link to the AWS documentation containing the DLAMI and DLC release notes. In addition, this release introduces a number of features that help improve user experience for Neuron DLAMIs and DLCs. In this post, we walk through some of the support highlights with Neuron 2.18.

Neuron DLC and DLAMI overview and announcements

The DLAMI is a pre-configured AMI that comes with popular deep learning frameworks like TensorFlow, PyTorch, Apache MXNet, and others pre-installed. This allows machine learning (ML) practitioners to rapidly launch an Amazon Elastic Compute Cloud (Amazon EC2) instance with a ready-to-use deep learning environment, without having to spend time manually installing and configuring the required packages. The DLAMI supports various instance types, including AWS Trainium and AWS Inferentia powered instances, for accelerated training and inference.

AWS DLCs provide a set of Docker images that are pre-installed with deep learning frameworks. The containers are optimized for performance and available in Amazon Elastic Container Registry (Amazon ECR). DLCs make it straightforward to deploy custom ML environments in a containerized manner, while taking advantage of the portability and reproducibility benefits of containers.

Multi-Framework DLAMIs

The Neuron Multi-Framework DLAMI for Ubuntu 22 provides separate virtual environments for multiple ML frameworks: PyTorch 2.1, PyTorch 1.13, Transformers NeuronX, and TensorFlow 2.10. The DLAMI offers you the convenience of having all these popular frameworks readily available in a single AMI, simplifying their setup and reducing the need for multiple installations.

This new Neuron Multi-Framework DLAMI is now the default choice when launching Neuron instances for Ubuntu through the AWS Management Console, making it even faster for you to get started with the latest Neuron capabilities right from the Quick Start AMI list.

Existing Neuron DLAMI support

The existing Neuron DLAMIs for PyTorch 1.13 and TensorFlow 2.10 have been updated with the latest 2.18 Neuron SDK, making sure you have access to the latest performance optimizations and features for both Ubuntu 20 and Amazon Linux 2 distributions.

AWS Systems Manager Parameter Store support

Neuron 2.18 also introduces support in Parameter Store, a capability of AWS Systems Manager, for Neuron DLAMIs, allowing you to effortlessly find and query the DLAMI ID with the latest Neuron SDK release. This feature streamlines the process of launching new instances with the most up-to-date Neuron SDK, enabling you to automate your deployment workflows and make sure you’re always using the latest optimizations.

Availability of Neuron DLC Images in Amazon ECR

To provide customers with more deployment options, Neuron DLCs are now hosted both in the public Neuron ECR repository and as private images. Public images provide seamless integration with AWS ML deployment services such as Amazon EC2, Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS); private images are required when using Neuron DLCs with Amazon SageMaker.

Updated Dockerfile locations

Prior to this release, Dockerfiles for Neuron DLCs were located within the AWS/Deep Learning Containers repository. Moving forward, Neuron containers can be found in the AWS-Neuron/Deep Learning Containers repository.

Improved documentation

The Neuron SDK documentation and AWS documentation sections for DLAMI and DLC now have up-to-date user guides about Neuron. The Neuron SDK documentation also includes a dedicated DLAMI section with guides on discovering, installing, and upgrading Neuron DLAMIs, along with links to release notes in AWS documentation.

Using the Neuron DLC and DLAMI with Trn and Inf instances

AWS Trainium and AWS Inferentia are custom ML chips designed by AWS to accelerate deep learning workloads in the cloud.

You can choose your desired Neuron DLAMI when launching Trn and Inf instances through the console or infrastructure automation tools like AWS Command Line Interface (AWS CLI). After a Trn or Inf instance is launched with the selected DLAMI, you can activate the virtual environment corresponding to your chosen framework and begin using the Neuron SDK. If you’re interested in using DLCs, refer to the DLC documentation section in the Neuron SDK documentation or the DLC release notes section in the AWS documentation to find the list of Neuron DLCs with the latest Neuron SDK release. Each DLC in the list includes a link to the corresponding container image in the Neuron container registry. After choosing a specific DLC, please refer to the DLC walkthrough in the next section to learn how to launch scalable training and inference workloads using AWS services like Kubernetes (Amazon EKS), Amazon ECS, Amazon EC2, and SageMaker. The following sections contain walkthroughs for both the Neuron DLC and DLAMI.

DLC walkthrough

In this section, we provide resources to help you use containers to accelerate your deep learning models on AWS Inferentia and Trainium enabled instances.

The section is organized based on the target deployment environment and use case. In most cases, it is recommended to use a preconfigured DLC from AWS. Each DLC is preconfigured to have all the Neuron components installed and is specific to the chosen ML framework.

Locate the Neuron DLC image

The PyTorch Neuron DLC images are published to ECR Public Gallery, which is the recommended URL to use for most cases. If you’re working within SageMaker, use the Amazon ECR URL instead of the Amazon ECR Public Gallery. TensorFlow DLCs are not updated with the latest release. For earlier releases, refer to Neuron Containers. In the following sections, we provide the recommended steps for running an inference or training job in Neuron DLCs.

Prerequisites

Prepare your infrastructure (Amazon EKS, Amazon ECS, Amazon EC2, and SageMaker) with AWS Inferentia or Trainium instances as worker nodes, making sure they have the necessary roles attached for Amazon ECR read access to retrieve container images from Amazon ECR: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly.

When setting up hosts for Amazon EC2 and Amazon ECS, using the Deep Learning AMI (DLAMI) is recommended. For Amazon EKS, an Amazon EKS optimized GPU AMI is recommended.

You also need the ML job scripts ready with a command to invoke them. In the following steps, we use a single file, train.py, as the ML job script. The command to invoke it is torchrun --nproc_per_node=2 --nnodes=1 train.py.
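
For reference, the following is a minimal sketch of what train.py could look like. It assumes the torch-neuronx and torch-xla packages that ship in the Neuron PyTorch training DLC, and it uses a synthetic model and data purely for illustration; it is not a production training script.

# train.py -- minimal illustrative training loop on a NeuronCore (synthetic data).
# Assumes torch and torch-xla/torch-neuronx are preinstalled in the Neuron DLC.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm


def main():
    device = xm.xla_device()  # maps to a NeuronCore through torch-neuronx
    model = nn.Linear(32, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        inputs = torch.randn(64, 32).to(device)
        labels = torch.randint(0, 2, (64,)).to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        xm.optimizer_step(optimizer)  # XLA-aware optimizer step
        print(f"step {step}: loss {loss.item():.4f}")


if __name__ == "__main__":
    main()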

Extend the Neuron DLC

Extend the Neuron DLC to include your ML job scripts and other necessary logic. As the simplest example, you can have the following Dockerfile:

FROM public.ecr.aws/neuron/pytorch-training-neuronx:2.1.2-neuronx-py310-sdk2.18.2-ubuntu20.04

COPY train.py /train.py

This Dockerfile uses the Neuron PyTorch training container as a base and adds your training script, train.py, to the container.

Build and push to Amazon ECR

Complete the following steps:

  1. Build your Docker image:
    docker build -t <your-repo-name>:<your-image-tag> .

  2. Authenticate your Docker client to your ECR registry:
    aws ecr get-login-password --region <your-region> | docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com

  3. Tag your image to match your repository:
    docker tag <your-repo-name>:<your-image-tag> <your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-repo-name>:<your-image-tag>

  4. Push this image to Amazon ECR:
    docker push <your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-repo-name>:<your-image-tag>

You can now run the extended Neuron DLC in different AWS services.

Amazon EKS configuration

For Amazon EKS, create a simple pod YAML file to use the extended Neuron DLC. For example:

apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
  - name: training-container
    image: <your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-repo-name>:<your-image-tag>
    command: ["torchrun"]
    args: ["--nproc_per_node=2", "--nnodes=1", "/train.py"]
    resources:
      limits:
        aws.amazon.com/neuron: 1
      requests:
        aws.amazon.com/neuron: 1

Use kubectl apply -f <pod-file-name>.yaml to deploy this pod in your Kubernetes cluster.

Amazon ECS configuration

For Amazon ECS, create a task definition that references your custom Docker image. The following is an example JSON task definition:

{
    "family": "training-task",
    "requiresCompatibilities": ["EC2"],
    "containerDefinitions": [
        {
            "command": [
                "torchrun --nproc_per_node=2 --nnodes=1 /train.py"
            ],
            "linuxParameters": {
                "devices": [
                    {
                        "containerPath": "/dev/neuron0",
                        "hostPath": "/dev/neuron0",
                        "permissions": [
                            "read",
                            "write"
                        ]
                    }
                ],
                "capabilities": {
                    "add": [
                        "IPC_LOCK"
                    ]
                }
            },
            "cpu": 0,
            "memoryReservation": 1000,
            "image": "<your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-repo-name>:<your-image-tag>",
            "essential": true,
            "name": "training-container",
        }
    ]
}

This definition sets up a task with the necessary configuration to run your containerized application in Amazon ECS.

Amazon EC2 configuration

For Amazon EC2, you can directly run your Docker container:

docker run --name training-job --device=/dev/neuron0 <your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-repo-name>:<your-image-tag> "torchrun --nproc_per_node=2 --nnodes=1 /train.py"

SageMaker configuration

For SageMaker, create a model with your container and specify the training job command in the SageMaker SDK:

import sagemaker
from sagemaker.pytorch import PyTorch
role = sagemaker.get_execution_role()
pytorch_estimator = PyTorch(entry_point='train.py',
                            role=role,
                            image_uri='<your-aws-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-repo-name>:<your-image-tag>',
                            instance_count=1,
                            instance_type='ml.trn1.2xlarge',
                            framework_version='2.1.2',
                            py_version='py3',
                            hyperparameters={"nproc-per-node": 2, "nnodes": 1},
                            script_mode=True)
pytorch_estimator.fit()

DLAMI walkthrough

This section walks through launching an Inf1, Inf2, or Trn1 instance using the Multi-Framework DLAMI in the Quick Start AMI list and getting the latest DLAMI that supports the newest Neuron SDK release easily.

The Neuron DLAMI is a multi-framework DLAMI that supports multiple Neuron frameworks and libraries. Each DLAMI is pre-installed with Neuron drivers and supports all Neuron instance types. Each virtual environment that corresponds to a specific Neuron framework or library comes pre-installed with all the Neuron libraries, including the Neuron compiler and Neuron runtime needed for you to get started.

This release introduces a new Multi-Framework DLAMI for Ubuntu 22 that you can use to quickly get started with the latest Neuron SDK on the multiple frameworks that Neuron supports, as well as Systems Manager (SSM) parameter support for DLAMIs to automate the retrieval of the latest DLAMI ID in cloud automation flows.

For instructions on getting started with the multi-framework DLAMI through the console, refer to Get Started with Neuron on Ubuntu 22 with Neuron Multi-Framework DLAMI. If you want to use the Neuron DLAMI in your cloud automation flows, Neuron also supports SSM parameters to retrieve the latest DLAMI ID.

Launch the instance using Neuron DLAMI

Complete the following steps:

  1. On the Amazon EC2 console, choose your desired AWS Region and choose Launch Instance.
  2. On the Quick Start tab, choose Ubuntu.
  3. For Amazon Machine Image, choose Deep Learning AMI Neuron (Ubuntu 22.04).
  4. Specify your desired Neuron instance.
  5. Configure disk size and other criteria.
  6. Launch the instance.

Activate the virtual environment

Activate your desired virtual environment, as shown in the following screenshot.

After you have activated the virtual environment, you can try out one of the tutorials listed in the corresponding framework or library training and inference section.

Use SSM parameters to find specific Neuron DLAMIs

Neuron DLAMIs support SSM parameters to quickly find Neuron DLAMI IDs. As of this writing, we only support finding the latest DLAMI ID that corresponds to the latest Neuron SDK release with SSM parameter support. In future releases, we will add support for finding the DLAMI ID using SSM parameters for a specific Neuron release.

You can find the DLAMI that supports the latest Neuron SDK by using the get-parameter command:

aws ssm get-parameter \
--region us-east-1 \
--name <dlami-ssm-parameter-prefix>/latest/image_id \
--query "Parameter.Value" \
--output text

For example, to find the latest DLAMI ID for the Multi-Framework DLAMI (Ubuntu 22), you can use the following code:

aws ssm get-parameter \
--region us-east-1 \
--name /aws/service/neuron/dlami/multi-framework/ubuntu-22.04/latest/image_id \
--query "Parameter.Value" \
--output text

You can find all available parameters supported in Neuron DLAMIs using the AWS CLI:

aws ssm get-parameters-by-path \
--region us-east-1 \
--path /aws/service/neuron \
--recursive

You can also view the SSM parameters supported in Neuron through Parameter Store by selecting the neuron service.
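
If you prefer to query these parameters from Python rather than the AWS CLI, the following boto3 sketch shows an equivalent lookup; the Region and parameter paths mirror the CLI examples above.

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

# List every SSM parameter published under the Neuron namespace (paginated).
paginator = ssm.get_paginator("get_parameters_by_path")
for page in paginator.paginate(Path="/aws/service/neuron", Recursive=True):
    for parameter in page["Parameters"]:
        print(parameter["Name"])

# Resolve the latest multi-framework DLAMI (Ubuntu 22.04) image ID.
response = ssm.get_parameter(
    Name="/aws/service/neuron/dlami/multi-framework/ubuntu-22.04/latest/image_id"
)
print(response["Parameter"]["Value"])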

Use SSM parameters to launch an instance directly using the AWS CLI

You can use the AWS CLI to find the latest DLAMI ID and launch the instance simultaneously. The following code snippet shows an example of launching an Inf2 instance using the SSM parameter for a Neuron DLAMI (in this example, the PyTorch 1.13 DLAMI for Amazon Linux 2):

aws ec2 run-instances \
--region us-east-1 \
--image-id resolve:ssm:/aws/service/neuron/dlami/pytorch-1.13/amazon-linux-2/latest/image_id \
--count 1 \
--instance-type inf2.48xlarge \
--key-name <my-key-pair> \
--security-groups <my-security-group>

Use SSM parameters in EC2 launch templates

You can also use SSM parameters directly in launch templates. You can update your Auto Scaling groups to use new AMI IDs without needing to create new launch templates or new versions of launch templates each time an AMI ID changes.
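
As an illustration, the following boto3 sketch creates a launch template whose ImageId references the Neuron SSM parameter, so instances launched from the template resolve to the latest DLAMI at launch time. The template name, instance type, and key pair name are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_launch_template(
    LaunchTemplateName="neuron-dlami-launch-template",  # placeholder name
    LaunchTemplateData={
        # EC2 resolves this SSM parameter reference to the current AMI ID at launch.
        "ImageId": "resolve:ssm:/aws/service/neuron/dlami/multi-framework/ubuntu-22.04/latest/image_id",
        "InstanceType": "trn1.2xlarge",
        "KeyName": "my-key-pair",  # placeholder key pair
    },
)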

Clean up

When you’re done running the resources that you deployed as part of this post, make sure to delete or stop them from running and accruing charges:

  1. Stop your EC2 instance.
  2. Delete your ECS cluster.
  3. Delete your EKS cluster.
  4. Clean up your SageMaker resources.

Conclusion

In this post, we introduced several enhancements incorporated into Neuron 2.18 that improve the user experience and time-to-value for customers working with AWS Inferentia and Trainium instances. The availability of Neuron DLAMIs and DLCs with the latest Neuron SDK on the same day as the release means you can immediately benefit from the latest performance optimizations and features, along with up-to-date documentation for installing and upgrading Neuron DLAMIs and DLCs.

Additionally, you can now use the Multi-Framework DLAMI, which simplifies the setup process by providing isolated virtual environments for multiple popular ML frameworks. Finally, we discussed Parameter Store support for Neuron DLAMIs that streamlines the process of launching new instances with the most up-to-date Neuron SDK, enabling you to automate your deployment workflows with ease.

Neuron DLCs are available in both private and public Amazon ECR repositories to help you deploy Neuron in your preferred AWS service. Refer to the following resources to get started:


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Sebastian Bustillo is an Enterprise Solutions Architect at AWS. He focuses on AI/ML technologies and has a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI, assisting with the overall process from ideation to production. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the outdoors with his wife.

Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming and other various sports, and immersing himself in music.

Anant Sharma is a software engineer at AWS Annapurna Labs specializing in DevOps. His primary focus revolves around building, automating and refining the process of delivering software to AWS Trainium and Inferentia customers. Beyond work, he’s passionate about gaming, exploring new destinations and following latest tech developments.

Roopnath Grandhi is a Sr. Product Manager at AWS. He leads large-scale model inference and developer experiences for AWS Trainium and Inferentia AI accelerators. With over 15 years of experience in architecting and building AI based products and platforms, he holds multiple patents and publications in AI and eCommerce.

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions and conducting research to help customers hyperscale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA and enjoys writing, reading, exercising, and building applications in his free time.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect, and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Read More

Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3

This is a guest post co-written with Ratnesh Jamidar and Vinayak Trivedi from Sprinklr.

Sprinklr’s mission is to unify silos, technology, and teams across large, complex companies. To achieve this, we provide four product suites: Sprinklr Service, Sprinklr Insights, Sprinklr Marketing, and Sprinklr Social, as well as several self-serve offerings.

Each of these products is infused with artificial intelligence (AI) capabilities to deliver exceptional customer experience. Sprinklr’s specialized AI models streamline data processing, gather valuable insights, and enable workflows and analytics at scale to drive better decision-making and productivity.

In this post, we describe the scale of our AI offerings, the challenges with diverse AI workloads, and how we optimized mixed AI workload inference performance with AWS Graviton3 based c7g instances and achieved 20% throughput improvement, 30% latency reduction, and reduced our cost by 25–30%.

Sprinklr’s AI scale and challenges with diverse AI workloads

Our purpose-built AI processes unstructured customer experience data from millions of sources, providing actionable insights and improving productivity for customer-facing teams to deliver exceptional experiences at scale. To understand our scaling and cost challenges, let’s look at some representative numbers. Sprinklr’s platform uses thousands of servers that fine-tune and serve over 750 pre-built AI models across over 60 verticals, and run more than 10 billion predictions per day.

To deliver a tailored user experience across these verticals, we deploy patented AI models fine-tuned for specific business applications and use nine layers of machine learning (ML) to extract meaning from data across formats: automatic speech recognition, natural language processing, computer vision, network graph analysis, anomaly detection, trends, predictive analysis, natural language generation, and similarity engine.

The diverse and rich database of models brings unique challenges for choosing the most efficient deployment infrastructure that gives the best latency and performance.

For example, for mixed AI workloads, the AI inference is part of the search engine service with real-time latency requirements. In these cases, the model sizes are smaller, which means the communication overhead with GPUs or ML accelerator instances outweighs their compute performance benefits. Also, inference requests are infrequent, which means accelerators are more often idle and not cost-effective. Therefore, the production instances were not cost-effective for these mixed AI workloads, causing us to look for new instances that offer the right balance of scale and cost-effectiveness.

Cost-effective ML inference using AWS Graviton3

Graviton3 processors are optimized for ML workloads, including support for bfloat16, Scalable Vector Extension (SVE), twice the Single Instruction Multiple Data (SIMD) bandwidth, and 50% more memory bandwidth compared to AWS Graviton2 processors, making them an ideal choice for our mixed workloads. Our goal is to use the latest technologies for efficiency and cost savings, so when AWS released Graviton3-based Amazon Elastic Compute Cloud (Amazon EC2) instances, we were excited to try them in our mixed workloads, especially given our previous Graviton experience. For over 3 years, we have run our search infrastructure on Graviton2-based EC2 instances and our real-time and batched inference workloads on AWS Inferentia ML-accelerated instances, and in both cases we improved latency by 30% and achieved up to 40% price-performance benefits over comparable x86 instances.

To migrate our mixed AI workloads from x86-based instances to Graviton3-based c7g instances, we took a two-step approach. First, we had to experiment and benchmark in order to determine that Graviton3 was indeed the right solution for us. After that was confirmed, we had to perform the actual migration.

First, we started by benchmarking our workloads using the readily available Graviton Deep Learning Containers (DLCs) in a standalone environment. As early adopters of Graviton for ML workloads, it was initially challenging to identify the right software versions and the runtime tunings. During this journey, we collaborated with our AWS technical account manager and the Graviton software engineering teams. We collaborated closely and frequently for the optimized software packages and detailed instructions on how to tune them to achieve optimum performance. In our test environment, we observed 20% throughput improvement and 30% latency reduction across multiple natural language processing models.

After we had validated that Graviton3 met our needs, we integrated the optimizations into our production software stack. The AWS account team assisted us promptly, helping us ramp up quickly to meet our deployment timelines. Overall, migration to Graviton3-based instances was smooth, and it took less than 2 months to achieve the performance improvements in our production workloads.

Results

By migrating our mixed inference/search workloads to Graviton3-based c7g instances from the comparable x86-based instances, we achieved the following:

  • Higher performance – We realized 20% throughput improvement and 30% latency reduction.
  • Reduced cost – We achieved 25–30% cost savings.
  • Improved customer experience – By reducing the latency and increasing throughput, we significantly improved the performance of our products and services, providing the best user experience for our customers.
  • Sustainable AI – Because we saw a higher throughput on the same number of instances, we were able to lower our overall carbon footprint, and we made our products appealing to environmentally conscious customers.
  • Better software quality and maintenance – The AWS engineering team upstreamed all the software optimizations into PyTorch and TensorFlow open source repositories. As a result, our current software upgrade process on Graviton3-based instances is seamless. For example, PyTorch (v2.0+), TensorFlow (v2.9+), and Graviton DLCs come with Graviton3 optimizations, and the user guides provide best practices for runtime tuning (see the illustrative sketch after this list).
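
The following is a minimal sketch of the kind of runtime tuning those user guides describe for PyTorch inference on Graviton3, such as enabling oneDNN bfloat16 fast math. The specific settings, values, and model shown are illustrative assumptions, not Sprinklr's production configuration.

import os

# Enable oneDNN bfloat16 fast-math kernels where Graviton3 supports them.
# This must be set before PyTorch (and oneDNN) is imported and initialized.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import torch

torch.set_num_threads(os.cpu_count())  # one intra-op thread per vCPU on the c7g host

# Placeholder model standing in for an NLP classifier served in production.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2)
).eval()

with torch.inference_mode():
    logits = model(torch.randn(1, 768))
print(logits)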

So far, we have migrated PyTorch and TensorFlow based DistilRoBERTa-base, spaCy clustering, Prophet, and XLM-R models to Graviton3-based c7g instances. These models serve intent detection, text clustering, creative insights, text classification, smart budget allocation, and image download services. These services power our unified customer experience management (Unified-CXM) platform and conversational AI to allow brands to build more self-serve use cases for their customers. Next, we are migrating ONNX and other larger models to Graviton3-based m7g general purpose and Graviton2-based g5g GPU instances to achieve similar performance improvements and cost savings.

Conclusion

Switching to Graviton3-based instances was quick in terms of engineering time, and resulted in 20% throughput improvement, 30% latency reduction, 25–30% cost savings, improved customer experience, and a lower carbon footprint for our workloads. Based on our experience, we will continue to seek new compute from AWS that will reduce our costs and improve the customer experience.

For further reading, refer to the following:


About the Authors

Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for Machine Learning and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions with Arm SoCs.

Gaurav Garg is a Sr. Technical Account Manager at AWS with 15 years of experience. He has a strong operations background. In his role he works with Independent Software Vendors to build scalable and cost-effective solutions with AWS that meet the business requirements. He is passionate about Security and Databases.

Ratnesh Jamidar is an AVP of Engineering at Sprinklr with 8 years of experience. He is a seasoned machine learning professional with expertise in designing and implementing large-scale, distributed, and highly available AI products and infrastructure.

Vinayak Trivedi is an Associate Director of Engineering at Sprinklr with 4 years of experience in Backend & AI. He is proficient in Applied Machine Learning & Data Science, with a history of building large-scale, scalable and resilient systems.

Read More

How Wiz is empowering organizations to remediate security risks faster with Amazon Bedrock

Wiz is a cloud security platform that enables organizations to secure everything they build and run in the cloud by rapidly identifying and removing critical risks. Over 40% of the Fortune 100 trust Wiz’s purpose-built cloud security platform to gain full-stack visibility, accurate risk prioritization, and enhanced business agility. Organizations can connect Wiz in minutes to scan the entire cloud environment without agents and identify the issues representing real risk. Security and cloud teams can then proactively remove risks and harden cloud environments with remediation workflows.

Artificial intelligence (AI) has revolutionized the way organizations function, paving the way for automation and improved efficiency in various tasks that were traditionally manual. One of these use cases is using AI in security organizations to improve security processes and increase your overall security posture. One of the major challenges in cloud security is discerning the best ways to resolve an identified issue in the most effective way to allow you to respond quickly.

Wiz has harnessed the power of generative AI to help organizations remove risks in their cloud environment faster. With Wiz’s new integration with Amazon Bedrock, Wiz customers can now generate guided remediation steps backed by foundation models (FMs) running on Amazon Bedrock to reduce their mean time to remediation (MTTR). Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

“The Wiz and Amazon Bedrock integration enables organizations to further enhance security and improve remediation time by leveraging a choice of powerful foundation models to generate GenAI-powered remediation steps.”

– Vivek Singh, Senior Manager, Product Management-Tech, AWS AI

In this post, we share how Wiz uses Amazon Bedrock to generate remediation guidance for customers that allow them to quickly address security risks in their cloud environment.

Detecting security risks in the cloud with the Wiz Security Graph

Wiz scans cloud environments without agents and runs deep risk assessment across network exposures, vulnerabilities, misconfigurations, identities, data, secrets, and malware. Wiz stores the entire technology stack as well as any risks detected on the Wiz Security Graph, which is backed by Amazon Neptune. Neptune enables Wiz to quickly traverse the graph and understand interconnected risk factors in seconds and how they create an attack path. The Security Graph allows Wiz to surface these critical attack paths in the form of Wiz Issues. For example, a Wiz Issue can alert of a publicly exposed Amazon Elastic Compute Cloud (Amazon EC2) instance that is vulnerable, has admin permissions, and can access sensitive data. The following graph illustrates this attack path.

Attack path

With its Security Graph, Wiz provides customers with pinpoint-accurate alerts on security risks in their environment, reduces the noise faced with traditional security tools, and enables organizations to focus on the most critical risks in their environment.

Remediating cloud risks with guided remediation provided by Amazon Bedrock

To help customers remediate security risks even faster, Wiz uses Amazon Bedrock to analyze metadata from Wiz Issues to generate effective remediation recommendations for customers. With Amazon Bedrock, Wiz combines its deep risk context with cutting-edge FMs to offer enhanced remediation guidance to customers. Customers can scale their remediation workflow and minimize their MTTR by generating straightforward-to-use copy-paste remediation steps that can be directly implemented into the tool of their choice, such as the AWS Command Line Interface (AWS CLI), Terraform, AWS CloudFormation, Pulumi, Go, and Python, or directly using the cloud environment console. The following screenshot showcases an example of the remediation steps generated by Amazon Bedrock for a Wiz Issue.

An example of the remediation steps generated by Amazon Bedrock for a Wiz Issue

Wiz sends a prompt with all the relevant context around a security risk to Amazon Bedrock with instructions on how to present the results based on the target platform. Amazon Bedrock native APIs allow Wiz to select the best model for the use case to answer the request, so when it’s received, it’s parsed and presented in a straightforward manner in the Wiz portal.
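
To make this flow concrete, the following is a simplified illustration (not Wiz's actual service code) of invoking a foundation model on Amazon Bedrock with boto3. The model ID, prompt content, and request body shape follow the Anthropic Claude messages format on Bedrock and are assumptions for illustration only.

import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Issue context plus output-format instructions packed into a single prompt.
prompt = (
    "You are a cloud security assistant. Given the following issue context, "
    "produce copy-paste remediation steps as AWS CLI commands.\n\n"
    "Issue: publicly exposed EC2 instance with an over-permissive IAM role."
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model choice
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }),
)

# Parse the model's remediation text from the response body.
result = json.loads(response["body"].read())
print(result["content"][0]["text"])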

To fully operationalize this functionality in production, the Wiz backend has a service running on Amazon Elastic Kubernetes Service (Amazon EKS) that receives the customer request to generate remediation steps, collects the context of the alert the customer wants to remediate, and runs personally identifiable information (PII) redaction on the data to remove any sensitive data. Then, another service running on Amazon EKS pulls the resulting data and sends it to Amazon Bedrock. Such a flow can run in each required AWS Region supported by Amazon Bedrock to address customers’ compliance needs.

In addition, to secure the usage of Amazon Bedrock with least privilege, Wiz uses AWS permission sets and follows AWS best practices. The Wiz service sending the prompt to Amazon Bedrock has a dedicated AWS Identity and Access Management (IAM) role that allows it to communicate only with the specific Amazon Bedrock service and to only generate those requests. Amazon Bedrock also has restrictions to block any data coming from a non-authorized service. Using these AWS services and the Wiz Security Graph, Wiz helps its customers adopt the most advanced LLMs to speed up the process of addressing complex security issues in a straightforward and secure manner. The following diagram illustrates this architecture.

System architecture

Wiz customers are already experiencing the advantages of our new AI-driven remediation:

“The faster we can remediate security risks, the more we can focus on driving broader strategic initiatives. With Wiz’s AI-powered remediation, we can quickly generate remediation steps that our security team and developers can simply copy-paste to remediate the issue.”

– Rohit Kohli, Deputy CISO, Genpact

By using Amazon Bedrock for generating AI-powered remediation steps, we learned that security teams are able to minimize the time spent investigating complex risks by 40%, allowing them to focus on mitigating more risks. Furthermore, they are able to empower developers to remediate risks by removing the need for security expertise and providing them with exact steps to take. Not only does Wiz use AI to enhance security processes for customers, but it also makes it effortless for customers to securely adopt AI in their organization with its AI Security Posture Management capabilities, empowering them to protect their AI models while increasing innovation.

Conclusion

Using generative AI for generating enhanced remediation steps marks a significant advancement in the realm of problem-solving and automation. By harnessing the power of AI models powered by Amazon Bedrock, Wiz users can quickly remediate risks with straightforward remediation guidance, reducing manual efforts and improving MTTR. Learn more about Wiz and check out a live demo.


About the Authors

Shaked Rotlevi is a Technical Product Marketing Manager at Wiz focusing on AI security. Prior to Wiz she was a Solutions Architect at AWS working with public sector customers as well as a Technical Program Manager for a security service team. In her spare time she enjoys playing beach volleyball and hiking.

Itay Arbel is a Lead Product Manager at Wiz. Before joining Wiz, Itay was a product manager at Microsoft and earned an MBA at Oxford University, majoring in high tech and emerging technologies. Itay is Wiz’s product lead for the effort of helping organizations secure their AI pipelines and usage of this new emerging technology.

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Adi Avni is a Senior Solutions Architect at AWS based in Israel. Adi works with AWS ISV customers, helping them to build innovative, scalable and cost-effective solutions on AWS. In his spare time, he enjoys sports and traveling with family and friends.

Read More

Code generation using Code Llama 70B and Mixtral 8x7B on Amazon SageMaker

In the ever-evolving landscape of machine learning and artificial intelligence (AI), large language models (LLMs) have emerged as powerful tools for a wide range of natural language processing (NLP) tasks, including code generation. Among these cutting-edge models, Code Llama 70B stands out as a true heavyweight, boasting an impressive 70 billion parameters. Developed by Meta and now available on Amazon SageMaker, this state-of-the-art LLM promises to revolutionize the way developers and data scientists approach coding tasks.

What is Code Llama 70B and Mixtral 8x7B?

Code Llama 70B is a variant of the Code Llama foundation model (FM), a fine-tuned version of Meta’s renowned Llama 2 model. This massive language model is specifically designed for code generation and understanding, capable of generating code from natural language prompts or existing code snippets. With its 70 billion parameters, Code Llama 70B offers unparalleled performance and versatility, making it a game-changer in the world of AI-assisted coding.

Mixtral 8x7B is a state-of-the-art sparse mixture of experts (MoE) foundation model released by Mistral AI. It supports multiple use cases such as text summarization, classification, text generation, and code generation. It is an 8x model, which means it contains eight distinct groups of parameters. The model has about 45 billion total parameters and supports a context length of 32,000 tokens. MoE is a type of neural network architecture that consists of multiple "experts," where each expert is a neural network. In the context of transformer models, MoE replaces some feed-forward layers with sparse MoE layers. These layers have a certain number of experts, and a router network selects which experts process each token at each layer. MoE models enable more compute-efficient and faster inference compared to dense models.
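
To illustrate the routing idea, the following PyTorch snippet is a simplified sketch of a sparse MoE layer (a conceptual toy, not Mixtral's actual implementation): a router scores each token, the top two experts process it, and their outputs are combined using the routing weights.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):     # route each token to its chosen experts
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out


moe = SparseMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])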

Key features and capabilities of Code Llama 70B and Mixtral 8x7B include:

  1. Code generation: These LLMs excel at generating high-quality code across a wide range of programming languages, including Python, Java, C++, and more. They can translate natural language instructions into functional code, streamlining the development process and accelerating project timelines.
  2. Code infilling: In addition to generating new code, they can seamlessly infill missing sections of existing code by providing the prefix and suffix. This feature is particularly valuable for enhancing productivity and reducing the time spent on repetitive coding tasks.
  3. Natural language interaction: The instruct variants of Code Llama 70B and Mixtral 8x7B support natural language interaction, allowing developers to engage in conversational exchanges to develop code-based solutions. This intuitive interface fosters collaboration and enhances the overall coding experience.
  4. Long context support: With the ability to handle context lengths of up to 48 thousand tokens, Code Llama 70B can maintain coherence and consistency over extended code segments or conversations, ensuring relevant and accurate responses. Mixtral 8x7B has a context window of 32 thousand tokens.
  5. Multi-language support: While both of these models excel at generating code, their capabilities extend beyond programming languages. They can also assist with natural language tasks, such as text generation, summarization, and question answering, making them versatile tools for various applications.

Harnessing the power of Code Llama 70B and Mistral models on SageMaker

Amazon SageMaker, a fully managed machine learning service, provides a seamless integration with Code Llama 70B, enabling developers and data scientists to use its capabilities with just a few clicks. Here’s how you can get started:

  1. One-click deployment: Code Llama 70B and Mixtral 8x7B are available in Amazon SageMaker JumpStart, a hub that provides access to pre-trained models and solutions. With a few clicks, you can deploy them and create a private inference endpoint for your coding tasks.
  2. Scalable infrastructure: The SageMaker scalable infrastructure ensures that foundation models can handle even the most demanding workloads, allowing you to generate code efficiently and without delays.
  3. Integrated development environment: SageMaker provides a seamless integrated development environment (IDE) that you can use to interact with these models directly from your coding environment. This integration streamlines the workflow and enhances productivity.
  4. Customization and fine-tuning: While Code Llama 70B and Mixtral 8x7B are powerful out-of-the-box models, you can use SageMaker to fine-tune and customize a model to suit your specific needs, further enhancing its performance and accuracy.
  5. Security and compliance: SageMaker JumpStart employs multiple layers of security, including data encryption, network isolation, VPC deployment, and customizable inference, to ensure the privacy and confidentiality of your data when working with LLMs.

Solution overview

The following figure showcases how code generation can be done using the Llama and Mistral AI Models on SageMaker presented in this blog post.

You first deploy a SageMaker endpoint using an LLM from SageMaker JumpStart. For the examples presented in this article, you either deploy a Code Llama 70B or a Mixtral 8x7B endpoint. After the endpoint has been deployed, you can use it to generate code with the prompts provided in this article and the associated notebook, or with your own prompts. After the code has been generated with the endpoint, you can use a notebook to test the code and its functionality.
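
If you prefer to deploy the JumpStart endpoint programmatically instead of through the Studio UI, the following SageMaker Python SDK sketch shows the general pattern; the model ID string and instance type are assumptions that you should verify against the JumpStart model catalog for your Region.

from sagemaker.jumpstart.model import JumpStartModel

# Model ID is an assumption; look up the exact ID for Code Llama 70B (or Mixtral 8x7B
# Instruct) in the SageMaker JumpStart catalog for your Region.
model = JumpStartModel(model_id="meta-textgeneration-llama-codellama-70b")

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    accept_eula=True,  # Code Llama requires accepting the model EULA
)
print(predictor.endpoint_name)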

Prerequisites

In this section, you sign up for an AWS account and create an AWS Identity and Access Management (IAM) admin user.

If you’re new to SageMaker, we recommend that you read What is Amazon SageMaker?.

Use the following hyperlinks to finish setting up the prerequisites for an AWS account and SageMaker:

  1. Create an AWS Account: This walks you through setting up an AWS account
  2. When you create an AWS account, you get a single sign-in identity that has complete access to all of the AWS services and resources in the account. This identity is called the AWS account root user.
  3. Signing in to the AWS Management Console using the email address and password that you used to create the account gives you complete access to all of the AWS resources in your account. We strongly recommend that you not use the root user for everyday tasks, even the administrative ones.
  4. Adhere to the security best practices in IAM, and Create an Administrative User and Group. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.
  5. In the console, go to the SageMaker console and open the left navigation pane.
    1. Under Admin configurations, choose Domains.
    2. Choose Create domain.
    3. Choose Set up for single user (Quick setup). Your domain and user profile are created automatically.
  6. Follow the steps in Custom setup to Amazon SageMaker to set up SageMaker for your organization.

With the prerequisites complete, you’re ready to continue.

Code generation scenarios

The Mixtral 8x7B and Code Llama 70B models require an ml.g5.48xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. In order to deploy an endpoint using SageMaker JumpStart, you might need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint use. You can request service quota increases through the AWS console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.

Code Llama use cases with SageMaker

While Code Llama excels at generating simple functions and scripts, its capabilities extend far beyond that. The models can generate complex code for advanced applications, such as building neural networks for machine learning tasks. Let’s explore an example of using Code Llama to create a neural network on SageMaker. Let’s start by deploying the Code Llama model through SageMaker JumpStart.

  1. Launch SageMaker JumpStart
    Sign in to the console, navigate to SageMaker, and launch the SageMaker domain to open SageMaker Studio. Within SageMaker Studio, select JumpStart in the left-hand navigation menu.
  2. Search for Code Llama 70B
    In the JumpStart model hub, search for Code Llama 70B in the search bar. You should see the Code Llama 70B model listed under the Models category.
  3. Deploy the Model
    Select the Code Llama 70B model, and then choose Deploy. Enter an endpoint name (or keep the default value) and select the target instance type (for example, ml.g5.48xlarge). Choose Deploy to start the deployment process. You can leave the rest of the options as default.

Additional details on deployment can be found in Code Llama 70B is now available in Amazon SageMaker JumpStart

  1. Create an inference endpoint
    After the deployment is complete, SageMaker will provide you with an inference endpoint URL. Copy this URL to use later.
  2. Set up your development environment
    You can interact with the deployed Code Llama 70B model using Python and the AWS SDK for Python (Boto3). First, make sure you have the required dependencies installed: pip install boto3

Note: This blog post section contains code that was generated with the assistance of Code Llama 70B powered by Amazon SageMaker.

Generating a transformer model for natural language processing

Let us walk through a code generation example with Code Llama 70B where you will generate a Transformer model in Python using the Amazon SageMaker SDK.

Prompt:

<s>[INST]
<<SYS>>You are an expert code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself. Always use Amazon SageMaker SDK for python code generation. Add test case to test the code<</SYS>>

Generate a Python code that defines and trains a Transformer model for text classification on movie dataset. The python code should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]

Response:

Code Llama generates a Python script for training a Transformer model on the sample dataset using TensorFlow and Amazon SageMaker.

Code example:
Create a new Python script (for example, code_llama_inference.py) and add the following code. Replace <YOUR_ENDPOINT_NAME> with the actual inference endpoint name provided by SageMaker JumpStart:

import boto3
import json

# Set up the SageMaker client
session = boto3.Session()
sagemaker_client = session.client("sagemaker-runtime")

# Set the inference endpoint URL
endpoint_name = "<YOUR_ENDPOINT_NAME>"

def query_endpoint(payload):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload).encode('utf-8'),
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

def print_completion(prompt: str, response: str) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}{bold}\n> Output{unbold}\n{response[0]['generated_text']}\n")

# Define the input prompt

prompt = """

<s>[INST]
<<SYS>>You are an expert code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself. Always use Amazon SageMaker SDK for python code generation. Add test case to test the code<</SYS>>
 
Generate a Python code that defines and trains a Transformer model for text classification on movie dataset. The python code should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]
"""



# Send the request to the endpoint and decode the response

payload = {"inputs": prompt,
"parameters": {"max_new_tokens": 2000, "temperature": 0.2, "top_p": 0.9},
}


response = query_endpoint(payload)

print_completion(prompt, response)

Save the script and run it:

python code_llama_inference.py

The script will send the provided prompt to the Code Llama 70B model deployed on SageMaker, and the model’s response will be printed to the output.

Example output:

Input

<s>[INST]
<<SYS>>
You are an expert AI assistant specializing in generating Python code for natural language processing (NLP) tasks on Amazon SageMaker. Your code should follow best practices, be well-documented, and include examples of usage.
<</SYS>>

Generate a Python script that defines and trains a Transformer model for text classification on movie dataset. The script should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]


> Output

[RESP]
import os
import json
import sagemaker
import boto3
import numpy as np
import pandas as pd
from sagemaker.tensorflow import TensorFlow
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

# Set up SageMaker session
sagemaker_session = sagemaker.Session()
region = boto3.Session().region_name
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-xSum"

# Download dataset
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Prepare dataset
train_data = pd.DataFrame(
{"text": train.data, "label": train.target.astype(np.int32)}
)
test_data = pd.DataFrame({"text": test.data, "label": test.target.astype(np.int32)})

# Split dataset
train_data, validation_data = train_test_split(train_data, test_size=0.2)

# Upload dataset to S3
train_input = sagemaker_session.upload_data(
path=train_data.to_csv(header=False, index=False),
bucket=bucket,
key_prefix=os.path.join(prefix, "train"),
)
validation_input = sagemaker_session.upload_data(
path=validation_data.to_csv(header=False, index=False),
bucket=bucket,
key_prefix=os.path.join(prefix, "validation"),
)
test_input = sagemaker_session.upload_data(
path=test_data.to_csv(header=False, index=False),
bucket=bucket,
key_prefix=os.path.join(prefix, "test"),
)

# Define hyperparameters
hyperparameters = {
"epochs": 1,
"batch_size": 64,
"max_seq_length": 256,
"model_dir": "/opt/ml/model",
"train_dir": "/opt/ml/input/data/train",
"validation_dir": "/opt/ml/input/data/validation",
"test_dir": "/opt/ml/input/data/test",
"train_filename": "train.csv",
"validation_filename": "validation.csv",
"test_filename": "test.csv",
"freeze_bert_layer": True,
"learning_rate": 0.0001,
"epsilon": 1e-08,
"bert_trainable_layers": None,
"bert_layer": -2,
"lambda_regularizer": 0.01,
"gradient_accumulation_steps": 2,
"steps_per_epoch": 100,
"validation_steps": 10,
"test_steps": 10,
"output_dir": "/opt/ml/output",
}

# Define estimator
estimator = TensorFlow(
entry_point="train.py",
source_dir="code",
role=sagemaker.get_execution_role(),
instance_count=1,
instance_type="ml.p3.2xlarge",
framework_version="2.4.1",
py_version="py37",
hyperparameters=hyperparameters,
)

# Train model
estimator.fit(
{
"train": train_input,
"validation": validation_input,
"test": test_input,
},
wait=True,
logs=True,
)

# Deploy model
predictor = estimator.deploy(
initial_instance_count=1, instance_type="ml.p3.2xlarge", wait=True
)

# Predict
predictor.predict(json.dumps({"instances": [test_data.iloc[0]["text"]]}))
<</RESP>>

You can modify the prompt variable to request different code generation tasks or engage in natural language interactions with the model.

This example demonstrates how to deploy and interact with the Code Llama 70B model on SageMaker JumpStart using Python and the AWS SDK. Because the model might be prone to minor errors in generating the response output, make sure you run the code. Further, you can instruct the model to fact-check the output and refine the model response in order to fix any remaining errors in the code. With this setup, you can leverage the powerful code generation capabilities of Code Llama 70B within your development workflows, streamlining the coding process and unlocking new levels of productivity. Let’s take a look at some additional examples.

Additional examples and use cases

Let’s walk through some other complex code generation scenarios. In the following sample, we’re running the script to generate a Deep Q reinforcement learning (RL) agent for playing the CartPole-v1 environment.

Generating a reinforcement learning agent

The following prompt was tested on Code Llama 70B to generate a Deep Q RL agent adept at playing the CartPole-v1 environment.

Prompt:

<s>[INST]
<<SYS>>
You are a skilled AI assistant capable of generating Python code for reinforcement learning tasks on Amazon SageMaker. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>

Could you please generate a Python script that implements a Deep Q-Network (DQN) agent for playing the CartPole-v1 environment? The script should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]

Response: Code Llama generates a Python script for training a DQN agent on the CartPole-v1 environment using TensorFlow and Amazon SageMaker as showcased in our GitHub repository.

Generating a distributed training script

In this scenario, you will generate sample Python code for distributed machine learning training on Amazon SageMaker using Code Llama 70B.

Prompt:

<s>[INST]
<<SYS>>
You are an expert AI assistant skilled in generating Python code for distributed machine learning training on Amazon SageMaker. Your code should be optimized for performance, follow best practices, and include examples of usage.
<</SYS>>

Could you please generate a Python script that performs distributed training of a deep neural network for image classification on the ImageNet dataset? The script should use Amazon SageMaker's PyTorch estimator with distributed data parallelism and be ready for deployment on SageMaker.
[/INST]

Response: Code Llama generates a Python script for distributed training of a deep neural network on the ImageNet dataset using PyTorch and Amazon SageMaker. Additional details are available in our GitHub repository.

Mixtral 8x7B use cases with SageMaker

Compared to traditional LLMs, Mixtral 8x7B offers the advantage of faster decoding at the speed of a smaller, parameter-dense model despite containing more parameters. It also outperforms other open-access models on certain benchmarks and supports a longer context length.

  1. Launch SageMaker JumpStart
    Sign in to the console, navigate to SageMaker, and launch the SageMaker domain to open SageMaker Studio. Within SageMaker Studio, select JumpStart in the left-hand navigation menu.
  2. Search for Mixtral 8x7B Instruct
    In the JumpStart model hub, search for Mixtral 8x7B Instruct in the search bar. You should see the Mixtral 8x7B Instruct model listed under the Models category.
  3. Deploy the Model
    Select the Mixtral 8x7B Instruct model, and then choose Deploy. Enter an endpoint name (or keep the default value) and choose the target instance type (for example, ml.g5.48xlarge). Choose Deploy to start the deployment process. You can leave the rest of the options as default.

Additional details on deployment can be found in Mixtral-8x7B is now available in Amazon SageMaker JumpStart.

  1. Create an inference endpoint
    After the deployment is complete, SageMaker will provide you with an inference endpoint URL. Copy this URL to use later.

Generating a hyperparameter tuning script for SageMaker

Hyperparameters are external configuration variables that data scientists use to manage machine learning model training. Sometimes called model hyperparameters, they are set manually before training a model. They’re different from model parameters, which are learned automatically during training rather than set by data scientists. Hyperparameters directly control model structure, function, and performance.

When you build complex machine learning systems like deep learning neural networks, exploring all the possible combinations is impractical. Hyperparameter tuning can accelerate your productivity by trying many variations of a model. It looks for the best model automatically by focusing on the most promising combinations of hyperparameter values within the ranges that you specify. To get good results, you must choose the right ranges to explore.

SageMaker automatic model tuning (AMT) finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create a model that performs the best, as measured by a metric that you choose.

Note: This section contains code that was generated with the assistance of the Mixtral 8x7B model on Amazon SageMaker.

Prompt:

<s>[INST]
<<SYS>>
You are an AI assistant specializing in generating Python code for hyperparameter tuning of machine learning models on Amazon SageMaker. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>

Generate a Python script that performs hyperparameter tuning of an XGBoost model for regression on the Boston Housing dataset. The script should use Amazon SageMaker's XGBoost estimator and be ready for deployment on SageMaker. It should optimize for validation:accuracy, and optimize the hyperparameters alpha and eta.

[/INST]

Response:

Here's a Python script that performs hyperparameter tuning of an XGBoost model for regression on the Boston Housing dataset using Amazon SageMaker. This script uses the XGBoost estimator and optimizes for validation:accuracy while tuning the hyperparameters `alpha` and `eta`.

```python
import boto3
import numpy as np
import pandas as pd
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# Set up SageMaker session and specify the IAM role
sagemaker_session = boto3.Session(region_name='us-west-2')
role = sagemaker_session.get_execution_role()

# Specify the SageMaker XGBoost image version
xgboost_image = 'xgboost:1.3-1'

# Define the hyperparameter tuning configuration
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0.1, 1.0),
    'eta': ContinuousParameter(0.01, 0.3)
}

objective_metric_name = 'validation:accuracy'
metric_definitions = [
    {
        'Name': 'validation:accuracy',
        'Regex': 'validation-accuracy: ([0-9\.]+)'
    }
]

tuner = HyperparameterTuner(
    XGBoost(
        role=role,
        train_instance_count=1,
        train_instance_type='ml.m5.large',
        sagemaker_session=sagemaker_session,
        hyperparameters={'objective': 'reg:squarederror', 'num_boost_round': 100},
        base_job_name='xgboost-tuner-boston'
    ),
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=10,
    max_parallel_jobs=3
)

# Load the Boston Housing dataset
boston_data = pd.read_csv('boston_housing.csv')
train_data = boston_data.sample(frac=0.8, random_state=42)
test_data = boston_data.drop(train_data.index)

train_input_path = sagemaker_session.upload_data(path=train_data, key_prefix='xgboost-tuner/input/train')
test_input_path = sagemaker_session.upload_data(path=test_data, key_prefix='xgboost-tuner/input/test')

# Set up the training and validation datasets
train_channel = sagemaker_session.input_channel(name='train', path=train_input_path)
test_channel = sagemaker_session.input_channel(name='test', path=test_input_path)

# Define the SageMaker XGBoost estimator for training
xgboost_estimator = XGBoost(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',
    sagemaker_session=sagemaker_session,
    hyperparameters={'objective': 'reg:squarederror', 'num_boost_round': 100},
    base_job_name='xgboost-tuner-boston'
)

# Perform hyperparameter tuning
tuner.fit({'train': train_channel, 'test': test_channel}, include_cls_metadata=False)

# Get the best hyperparameters and create a new XGBoost estimator
best_training_job = tuner.best_training_job()
best_hyperparameters = tuner.best_hyperparameters
print(f'Best hyperparameters: {best_hyperparameters}')

new_xgboost_estimator = XGBoost(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',
    sagemaker_session=sagemaker_session,
    hyperparameters=best_hyperparameters,
    base_job_name='xgboost-tuner-boston'
)

# Train the model with the best hyperparameters
new_xgboost_estimator.fit({'train': train_channel, 'test': test_channel}, include_cls_metadata=False)

# Save the trained model for deployment
model_data = new_xgboost_estimator.create_model()
model_name = 'xgboost-tuner-boston-model'
sagemaker_session.create_model(model_name=model_name, model_data=model_data)

# Deploy the model
predictor = sagemaker_session.create_predictor(model_name,
                                                base_job_name='xgboost-tuner-boston',
                                                deploy_prebuilt_image=True,
                                                entry_point='inference.py',
                                                source_dir='xgboost-tuner',
                                                instance_type='ml.m5.large',
                                                initial_instance_count=1)
```

Replace `'boston_housing.csv'` with the path to your Boston Housing dataset. This script assumes that the dataset has columns named `'CRIM'`, `'ZN'`, `'INDUS'`, `'CHAS'`, `'NOX'`, `'RM'`, `'AGE'`, `'DIS'`, `'RAD'`, `'TAX'`, `'PTRATIO'`, `'B'`, `'LSTAT'`, and `'MEDV'`.

To run this script, you need to have the AWS CLI and Boto3 installed. Make sure to configure your AWS credentials using `aws configure`.

The script also requires a file named `inference.py` in the `xgboost-tuner` directory for deployment. Here's a simple example of `inference.py`:

```python
import json
import os

import numpy as np
import pandas as pd
import xgboost as xgb

def model_fn(model_dir):
    model = xgb.Booster(model_file=os.path.join(model_dir, 'model.joblib'))
    return model

def input_fn(request_body, content_type):
    if content_type == 'text/csv':
        features = np.loadtxt(pd.compat.StringIO(request_body), delimiter=',')
        return {'features': features.reshape(1, -1)}
    raise ValueError("Request body must be csv-formatted.")

def predict_fn(input_data, model):
    features = input_data['features']
    prediction = model.predict(features)
    return {'prediction': float(prediction)}
```

Make sure to install the required libraries (`boto3`, `numpy`, `pandas`, `xgboost`, and `sagemaker`) before running the script.

Code Transformation: Java to Python

There are instances where users need to convert code written in one programming language to another. This is known as a cross-language transformation task, and foundation models can help automate the process.

Prompt:

<s>[INST]
<<SYS>>
You are a programmer specialized in Java and Python languages. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>
Convert the following Java code to its equivalent in Python:

<java>
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        ArrayList<String> cars = new ArrayList<String>();
        cars.add("Volvo");
        cars.add("BMW");
        cars.add("Ford");
        cars.add("Mazda");
        for (int i = 0; i < cars.size(); i++) {
            System.out.println(cars.get(i));
        }
    }
}
</java>
[/INST]

Response:

<python>
from java.util import ArrayList

def main():
    # Create an ArrayList of strings
    cars = ArrayList[str]()

    # Add elements to the ArrayList
    cars.add("Volvo")
    cars.add("BMW")
    cars.add("Ford")
    cars.add("Mazda")

    # Iterate over the ArrayList and print each element
    for i in range(cars.size()):
        print(cars.get(i))

if __name__ == "__main__":
    main()
</python>

Note: This Python code uses the Java ArrayList class through the JPype library, which allows Python programs to call Java code.
The code above will only work in an environment that has JPype installed and configured.

For a pure Python solution, here is the equivalent code:

<python>
def main():
    # Create a list of strings
    cars = ["Volvo", "BMW", "Ford", "Mazda"]

    # Iterate over the list and print each element
    for car in cars:
        print(car)

if __name__ == "__main__":
    main()
</python>

This Python code uses a built-in list data structure instead of the Java ArrayList class. The code above is more idiomatic and efficient in Python.

AWS CDK code for a three-tier web application

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework for defining cloud infrastructure as code with modern programming languages and deploying it through AWS CloudFormation.

The three-tier architecture pattern provides a general framework to ensure decoupled and independently scalable application components can be separately developed, managed, and maintained (often by distinct teams). A three-tier architecture is the most popular implementation of a multi-tier architecture and consists of a single presentation tier, logic tier, and data tier:

  • Presentation tier: Component that the user directly interacts with (for example, webpages and mobile app UIs).
  • Logic tier: Code required to translate user actions to application functionality (for example, CRUD database operations and data processing).
  • Data tier: Storage media (for example, databases, object stores, caches, and file systems) that hold the data relevant to the application.

Prompt:

<s><<SYS>>
You are a programmer specialized in AWS, and programming in Python and the AWS Cloud Development Kit. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>[INST]
Create a CDK script in Python for configuring and deploying a three tier web application in AWS. The architecture should consist of these three tiers:

- Presentation tier: The front-end application is all static content (HTML, CSS, JavaScript and images) which are generated by React utilities like create-react-app. Amazon CloudFront hosts all these objects. The web application, when used, downloads all the resources to the browser and starts to run from there. The web application connects to the backend calling the APIs.

- Logic tier: Logic layer is built using Lambda functions fronted by API Gateway REST APIs. This architecture shows multiple exposed services. There are multiple different Lambda functions each handling a different aspect of the application. The Lambda functions are behind API Gateway and accessible using API URL paths.

- Data tier: DynamoDB is used for the data storage.
[/INST]

Response:

Sure, here's an example CDK script in Python that configures and deploys a three-tier web application in AWS:
```python
from aws_cdk import (
    aws_apigateway as apigw,
    aws_lambda as _lambda,
    aws_lambda_event_sources as lambda_event_sources,
    aws_logs as logs,
    aws_dynamodb as ddb,
    core
)

class ThreeTierWebAppStack(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Data tier
        self.table = ddb.Table(
            self, "DataTable",
            partition_key=ddb.Attribute(name="id", type=ddb.AttributeType.STRING),
            billing_mode=ddb.BillingMode.PAY_PER_REQUEST
        )

        # Logic tier
        self.lambda_function = _lambda.Function(
            self, "LambdaFunction",
            runtime=_lambda.Runtime.PYTHON_3_8,
            code=_lambda.Code.from_asset("lambda"),
            handler="lambda_function.lambda_handler",
            log_retention=logs.RetentionDays.ONE_DAY,
            environment={
                "TABLE_NAME": self.table.table_name
            }
        )

        self.lambda_function.add_to_role_policy(
            core.PolicyStatement(
                actions=[
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:UpdateItem",
                    "dynamodb:DeleteItem"
                ],
                resources=[self.table.table_arn]
            )
        )

        self.api = apigw.LambdaRestApi(
            self, "Api",
            handler=self.lambda_function,
            proxy=False
        )

        # Presentation tier
        self.distribution = aws_cloudfront.Distribution(
            self, "Distribution",
            default_root_object="index.html",
            default_behavior_options=aws_cloudfront.BehaviorOptions(
                origin=aws_cloudfront.Origin(
                    domain_name=self.api.url,
                    origin_path="/integration/v1"
                ),
                viewer_protocol_policy=aws_cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS
            )
        )
```
Here's a brief explanation of the code:

* The `ThreeTierWebAppStack` class extends the `core.Stack` class and is the main entry point for the CDK script.
* The `table` attribute is a DynamoDB table used for data storage.
* The `lambda_function` attribute is a Lambda function that handles the business logic of the application. It has an IAM role that allows it to read and write to the DynamoDB table.
* The `api` attribute is an API Gateway REST API that exposes the Lambda function as a web service.
* The `distribution` attribute is a CloudFront distribution that serves the static content of the application. It uses the API Gateway as the origin and redirects all HTTP traffic to HTTPS.

To use this script, you need to create a `lambda` directory in the same directory as the script and put your Lambda function code in a file named `lambda_function.py` in that directory. You also need to install the AWS CDK and its dependencies using `pip install aws-cdk.core aws-cdk.aws-lambda aws-cdk.aws-lambda_event_sources aws-cdk.aws-logs aws-cdk.aws-dynamodb aws-cdk.aws-apigateway aws-cdk.aws-cloudfront`.

Once you have done that, you can deploy the stack using `cdk deploy`. This will create the DynamoDB table, the Lambda function, the API Gateway, and the CloudFront distribution. The URL of the CloudFront distribution will be printed to the console after the deployment is complete. You can use this URL to access the web application.

Additional considerations

The following are some additional considerations when implementing these models:

  • Different models will produce different results, so you should conduct experiments with different foundation models and different prompts for your use case to achieve the desired results.
  • The analyses provided are not meant to replace human judgement. You should be mindful of potential hallucinations when working with generative AI, and use the analysis only as a tool to assist and speed up code generation.

Clean up

Delete the model endpoints deployed using Amazon SageMaker for Code Llama 70B and Mixtral 8x7B to avoid incurring additional costs in your account.

Shut down any SageMaker Notebook instances that were created for deploying or running the examples showcased in this blog post to avoid any notebook instance costs associated with the account.

Conclusion

The combination of exceptional foundation models like Code Llama 70B and Mixtral 8x7B with the powerful machine learning platform of SageMaker presents a unique opportunity for developers and data scientists to revolutionize their coding workflows. The cutting-edge capabilities of FMs empower customers to generate high-quality code, infill missing sections, and engage in natural language interactions, all while using the scalability, security, and compliance of AWS.

The examples highlighted in this blog post demonstrate these models’ advanced capabilities in generating complex code for various machine learning tasks, such as natural language processing, reinforcement learning, distributed training, and hyperparameter tuning, all tailored for deployment on SageMaker. Developers and data scientists can now streamline their workflows, accelerate development cycles, and unlock new levels of productivity in the AWS Cloud.

Embrace the future of AI-assisted coding and unlock new levels of productivity with Code Llama 70B and Mixtral 8x7B on Amazon SageMaker. Start your journey today and experience the transformative power of these groundbreaking models.

References

  1. Code Llama 70B is now available in Amazon SageMaker JumpStart
  2. Fine-tune Code Llama on Amazon SageMaker JumpStart
  3. Mixtral-8x7B is now available in Amazon SageMaker JumpStart

About the Authors

Shikhar Kwatra is an AI/ML Solutions Architect at Amazon Web Services based in California. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partners in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Jose Navarro is an AI/ML Solutions Architect at AWS based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Read More

Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart

Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart

Today, we are excited to announce that the Jina Embeddings v2 model, developed by Jina AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running model inference. This state-of-the-art model supports an impressive 8,192-tokens context length. You can deploy this model with SageMaker JumpStart, a machine learning (ML) hub with foundation models, built-in algorithms, and pre-built ML solutions that you can deploy with just a few clicks.

Text embedding refers to the process of transforming text into numerical representations that reside in a high-dimensional vector space. Text embeddings have a broad range of applications in enterprise artificial intelligence (AI), including the following:

  • Multimodal search for ecommerce
  • Content personalization
  • Recommender systems
  • Data analytics

Jina Embeddings v2 is a state-of-the-art collection of text embedding models, trained by Berlin-based Jina AI, that boast high performance on several public benchmarks.

In this post, we walk through how to discover and deploy the jina-embeddings-v2 model as part of a Retrieval Augmented Generation (RAG)-based question answering system in SageMaker JumpStart. You can use this tutorial as a starting point for a variety of chatbot-based solutions for customer service, internal support, and question answering systems based on internal and private documents.

What is RAG?

RAG is the process of optimizing the output of a large language model (LLM) so it references an authoritative knowledge base outside of its training data sources before generating a response.

LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

What does Jina Embeddings v2 bring to RAG applications?

A RAG system uses a vector database to serve as a knowledge retriever. It must extract a query from a user’s prompt and send it to a vector database to reliably find as much semantic information as possible. The following diagram illustrates the architecture of a RAG application with Jina AI and Amazon SageMaker.

Jina Embeddings v2 is the preferred choice for experienced ML scientists for the following reasons:

  • State-of-the-art performance – We have shown on various text embedding benchmarks that Jina Embeddings v2 models excel on tasks such as classification, reranking, summarization, and retrieval. Some of the benchmarks demonstrating their performance are MTEB, an independent study of combining embedding models with reranking models, and the LoCo benchmark by a Stanford University group.
  • Long input-context length – Jina Embeddings v2 models support 8,192 input tokens. This makes the models especially powerful at tasks such as clustering for long documents like legal text or product documentation.
  • Support for bilingual text input – Recent research shows that multilingual models without specific language training show strong biases towards English grammatical structures in embeddings. Jina AI’s bilingual embedding models include jina-embeddings-v2-base-de, jina-embeddings-v2-base-zh, jina-embeddings-v2-base-es, and jina-embeddings-v2-base-code. They were trained to encode texts in a combination of English-German, English-Chinese, English-Spanish, and English-Code, respectively, allowing the use of either language as the query or target document in retrieval applications.
  • Cost-effectiveness of operating – Jina Embeddings v2 provides high performance on information retrieval tasks with relatively small models and compact embedding vectors. For example, jina-embeddings-v2-base-de has a size of 322 MB with a performance score of 60.1%. A smaller vector size translates into significant cost savings when storing embeddings in a vector database.

What is SageMaker JumpStart?

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. Developers can deploy foundation models to dedicated SageMaker instances within a network-isolated environment, and customize models using SageMaker for model training and deployment.

You can now discover and deploy a Jina Embeddings v2 model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines and Amazon SageMaker Debugger. With SageMaker JumpStart, the model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.

Jina Embeddings models are available in AWS Marketplace so you can integrate them directly into your deployments when working in SageMaker.

AWS Marketplace enables you to find third-party software, data, and services that run on AWS and manage them from a centralized location. AWS Marketplace includes thousands of software listings and simplifies software licensing and procurement with flexible pricing options and multiple deployment methods.

Solution overview

We’ve prepared a notebook that constructs and runs a RAG question answering system using Jina Embeddings and the Mixtral 8x7B LLM in SageMaker JumpStart.

In the following sections, we give you an overview of the main steps needed to bring a RAG application to life using generative AI models on SageMaker JumpStart. Although we omit some of the boilerplate code and installation steps in this post for reasons of readability, you can access the full Python notebook to run yourself.

Connecting to a Jina Embeddings v2 endpoint

To start using Jina Embeddings v2 models, complete the following steps:

  1. In SageMaker Studio, choose JumpStart in the navigation pane.
  2. Search for “jina” and you will see the provider page link and models available from Jina AI.
  3. Choose Jina Embeddings v2 Base – en, which is Jina AI’s English language embeddings model.
  4. Choose Deploy.
  5. In the dialog that appears, choose Subscribe, which will redirect you to the model’s AWS Marketplace listing, where you can subscribe to the model after accepting the terms of usage.
  6. After subscribing, return to SageMaker Studio and choose Deploy.
  7. You will be redirected to the endpoint configuration page, where you can select the instance most suitable for your use case and provide a name for the endpoint.
  8. Choose Deploy.

After you create the endpoint, you can connect to it with the following code snippet:

from jina_sagemaker import Client
 
client = Client(region_name=region)
# Make sure you gave the JumpStart endpoint the name my-jina-embeddings-endpoint in the previous step.
endpoint_name = "my-jina-embeddings-endpoint"
 
client.connect_to_endpoint(endpoint_name=endpoint_name)
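
As a quick check that the endpoint is reachable, you can embed a short sample sentence. This sketch reuses the same client.embed call and response shape (response[0]['embedding']) that appear later in this post:

# Embed a short sample text to verify the endpoint responds
sample_response = client.embed(texts=["Jina Embeddings v2 supports long documents."])
print(len(sample_response[0]["embedding"]))  # dimensionality of the returned embedding vector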

Preparing a dataset for indexing

In this post, we use a public dataset from Kaggle (CC0: Public Domain) that contains audio transcripts from the popular YouTube channel Kurzgesagt – In a Nutshell, which has over 20 million subscribers.

Each row in this dataset contains the title of a video, its URL, and the corresponding text transcript.
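
Assuming you downloaded the Kaggle export locally, you can load the transcripts into a pandas DataFrame. This is a minimal sketch: the file name is a placeholder, and the tqdm.pandas() call enables the progress_apply used later in this post:

import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # enables df.progress_apply, used when generating embeddings below

# Placeholder file name -- point this at your local copy of the Kaggle dataset
df = pd.read_csv("kurzgesagt_transcripts.csv")  # expects 'Title' and 'Text' columns, among others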

Enter the following code:

df.head()

Because the transcripts of these videos can be quite long (the videos run around 10 minutes each), you can chunk each transcript before indexing it, so that retrieval returns only the content relevant to a user’s question rather than unrelated parts of the transcript:

def chunk_text(text, max_words=1024):
    """
    Divide text into chunks where each chunk contains the maximum number of full sentences under `max_words`.
    """
    sentences = text.split('.')
    chunk = []
    word_count = 0
 
    for sentence in sentences:
        sentence = sentence.strip(".")
        if not sentence:
          continue
 
        words_in_sentence = len(sentence.split())
        if word_count + words_in_sentence <= max_words:
            chunk.append(sentence)
            word_count += words_in_sentence
        else:
            # Yield the current chunk and start a new one
            if chunk:
              yield '. '.join(chunk).strip() + '.'
            chunk = [sentence]
            word_count = words_in_sentence
 
    # Yield the last chunk if it's not empty
    if chunk:
        yield '. '.join(chunk).strip() + '.'

The max_words parameter defines the maximum number of words in each chunk of indexed text. Many chunking strategies that are more sophisticated than a simple word limit exist in academic and non-peer-reviewed literature. However, for simplicity, we use this technique in this post.
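
To see the chunking behavior in isolation, you can run chunk_text on a short, made-up string with a small max_words value; this is purely illustrative:

sample_text = "Embeddings turn text into vectors. Vectors enable semantic search. Semantic search powers RAG."
for chunk in chunk_text(sample_text, max_words=8):
    print(chunk)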

Index text embeddings for vector search

After you chunk the transcript text, you obtain embeddings for each chunk and link each chunk back to the original transcript and video title:

def generate_embeddings(text_df):
    """
    Generate an embedding for each chunk created in the previous step.
    """

    chunks = list(chunk_text(text_df['Text']))
    embeddings = []
 
    for i, chunk in enumerate(chunks):
      response = client.embed(texts=[chunk])
      chunk_embedding = response[0]['embedding']
      embeddings.append(np.array(chunk_embedding))
 
    text_df['chunks'] = chunks
    text_df['embeddings'] = embeddings
    return text_df
 
print("Embedding text chunks ...")
df = df.progress_apply(generate_embeddings, axis=1)

The dataframe df contains a column titled embeddings that can be loaded into any vector database of your choice. Embeddings can then be retrieved from the vector database using a function such as find_most_similar_transcript_segment(query, n), which retrieves the n documents closest to the user’s input query.
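
The retrieval function itself isn’t shown in this post; the following is a minimal sketch that skips the vector database and searches the DataFrame directly with cosine similarity, returning (chunk, row_index) pairs in the shape expected by the prompt-construction code that follows:

def find_most_similar_transcript_segment(query, n=3):
    # Embed the user's query with the same endpoint used for the transcript chunks
    query_embedding = np.array(client.embed(texts=[query])[0]["embedding"])

    scored_segments = []
    for row_index, row in df.iterrows():
        for chunk, chunk_embedding in zip(row["chunks"], row["embeddings"]):
            # Cosine similarity between the query embedding and the chunk embedding
            score = np.dot(query_embedding, chunk_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding)
            )
            scored_segments.append((score, chunk, row_index))

    # Return the n highest-scoring (chunk, row_index) pairs
    scored_segments.sort(key=lambda item: item[0], reverse=True)
    return [(chunk, row_index) for _, chunk, row_index in scored_segments[:n]]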

Prompt a generative LLM endpoint

For question answering based on an LLM, you can use the Mistral 7B-Instruct model on SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel
from string import Template

# Define the LLM to be used and deploy through Jumpstart.
jumpstart_model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct", role=role)
model_predictor = jumpstart_model.deploy()

# Define the prompt template to be passed to the LLM
prompt_template = Template("""
  <s>[INST] Answer the question below only using the given context.
  The question from the user is based on transcripts of videos from a YouTube
    channel.
  The context is presented as a ranked list of information in the form of
    (video-title, transcript-segment), that is relevant for answering the
    user's question.
  The answer should only use the presented context. If the question cannot be
    answered based on the context, say so.
 
  Context:
  1. Video-title: $title_1, transcript-segment: $segment_1
  2. Video-title: $title_2, transcript-segment: $segment_2
  3. Video-title: $title_3, transcript-segment: $segment_3
 
  Question: $question
 
  Answer: [/INST]
""")

Query the LLM

Now, for a query sent by a user, you first find the n semantically closest transcript chunks across all Kurzgesagt videos (using the vector distance between the chunk embeddings and the embedding of the user’s query), and then provide those chunks as context to the LLM to answer the query:

# Define the query and insert it into the prompt template together with the context to be used to answer the question
question = "Can climate change be reversed by individuals' actions?"
search_results = find_most_similar_transcript_segment(question)
 
prompt_for_llm = prompt_template.substitute(
    question = question,
    title_1 = df.iloc[search_results[0][1]]["Title"].strip(),
    segment_1 = search_results[0][0],
    title_2 = df.iloc[search_results[1][1]]["Title"].strip(),
    segment_2 = search_results[1][0],
    title_3 = df.iloc[search_results[2][1]]["Title"].strip(),
    segment_3 = search_results[2][0]
)

# Generate the answer to the question passed in the prompt
payload = {"inputs": prompt_for_llm}
model_predictor.predict(payload)

Based on the preceding question, the LLM might respond with an answer such as the following:

Based on the provided context, it does not seem that individuals can solve climate change solely through their personal actions. While personal actions such as using renewable energy sources and reducing consumption can contribute to mitigating climate change, the context suggests that larger systemic changes are necessary to address the issue fully.

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

model_predictor.delete_model()
model_predictor.delete_endpoint()

Conclusion

By taking advantage of the features of Jina Embeddings v2 to develop RAG applications, together with the streamlined access to state-of-the-art models on SageMaker JumpStart, developers and businesses are now empowered to create sophisticated AI solutions with ease.

Jina Embeddings v2’s extended context length, support for bilingual documents, and small model size enable enterprises to quickly build natural language processing use cases based on their internal datasets without relying on external APIs.

Get started with SageMaker JumpStart today, and refer to the GitHub repository for the complete code to run this sample.

Connect with Jina AI

Jina AI remains committed to leadership in bringing affordable and accessible AI embeddings technology to the world. Our state-of-the-art text embedding models support English and Chinese and soon will support German, with other languages to follow.

For more information about Jina AI’s offerings, check out the Jina AI website or join our community on Discord.


About the Authors

Francesco Kruk is a Product Management intern at Jina AI and is completing his Master’s at ETH Zurich in Management, Technology, and Economics. With a strong business background and his knowledge in machine learning, Francesco helps customers implement RAG solutions using Jina Embeddings in an impactful way.

Saahil Ognawala is Head of Product at Jina AI based in Munich, Germany. He leads the development of search foundation models and collaborates with clients worldwide to enable quick and efficient deployment of state-of-the-art generative AI products. With an academic background in machine learning, Saahil is now interested in scaled applications of generative AI in the knowledge economy.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Read More

Detect email phishing attempts using Amazon Comprehend

Detect email phishing attempts using Amazon Comprehend

Phishing is the process of attempting to acquire sensitive information such as usernames, passwords, and credit card details by masquerading as a trustworthy entity using email, telephone, or text messages. There are many types of phishing, depending on the mode of communication and the targeted victims. In an email phishing attempt, an email is sent to a group of people as the mode of communication. There are traditional rule-based approaches to detect email phishing. However, new trends are emerging that are hard to handle with a rule-based approach. There is a need to use machine learning (ML) techniques to augment rule-based approaches for email phishing detection.

In this post, we show how to use Amazon Comprehend Custom to train and host an ML model to classify if the input email is a phishing attempt or not. Amazon Comprehend is a natural-language processing (NLP) service that uses ML to uncover valuable insights and connections in text. You can use Amazon Comprehend to identify the language of the text; extract key phrases, places, people, brands, or events; understand sentiment about products or services; and identify the main topics from a library of documents. You can customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions. Comprehend Custom builds customized NLP models on your behalf, using training data that you provide. Comprehend Custom supports custom classification and custom entity recognition.

Solution overview

This post explains how you can use Amazon Comprehend to easily train and host an ML-based model to detect phishing attempts. The following diagram shows how phishing detection works.

Solution Overview

You can use this solution with your email servers, with emails passing through the phishing detector. When an email is flagged as a phishing attempt, the recipient still receives the email in their mailbox, but it can be displayed with an additional banner that warns the user.

You can use this solution for experimentation with the use case, but AWS recommends building a training pipeline for your environments. For details on how to build a classification pipeline with Amazon Comprehend, see Build a classification pipeline with Amazon Comprehend custom classification.

We walk through the following steps to build the phishing detection model:

  1. Collect and prepare the dataset.
  2. Load the data in an Amazon Simple Storage Service (Amazon S3) bucket.
  3. Create the Amazon Comprehend custom classification model.
  4. Create the Amazon Comprehend custom classification model endpoint.
  5. Test the model.

Prerequisites

Before diving into this use case, complete the following prerequisites:

  1. Set up an AWS account.
  2. Create an S3 bucket. For instructions, see Create your first S3 bucket.
  3. Download the email-trainingdata.csv and upload the file to the S3 bucket.

Collect and prepare the dataset

Your training data should have both phishing and non-phishing emails. Email users within the organization are asked to report phishing through their email clients. Gather these phishing reports, along with examples of non-phishing emails, to prepare the training data. You should have a minimum of 10 examples per class. Label phishing emails as phishing and non-phishing emails as nonphishing. For minimum training requirements, see General quotas for document classification. Although the minimum number of labels per class is a starting point, it’s recommended to provide hundreds of labels per class to achieve good classification performance on new inputs.

For custom classification, you train the model in either single-label mode or multi-label mode. Single-label mode associates a single class with each document. Multi-label mode associates one or more classes with each document. For this case, we use single-label mode: phishing or nonphishing. The individual classes are mutually exclusive. For example, you can classify an email as phishing or nonphishing, but not both.

Custom classification supports models that you train with plain-text documents and models that you train with native documents (such as PDF, Word, or images). For more information about classifier models and their supported document types, see Training classification models. For a plain-text model, you can provide classifier training data as a CSV file or as an augmented manifest file that you create using Amazon SageMaker Ground Truth. The CSV file or augmented manifest file includes the text for each training document and its associated labels. For a native document model, you provide classifier training data as a CSV file. The CSV file includes the file name for each training document and its associated labels. You include the training documents in the S3 input folder for the training job.

For this case, we train a plain-text model using the CSV file format. For each row, the first column contains the class label value. The second column contains an example text document for that class. Each row must end with \n or \r\n characters.

The following example shows a CSV file containing two documents.

CLASS,Text of document 1

CLASS,Text of document 2

The following example shows two rows of a CSV file that trains a custom classifier to detect whether an email message is phishing:

phishing,"Hi, we need account details and SSN information to complete the payment. Please furnish your credit card details in the attached form."

nonphishing,"Dear Sir / Madam, your latest statement was mailed to your communication address. After your payment is received, you will receive a confirmation text message at your mobile number. Thanks, customer support"

For information about preparing your training documents, see Preparing classifier training data.
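
If the reported emails are already collected in a spreadsheet or database export, you can assemble the two-column training CSV with a few lines of pandas. This is a sketch; the source file and column names are placeholders:

import pandas as pd

# Hypothetical export with 'label' and 'email_text' columns
emails = pd.read_csv("reported_emails.csv")

# Comprehend expects the class label in the first column and the document text in the second,
# with no header row
emails[["label", "email_text"]].to_csv("email-trainingdata.csv", header=False, index=False)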

Load the data in the S3 bucket

Load the training data in CSV format to the S3 bucket you created in the prerequisite steps. For instructions, refer to Uploading objects.

Load Data to S3

Create the Amazon Comprehend custom classification model

Custom classification supports two types of classifier models: plain-text models and native document models. A plain-text model classifies documents based on their text content. You can train the plain-text model using documents in one of the following languages: English, Spanish, German, Italian, French, or Portuguese. The training documents for a given classifier must all use the same language. A native document model can process scanned or digital semi-structured documents, such as PDFs, Microsoft Word documents, and images, in their native format. A native document model also classifies documents based on text content, but it can additionally use signals such as the layout of the document. You train a native document model with native documents so the model learns the layout information. The supported semi-structured document types include digital and scanned PDF documents and Word documents; images such as JPG files, PNG files, and single-page TIFF files; and Amazon Textract API output JSON files. AWS recommends using a plain-text model to classify plain-text documents and a native document model to classify semi-structured documents.

Data specification for the custom classification model can be represented as follows.

Data Specification

You can train a custom classifier using either the Amazon Comprehend console or API. Allow several minutes to a few hours for the classification model creation to complete. The length of time varies based on the size of your input documents.

To train a custom classifier on the Amazon Comprehend console, set the following data specification options.

Train Model Data Input

Training Data Output

On the Classifiers page of the Amazon Comprehend console, the new classifier appears in the table, showing Submitted as its status. When the classifier starts processing the training documents, the status changes to Training. When a classifier is ready to use, the status changes to Trained or Trained with warnings. If the status is Trained with Warnings, review the skipped files folder in the classifier training output.

Model Version

If Amazon Comprehend encountered errors during creation or training, the status changes to In error. You can choose a classifier job in the table to get more information about the classifier, including any error messages.

After training the model, Amazon Comprehend tests the custom classifier model. If you don’t provide a test dataset, Amazon Comprehend trains the model with 90% of the training data. It reserves 10% of the training data to use for testing. If you do provide a test dataset, the test data must include at least one example for each unique label in the training dataset.

After Amazon Comprehend completes the custom classifier model training, it creates output files in the Amazon S3 output location that you specified in the CreateDocumentClassifier API request or the equivalent Amazon Comprehend console request. These output files are a confusion matrix and additional outputs for native document models. The format of the confusion matrix varies, depending on whether you trained your classifier using multi-class mode or multi-label mode.

After Amazon Comprehend creates the classifier model, the confusion matrix is available in the confusion_matrix.json file in the Amazon S3 output location. The confusion matrix provides metrics on how well the model performed in training by comparing the labels that the model predicted against the actual document labels. Amazon Comprehend uses a portion of the training data to create the confusion matrix. The following JSON file shows an example of confusion_matrix.json.

Confusion Matrix

Amazon Comprehend provides metrics to help you estimate how well a custom classifier performs. Amazon Comprehend calculates the metrics using the test data from the classifier training job. The metrics accurately represent the performance of the model during training, so they approximate the model performance for classification of similar data.

Use the Amazon Comprehend console or API operations such as DescribeDocumentClassifier to retrieve the metrics for a custom classifier.
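
For example, a minimal boto3 sketch for retrieving the metrics looks like the following; the classifier ARN is a placeholder:

import boto3

comprehend = boto3.client("comprehend")

# Placeholder ARN -- use the ARN of the classifier you trained
response = comprehend.describe_document_classifier(
    DocumentClassifierArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/phishing-classifier"
)

# Accuracy, precision, recall, and F1 score computed from the training job's test split
print(response["DocumentClassifierProperties"]["ClassifierMetadata"]["EvaluationMetrics"])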

Model Version Performance

The actual output of many binary classification algorithms is a prediction score. The score indicates the system’s certainty that the given observation belongs to the positive class. To make the decision about whether the observation should be classified as positive or negative, as a consumer of this score, you interpret the score by picking a classification threshold and comparing the score against it. Any observations with scores higher than the threshold are predicted as the positive class, and scores lower than the threshold are predicted as the negative class.

Prediction Score

Create the Amazon Comprehend custom classification model endpoint

After you train a custom classifier, you can classify documents using Real-time analysis or an analysis job. Real-time analysis takes a single document as input and returns the results synchronously. An analysis job is an asynchronous job to analyze large documents or multiple documents in one batch. The following are the different options for using the custom classifier model.

Custom Classification Inference Types

Create an endpoint for the trained model. For instructions, refer to Real-time analysis for custom classification (console). Amazon Comprehend assigns throughput to an endpoint using inference units (IUs). An IU represents a data throughput of 100 characters per second. You can provision the endpoint with up to 10 IUs, and you can scale the endpoint throughput up or down by updating the endpoint. Endpoints are billed in 1-second increments, with a minimum of 60 seconds. Charges continue to accrue from the time you start the endpoint until it is deleted, even if no documents are analyzed.

Create Model Endpoint
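
If you prefer to create the endpoint programmatically, a minimal boto3 sketch looks like the following; the endpoint name and model ARN are placeholders:

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_endpoint(
    EndpointName="phishing-detector-endpoint",
    ModelArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/phishing-classifier",
    DesiredInferenceUnits=1,  # 1 IU = throughput of 100 characters per second
)

print(response["EndpointArn"])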

Test the model

After the endpoint is ready, you can run the real-time analysis from the Amazon Comprehend console.

Real Time Endpoint

The sample input represents the email text, which is used for real-time analysis to detect if the email text is a phishing attempt or not.

Model Inference Input

Amazon Comprehend analyzes the input data using the custom model and displays the discovered classes, along with a confidence score for each class. The insights section shows the inference results with the confidence levels of the nonphishing and phishing classes. You can choose a threshold to determine the class of the inference. In this case, the result is nonphishing because it has a higher confidence score than the phishing class, so the model detects that the input email text is not a phishing email.

Model Inference Output
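
You can run the same real-time analysis from code with the ClassifyDocument API; the following is a minimal sketch that uses the sample phishing text from earlier in this post and a placeholder endpoint ARN:

import boto3

comprehend = boto3.client("comprehend")

sample_email = "Hi, we need account details and SSN information to complete the payment."

# Placeholder endpoint ARN -- use the ARN of the endpoint you created
response = comprehend.classify_document(
    Text=sample_email,
    EndpointArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/phishing-detector-endpoint",
)

# Each class is returned with a confidence score
for result in response["Classes"]:
    print(result["Name"], result["Score"])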

To integrate this capability of phishing detection in your real-world applications, you can use the Amazon API Gateway REST API with an AWS Lambda integration. Refer to the serverless pattern in Amazon API Gateway to AWS Lambda to Amazon Comprehend to know more.

Clean up

When you no longer need your endpoint, delete it so that you stop incurring costs from it. Also, delete the data file from the S3 bucket. For more information on costs, see Amazon Comprehend Pricing.

Model endpoint cleanup

Conclusion

In this post, we walked you through the steps to create a phishing attempt detector using Amazon Comprehend custom classification. You can customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions.

You can also visit the Amazon Comprehend Developer Guide, GitHub repository and Amazon Comprehend developer resources for videos, tutorials, blogs, and more.


About the author

Ajeet Tewari is a Solutions Architect for Amazon Web Services. He works with enterprise customers to help them navigate their journey to AWS. His specialties include architecting and implementing highly scalable OLTP systems and leading strategic AWS initiatives.

Read More

How Skyflow creates technical content in days using Amazon Bedrock

How Skyflow creates technical content in days using Amazon Bedrock

This guest post is co-written with Manny Silva, Head of Documentation at Skyflow, Inc.

Startups move quickly, and engineering is often prioritized over documentation. Unfortunately, this prioritization leads to release cycles that don’t match, where features release but documentation lags behind. This leads to increased support calls and unhappy customers.

Skyflow is a data privacy vault provider that makes it effortless to secure sensitive data and enforce privacy policies. Skyflow experienced this growth and documentation challenge in early 2023 as it expanded globally from 8 to 22 AWS Regions, including China and other areas of the world such as Saudi Arabia, Uzbekistan, and Kazakhstan. The documentation team, consisting of only two people, found itself overwhelmed as the engineering team, with over 60 people, updated the product to support the scale and rapid feature release cycles.

Given the critical nature of Skyflow’s role as a data privacy company, the stakes were particularly high. Customers entrust Skyflow with their data and expect Skyflow to manage it both securely and accurately. The accuracy of Skyflow’s technical content is paramount to earning and keeping customer trust. Although new features were released every other week, documentation for the features took an average of 3 weeks to complete, including drafting, review, and publication. The following diagram illustrates their content creation workflow.

Looking at our documentation workflows, we at Skyflow discovered areas where generative artificial intelligence (AI) could improve our efficiency. Specifically, creating the first draft—often referred to as overcoming the “blank page problem”—is typically the most time-consuming step. The review process could also be long depending on the number of inaccuracies found, leading to additional revisions, additional reviews, and additional delays. Both drafting and reviewing needed to be shorter to make doc target timelines match those of engineering.

To do this, Skyflow built VerbaGPT, a generative AI tool based on Amazon Bedrock. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using the AWS tools without having to manage any infrastructure. With Amazon Bedrock, VerbaGPT is able to prompt large language models (LLMs), regardless of model provider, and uses Retrieval Augmented Generation (RAG) to provide accurate first drafts that make for quick reviews.

In this post, we share how Skyflow improved their workflow to create documentation in days instead of weeks using Amazon Bedrock.

Solution overview

VerbaGPT uses Contextual Composition (CC), a technique that incorporates a base instruction, a template, relevant context to inform the execution of the instruction, and a working draft, as shown in the following figure. For the instruction, VerbaGPT tells the LLM to create content based on the specified template, evaluate the context to see if it’s applicable, and revise the draft accordingly. The template includes the structure of the desired output, expectations for what sort of information should exist in a section, and one or more examples of content for each section to guide the LLM on how to process context and draft content appropriately. With the instruction and template in place, VerbaGPT includes as much available context from RAG results as it can, then sends that off for inference. The LLM returns the revised working draft, which VerbaGPT then passes back into a new prompt that includes the same instruction, the same template, and as much context as it can fit, starting from where the previous iteration left off. This repeats until all context is considered and the LLM outputs a draft matching the included template.

The following figure illustrates how Skyflow deployed VerbaGPT on AWS. The application is used by the documentation team and internal users. The solution involves deploying containers on Amazon Elastic Kubernetes Service (Amazon EKS) that host a Streamlit user interface and a backend LLM gateway that is able to invoke Amazon Bedrock or local LLMs, as needed. Users upload documents and prompt VerbaGPT to generate new content. In the LLM gateway, prompts are processed in Python using LangChain and Amazon Bedrock.

When building this solution on AWS, Skyflow followed these steps:

  1. Choose an inference toolkit and LLMs.
  2. Build the RAG pipeline.
  3. Create a reusable, extensible prompt template.
  4. Create content templates for each content type.
  5. Build an LLM gateway abstraction layer.
  6. Build a frontend.

Let’s dive into each step, including the goals and requirements and how they were addressed.

Choose an inference toolkit and LLMs

The inference toolkit you choose, if any, dictates your interface with your LLMs and what other tooling is available to you. VerbaGPT uses LangChain instead of directly invoking LLMs. LangChain has broad adoption in the LLM community, so there was a present and likely future ability to take advantage of the latest advancements and community support.

When building a generative AI application, there are many factors to consider. For instance, Skyflow wanted the flexibility to interact with different LLMs depending on the use case. We also needed to keep context and prompt inputs private and secure, which meant not using LLM providers who would log that information or fine-tune their models on our data. We needed to have a variety of models with unique strengths at our disposal (such as long context windows or text labeling) and to have inference redundancy and fallback options in case of outages.

Skyflow chose Amazon Bedrock for its robust support of multiple FMs and its focus on privacy and security. With Amazon Bedrock, all traffic remains inside AWS. VerbaGPT’s primary foundation model is Anthropic Claude 3 Sonnet on Amazon Bedrock, chosen for its substantial context length, though it also uses Anthropic Claude Instant on Amazon Bedrock for chat-based interactions.

Build the RAG pipeline

To deliver accurate and grounded responses from LLMs without the need for fine-tuning, VerbaGPT uses RAG to fetch data related to the user’s prompt. By using RAG, VerbaGPT became familiar with the nuances of Skyflow’s features and procedures, enabling it to generate informed and complementary content.

To build your own content creation solution, you collect your corpus into a knowledge base, vectorize it, and store it in a vector database. VerbaGPT includes all of Skyflow’s documentation, blog posts, and whitepapers in a vector database that it can query during inference. Skyflow uses a pipeline to embed content and store the embedding in a vector database. This embedding pipeline is a multi-step process, and everyone’s pipeline is going to look a little different. Skyflow’s pipeline starts by moving artifacts to a common data store, where they are de-identified. If your documents have personally identifiable information (PII), payment card information (PCI), personal health information (PHI), or other sensitive data, you might use a solution like Skyflow LLM Privacy Vault to make de-identifying your documentation straightforward. Next, the pipeline chunks the documents into pieces, then finally calculates vectors for the text chunks and stores them in FAISS, an open source vector store. VerbaGPT uses FAISS because it is fast and straightforward to use from Python and LangChain. AWS also has numerous vector stores to choose from for a more enterprise-level content creation solution, including Amazon Neptune, Amazon Relational Database Service (Amazon RDS) for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon Kendra, Amazon OpenSearch Service, and Amazon DocumentDB (with MongoDB compatibility). The following diagram illustrates the embedding generation pipeline.

When chunking your documents, keep in mind that LangChain’s default splitting strategy can be aggressive. This can result in chunks of content that are so small that they lack meaningful context and result in worse output, because the LLM has to make (largely inaccurate) assumptions about the context, producing hallucinations. This issue is particularly noticeable in Markdown files, where procedures were fragmented, code blocks were divided, and chunks were often only single sentences. Skyflow created its own Markdown splitter to work more accurately with VerbaGPT’s RAG output content.

Create a reusable, extensible prompt template

After you deploy your embedding pipeline and vector database, you can start intelligently prompting your LLM with a prompt template. VerbaGPT uses a system prompt that instructs the LLM how to behave and includes a directive to use content in the Context section to inform the LLM’s response.

The inference process queries the vector database with the user’s prompt, fetches the results above a certain similarity threshold, and includes the results in the system prompt. The solution then sends the system prompt and the user’s prompt to the LLM for inference.
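
A minimal sketch of that retrieval step with a LangChain FAISS store might look like the following; the variable names and the distance threshold are assumptions for illustration:

# vector_store is assumed to be the LangChain FAISS index built by the embedding pipeline
results = vector_store.similarity_search_with_score(user_prompt, k=8)

# FAISS returns a distance score where lower means more similar; keep only close matches
distance_threshold = 0.5  # assumed value -- tune for your corpus
relevant_chunks = [doc.page_content for doc, score in results if score <= distance_threshold]

context = "\n\n".join(relevant_chunks)  # included in the Context section of the system prompt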

The following is a sample prompt for drafting with Contextual Composition. It includes all the necessary components (system prompt, template, context, a working draft, and additional instructions):

System: """You're an expert writer tasked with creating content according to the user's request.
Use Template to structure your output and identify what kind of content should go in each section.
Use WorkingDraft as a base for your response.
Evaluate Context against Template to identify if there is any pertinent information.
If needed, update or refine WorkingDraft using the supplied Context.
Treat User input as additional instruction."""
---
Template: """Write a detailed how-to guide in Markdown using the following template:
# [Title]
This guide explains how to [insert a brief description of the task].
[Optional: Specify when and why your user might want to perform the task.]
...
"""
---
Context: [
  { "text": "To authenticate with Skyflow's APIs and SDKs, you need to create a service account. To create...", "metadata": { "source": "service-accounts.md" }},
  ...
]
---
WorkingDraft: ""
---
User: Create a how-to guide for creating a service account.

Create content templates

To round out the prompt template, you need to define content templates that match your desired output, such as a blog post, how-to guide, or press release. You can jumpstart this step by sourcing high-quality templates. Skyflow sourced documentation templates from The Good Docs Project. Then, we adapted the how-to and concept templates to align with internal styles and specific needs. We also adapted the templates for use in prompt templates by providing instructions and examples per section. By clearly and consistently defining the expected structure and intended content of each section, the LLM was able to output content in the formats needed, while being both informative and stylistically consistent with Skyflow’s brand.

LLM gateway abstraction layer

Amazon Bedrock provides a single API to invoke a variety of FMs. Skyflow also wanted to have inference redundancy and fallback options in case VerbaGPT experienced Amazon Bedrock service limit exceeded errors. To that end, VerbaGPT routes all inference requests through an LLM gateway, an abstraction layer that is invoked before any model call.

The main component of the gateway is the model catalog, which can return a LangChain llm model object for the specified model, updated to include any parameters. You can create this with a simple if/else statement like that shown in the following code:

from langchain.chains import LLMChain
from langchain_community.llms import Bedrock, CTransformers

prompt = ""   		# User input
prompt_template = ""   	# The LangChain-formatted prompt template object
rag_results = get_rag(prompt)   # Results from vector database

# Get chain-able model object and token limit.
def get_model(model: str, options: dict):
    if model == "claude-instant-v1":
        llm = Bedrock(
            model_id="anthropic.claude-instant-v1",
            model_kwargs={"max_tokens_to_sample": options["max_output_tokens"], "temperature": options["temperature"]}
        )
        token_limit = 100000

    elif model == "claude-v2.1":
        llm = Bedrock(
            model_id="anthropic.claude-v2.1",
            model_kwargs={"max_tokens_to_sample":  options["max_output_tokens"], "temperature": options["temperature"]}
        )
        token_limit = 200000

    elif model == "llama-2":
        config = {
            "context_length": 4096,
            "max_new_tokens": options["max_output_tokens"],
            "stop": [
                "Human:",
            ],
        }
        llm = CTransformers(
            model="TheBloke/Llama-2-7b-Chat-GGUF",
            model_file="llama-2-7b-chat.Q4_K_M.gguf",
            model_type="llama",
            config=config,
        )
        token_limit = 4096
  
    return llm, token_limit

# Example inference options (illustrative values)
llm, token_limit = get_model("claude-v2.1", {"max_output_tokens": 2048, "temperature": 0.2})

chain = LLMChain(
    llm=llm,
    prompt=prompt_template
)

response = chain.run({"input": prompt, "context":rag_results})

By mapping standard input formats into the function and handling all custom LLM object construction within the function, the rest of the code stays clean by using LangChain’s llm object.

Build a frontend

The final step was to add a UI on top of the application to hide the inner workings of LLM calls and context. A simple UI is key for generative AI applications, so users can efficiently prompt the LLMs without worrying about the details unnecessary to their workflow. As shown in the solution architecture, VerbaGPT uses Streamlit to quickly build useful, interactive UIs that allow users to upload documents for additional context and draft new documents rapidly using Contextual Composition. Streamlit is Python based, which makes it straightforward for data scientists to be efficient at building UIs.
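The following is a minimal Streamlit sketch of that kind of UI. It is not VerbaGPT’s actual frontend; generate_draft is a hypothetical helper standing in for the LLM gateway and RAG pipeline described earlier.

# Minimal drafting UI sketch. generate_draft() is a hypothetical wrapper around
# the retrieval and LLM gateway code shown earlier in this post.
import streamlit as st

st.title("Draft technical content")

uploaded = st.file_uploader("Optional: upload a document for additional context")
request = st.text_area("Describe the content you need")

if st.button("Create AI draft") and request:
    extra_context = uploaded.read().decode("utf-8") if uploaded else ""
    with st.spinner("Drafting..."):
        draft = generate_draft(request, extra_context)  # hypothetical helper
    st.markdown(draft)
    st.download_button("Download draft", draft, file_name="draft.md")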

Results

By using the power of Amazon Bedrock for inferencing and Skyflow for data privacy and sensitive data de-identification, your organization can significantly speed up the production of accurate, secure technical documents, just like the solution shown in this post. Skyflow was able to use existing technical content and best-in-class templates to reliably produce drafts of different content types in minutes instead of days. For example, given a product requirements document (PRD) and an engineering design document, VerbaGPT can produce drafts for a how-to guide, conceptual overview, summary, release notes line item, press release, and blog post within 10 minutes. Normally, this would take multiple individuals from different departments multiple days each to produce.

The new content flow shown in the following figure moves generative AI to the front of all technical content Skyflow creates. During the “Create AI draft” step, VerbaGPT generates content in the approved style and format in just 5 minutes. Not only does this solve the blank page problem, but first drafts also require less interviewing and less asking engineers to draft content, freeing them to add value through feature development instead.

The security measures Amazon Bedrock provides around prompts and inference aligned with Skyflow’s commitment to data privacy, and allowed Skyflow to use additional kinds of context, such as system logs, without the concern of compromising sensitive information in third-party systems.

As more people at Skyflow used the tool, they wanted additional content types available: VerbaGPT now has templates for internal reports from system logs, email templates from common conversation types, and more. Additionally, although Skyflow’s RAG context is clean, VerbaGPT is integrated with Skyflow LLM Privacy Vault to de-identify sensitive data in user inference inputs, maintaining Skyflow’s stringent standards of data privacy and security even while using the power of AI for content creation.

Skyflow’s journey in building VerbaGPT has drastically shifted content creation, and the toolkit wouldn’t be as robust, accurate, or flexible without Amazon Bedrock. The significant reduction in content creation time—from an average of around 3 weeks to as little as 5 days, and sometimes even a remarkable 3.5 days—marks a substantial leap in efficiency and productivity, and highlights the power of AI in enhancing technical content creation.

Conclusion

Don’t let your documentation lag behind your product development. Start creating your technical content in days instead of weeks, while maintaining the highest standards of data privacy and security. Learn more about Amazon Bedrock and discover how Skyflow can transform your approach to data privacy.

If you’re scaling globally and have privacy or data residency needs for your PII, PCI, PHI, or other sensitive data, reach out to your AWS representative to see if Skyflow is available in your region.


About the authors

Manny Silva is Head of Documentation at Skyflow and the creator of Doc Detective. Technical writer by day and engineer by night, he’s passionate about intuitive and scalable developer experiences and likes diving into the deep end as the 0th developer.

Jason Westra is a Senior Solutions Architect for AWS AI/ML startups. He provides guidance and technical assistance that enables customers to build scalable, highly available, secure AI and ML workloads in AWS Cloud.

Read More

Streamline custom model creation and deployment for Amazon Bedrock with Provisioned Throughput using Terraform

Streamline custom model creation and deployment for Amazon Bedrock with Provisioned Throughput using Terraform

As customers seek to incorporate their corpus of knowledge into their generative artificial intelligence (AI) applications, or to build domain-specific models, their data science teams often want to conduct A/B testing and have repeatable experiments. In this post, we discuss a solution that uses infrastructure as code (IaC) to define the process of retrieving and formatting data for model customization and initiating the model customization. This enables you to version and iterate as needed.

With Amazon Bedrock, you can privately and securely customize foundation models (FMs) with your own data to build applications that are specific to your domain, organization, and use case. With custom models, you can create unique user experiences that reflect your company’s style, voice, and services.

Amazon Bedrock supports two methods of model customization:

  • Fine-tuning allows you to increase model accuracy by providing your own task-specific labeled training dataset and further specialize your FMs.
  • Continued pre-training allows you to train models using your own unlabeled data in a secure and managed environment and supports customer-managed keys. Continued pre-training helps models become more domain-specific by accumulating more robust knowledge and adaptability—beyond their original training.

In this post, we provide guidance on how to create an Amazon Bedrock custom model using HashiCorp Terraform that allows you to automate the process, including preparing datasets used for customization.

Terraform is an IaC tool that allows you to manage AWS resources, software as a service (SaaS) resources, datasets, and more, using declarative configuration. Terraform provides the benefits of automation, versioning, and repeatability.

Solution overview

We use Terraform to download a public dataset from the Hugging Face Hub, convert it to JSONL format, and upload it to an Amazon Simple Storage Service (Amazon S3) bucket with a versioned prefix. We then create an Amazon Bedrock custom model using fine-tuning, and create a second model using continued pre-training. Lastly, we configure Provisioned Throughput for our new models so we can test and deploy the custom models for wider usage.

The following diagram illustrates the solution architecture.

Diagram depicting Amazon Bedrock Custom Model creation process using Terraform.

The workflow includes the following steps:

  1. The user runs the terraform apply command. The Terraform local-exec provisioner is used to run a Python script that downloads the public dataset DialogSum from the Hugging Face Hub. This is then used to create a fine-tuning training JSONL file.
  2. An S3 bucket stores training, validation, and output data. The generated JSONL file is uploaded to the S3 bucket.
  3. The FM defined in the Terraform configuration is used as the source for the custom model training job.
  4. The custom model training job uses the fine-tuning training data stored in the S3 bucket to enrich the FM. Amazon Bedrock is able to access the data in the S3 bucket (including output data) due to the AWS Identity and Access Management (IAM) role defined in the Terraform configuration, which grants access to the S3 bucket.
  5. When the custom model training job is complete, the new custom model is available for use.

The high-level steps to implement this solution are as follows:

  1. Create and initialize a Terraform project.
  2. Create data sources for context lookup.
  3. Create an S3 bucket to store training, validation, and output data.
  4. Create an IAM service role that allows Amazon Bedrock to run a model customization job, access your training and validation data, and write your output data to your S3 bucket.
  5. Configure your local Python virtual environment.
  6. Download the DialogSum public dataset and convert it to JSONL.
  7. Upload the converted dataset to Amazon S3.
  8. Create an Amazon Bedrock custom model using fine-tuning.
  9. Configure custom model Provisioned Throughput for your models.

Prerequisites

This solution requires the following prerequisites:

  • An AWS account with credentials configured for Terraform to use
  • Terraform version 1.0.0 or later installed locally
  • Python 3 installed locally
  • Access to Amazon Bedrock, including the Cohere Command Light FM, in your Region

Create and initialize a Terraform project

Complete the following steps to create a new Terraform project and initialize it. You can work in a local folder of your choosing.

  1. In your preferred terminal, create a new folder named bedrockcm and change to that folder:
    1. If on Windows, use the following code:
      md bedrockcm
      cd bedrockcm

    2. If on Mac or Linux, use the following code:
      mkdir bedrockcm
      cd bedrockcm

Now you can work in a text editor and enter the code.

  1. In your preferred text editor, add a new file with the following Terraform code:
terraform {
  required_version = ">= 1.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.35.0"
    }
  }
}
  1. Save the file in the root of the bedrockcm folder and name it main.tf.
  2. In your terminal, run the following command to initialize the Terraform working directory:
terraform init

The output will contain a successful message like the following:

“Terraform has been successfully initialized”

  1. In your terminal, validate the syntax for your Terraform files:
terraform validate

Create data sources for context lookup

The next step is to add configurations that define data sources that look up information about the context Terraform is currently operating in. These data sources are used when defining the IAM role and policies and when creating the S3 bucket. More information can be found in the Terraform documentation for aws_caller_identity, aws_partition, and aws_region.

  1. In your text editor, add the following Terraform code to your main.tf file:
# Data sources to query the current context Terraform is operating in
data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}
data "aws_region" "current" {}
  1. Save the file.

Create an S3 bucket

In this step, you use Terraform to create an S3 bucket to use during model customization and associated outputs. S3 bucket names are globally unique, so you use the Terraform data source aws_caller_identity, which allows you to look up the current AWS account ID, and use string interpolation to include the account ID in the bucket name. Complete the following steps:

  1. Add the following Terraform code to your main.tf file:
# Create an S3 bucket
resource "aws_s3_bucket" "model_training" {
  bucket = "model-training-${data.aws_caller_identity.current.account_id}"
}
  1. Save the file.

Create an IAM service role for Amazon Bedrock

Now you create the service role that Amazon Bedrock will assume to operate the model customization jobs.

You first create a policy document, assume_role_policy, which defines the trust relationship for the IAM role. The policy allows the bedrock.amazonaws.com service to assume this role. You use global condition context keys for cross-service confused deputy prevention. There are also two conditions you specify: the source account must match the current account, and the source ARN must be an Amazon Bedrock model customization job operating from the current partition, AWS Region, and current account.

Complete the following steps:

  1. Add the following Terraform code to your main.tf file:
# Create a policy document to allow Bedrock to assume the role
data "aws_iam_policy_document" "assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["bedrock.amazonaws.com"]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [data.aws_caller_identity.current.account_id]
    }
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = ["arn:${data.aws_partition.current.partition}:bedrock:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:model-customization-job/*"]
    }
  }
}

The second policy document, bedrock_custom_policy, defines permissions for accessing the S3 bucket you created for model training, validation, and output. The policy allows the actions GetObject, PutObject, and ListBucket on the resources specified, which are the ARN of the model_training S3 bucket and all of the bucket’s contents. You will then create an aws_iam_policy resource, which creates the policy in AWS.

  1. Add the following Terraform code to your main.tf file:
# Create a policy document to allow Bedrock to access the S3 bucket
data "aws_iam_policy_document" "bedrock_custom_policy" {
  statement {
    sid       = "AllowS3Access"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = [aws_s3_bucket.model_training.arn, "${aws_s3_bucket.model_training.arn}/*"]
  }
}

resource "aws_iam_policy" "bedrock_custom_policy" {
  name_prefix = "BedrockCM-"
  description = "Policy for Bedrock Custom Models customization jobs"
  policy      = data.aws_iam_policy_document.bedrock_custom_policy.json
}

Finally, the aws_iam_role resource, bedrock_custom_role, creates an IAM role with a name prefix of BedrockCM- and a description. The role uses assume_role_policy as its trust policy and bedrock_custom_policy as a managed policy to allow the actions specified.

  1. Add the following Terraform code to your main.tf file:
# Create a role for Bedrock to assume
resource "aws_iam_role" "bedrock_custom_role" {
  name_prefix = "BedrockCM-"
  description = "Role for Bedrock Custom Models customization jobs"

  assume_role_policy  = data.aws_iam_policy_document.assume_role_policy.json
  managed_policy_arns = [aws_iam_policy.bedrock_custom_policy.arn]
}
  1. Save the file.

Configure your local Python virtual environment

Python supports creating lightweight virtual environments, each with its own independent set of Python packages installed. You create and activate a virtual environment, and then install the datasets package.

  1. In your terminal, in the root of the bedrockcm folder, run the following command to create a virtual environment:
python3 -m venv venv
  1. Activate the virtual environment:
    1. If on Windows, use the following command:
venv\Scripts\activate

    2. If on Mac or Linux, use the following command:
      source venv/bin/activate

Now you install the datasets package via pip.

  1. In your terminal, run the following command to install the datasets package:
pip3 install datasets

Download the public dataset

You now use Terraform’s local-exec provisioner to invoke a local Python script that will download the public dataset DialogSum from the Hugging Face Hub. The dataset is already divided into training, validation, and testing splits. This example uses just the training split.

You prepare the data for training by removing the id and topic columns, renaming the dialogue and summary columns, and truncating the dataset to 10,000 records. You then save the dataset in JSONL format. You could also use your own internal private datasets; we use a public dataset for example purposes.

You first create the local Python script named dialogsum-dataset-finetune.py, which is used to download the dataset and save it to disk.

  1. In your text editor, add a new file with the following Python code:
import pandas as pd
from datasets import load_dataset

# Load the dataset from the huggingface hub
dataset = load_dataset("knkarthick/dialogsum")

# Convert the dataset to a pandas DataFrame
dft = dataset['train'].to_pandas()

# Drop the columns that are not required for fine-tuning
dft = dft.drop(columns=['id', 'topic'])

# Rename the columns to prompt and completion as required for fine-tuning.
# Ref: https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-prereq.html#model-customization-prepare
dft = dft.rename(columns={"dialogue": "prompt", "summary": "completion"})

# Limit the number of rows to 10,000 for fine-tuning
dft = dft.sample(10000,
    random_state=42)

# Save DataFrame as a JSONL file, with each line as a JSON object
dft.to_json('dialogsum-train-finetune.jsonl', orient='records', lines=True)
  1. Save the file in the root of the bedrockcm folder and name it dialogsum-dataset-finetune.py.

Next, you edit the main.tf file you have been working in and add the terraform_data resource type, which uses a local-exec provisioner to invoke your Python script.

  1. In your text editor, edit the main.tf file and add the following Terraform code:
resource "terraform_data" "training_data_fine_tune_v1" {
  input = "dialogsum-train-finetune.jsonl"

  provisioner "local-exec" {
    command = "python dialogsum-dataset-finetune.py"
  }
}

Upload the converted dataset to Amazon S3

Terraform provides the aws_s3_object resource type, which allows you to create and manage objects in S3 buckets. In this step, you reference the S3 bucket you created earlier and the terraform_data resource’s output attribute. This output attribute is how you instruct the Terraform resource graph that these resources must be created in a specific dependency order.

  1. In your text editor, edit the main.tf file and add the following Terraform code:
resource "aws_s3_object" "v1_training_fine_tune" {
  bucket = aws_s3_bucket.model_training.id
  key    = "training_data_v1/${terraform_data.training_data_fine_tune_v1.output}"
  source = terraform_data.training_data_fine_tune_v1.output
}

Create an Amazon Bedrock custom model using fine-tuning

Amazon Bedrock has multiple FMs that support customization with fine-tuning. To see a list of the models available, use the following AWS Command Line Interface (AWS CLI) command:

  1. In your terminal, run the following command to list the FMs that support customization by fine-tuning:
aws bedrock list-foundation-models --by-customization-type FINE_TUNING
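If you prefer to check this from Python instead of the AWS CLI, the equivalent boto3 call is shown in the following sketch:

# List FMs that support fine-tuning using boto3 (equivalent to the CLI command above)
import boto3

bedrock = boto3.client("bedrock")
response = bedrock.list_foundation_models(byCustomizationType="FINE_TUNING")
for summary in response["modelSummaries"]:
    print(summary["modelId"])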

You use the Cohere Command-Light FM for this model customization. You add a Terraform data source to query the foundation model ARN using the model name. You then create the Terraform resource definition for aws_bedrock_custom_model, which creates a model customization job, and immediately returns.

The time it takes for model customization is non-deterministic, and is based on the input parameters, model used, and other factors.

  1. In your text editor, edit the main.tf file and add the following Terraform code:
data "aws_bedrock_foundation_model" "cohere_command_light_text_v14" {
  model_id = "cohere.command-light-text-v14:7:4k"
}

resource "aws_bedrock_custom_model" "cm_cohere_v1" {
  custom_model_name     = "cm_cohere_v001"
  job_name              = "cm.command-light-text-v14.v001"
  base_model_identifier = data.aws_bedrock_foundation_model.cohere_command_light_text_v14.model_arn
  role_arn              = aws_iam_role.bedrock_custom_role.arn
  customization_type    = "FINE_TUNING"

  hyperparameters = {
    "epochCount"             = "1"
    "batchSize"              = "8"
    "learningRate"           = "0.00001"
    "earlyStoppingPatience"  = "6"
    "earlyStoppingThreshold" = "0.01"
    "evalPercentage"         = "20.0"
  }

  output_data_config {
    s3_uri = "s3://${aws_s3_bucket.model_training.id}/output_data_v1/"
  }

  training_data_config {
    s3_uri = "s3://${aws_s3_bucket.model_training.id}/training_data_v1/${terraform_data.training_data_fine_tune_v1.output}"
  }
}
  1. Save the file.

Now you use Terraform to create the data sources and resources defined in your main.tf file, which will start a model customization job.

  1. In your terminal, run the following command to validate the syntax for your Terraform files:
terraform validate
  1. Run the following command to apply the configuration you created. Before creating the resources, Terraform will describe all the resources that will be created so you can verify your configuration:
terraform apply

Terraform will generate a plan and ask you to approve the actions, which will look similar to the following code:

...

Plan: 6 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:
  1. Enter yes to approve the changes.

Terraform will now apply your configuration. This process runs for a few minutes. At this time, your custom model is not yet ready for use; it will be in a Training state. Wait for training to finish before continuing. You can review the status on the Amazon Bedrock console on the Custom models page.
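If you would rather poll the job status from a script than watch the console, the following boto3 sketch checks the customization job created by the Terraform configuration. The job name matches the job_name attribute defined earlier; the polling interval is illustrative.

# Poll the model customization job until it leaves the InProgress state
import time
import boto3

bedrock = boto3.client("bedrock")

while True:
    job = bedrock.get_model_customization_job(
        jobIdentifier="cm.command-light-text-v14.v001"
    )
    status = job["status"]  # InProgress | Completed | Failed | Stopping | Stopped
    print(f"Customization job status: {status}")
    if status != "InProgress":
        break
    time.sleep(300)  # customization can take a while, so check infrequently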

Screenshot of Amazon Bedrock Console training a custom model

When the process is complete, you receive a message like the following:

Apply complete! Resources: 6 added, 0 changed, 0 destroyed.

You can also view the status on the Amazon Bedrock console.

Screenshot of Amazon Bedrock Console displaying a custom model training job in 'completed' status.

You have now created an Amazon Bedrock custom model using fine-tuning.

Configure custom model Provisioned Throughput

Amazon Bedrock allows you to run inference on custom models by purchasing Provisioned Throughput. This guarantees a consistent level of throughput in exchange for a term commitment. You specify the number of model units needed to meet your application’s performance needs. For evaluating custom models initially, you can purchase Provisioned Throughput hourly (on-demand) with no long-term commitment. With no commitment, a quota of one model unit is available per Provisioned Throughput.

You create a new resource for Provisioned Throughput, associate one of your custom models, and provide a name. You omit the commitment_duration attribute to use on-demand.

  1. In your text editor, edit the main.tf file and add the following Terraform code:
resource "aws_bedrock_provisioned_model_throughput" "cm_cohere_provisioned_v1" {
  provisioned_model_name = "${aws_bedrock_custom_model.cm_cohere_v1.custom_model_name}-provisioned"
  model_arn              = aws_bedrock_custom_model.cm_cohere_v1.custom_model_arn
  model_units            = 1 
}
  1. Save the file.

Now you use Terraform to create the resources defined in your main.tf file.

  1. In your terminal, run the following command to re-initialize the Terraform working directory:
terraform init

The output will contain a successful message like the following:

“Terraform has been successfully initialized”
  1. Validate the syntax for your Terraform files:
terraform validate
  1. Run the following command to apply the configuration you created:
terraform apply
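After the Provisioned Throughput becomes active, you can test the custom model by passing the provisioned model ARN as the model ID. The following boto3 sketch is illustrative: the ARN placeholder, prompt, and request parameter values are assumptions, and the body should follow the Cohere Command text request format.

# Invoke the fine-tuned Cohere model through its Provisioned Throughput
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder: use the provisioned model ARN created by your Terraform apply
provisioned_model_arn = "arn:aws:bedrock:us-east-1:111122223333:provisioned-model/EXAMPLE"

body = {
    "prompt": "Summarize the following dialogue: ...",
    "max_tokens": 200,       # illustrative values
    "temperature": 0.3,
}

response = bedrock_runtime.invoke_model(
    modelId=provisioned_model_arn,
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["generations"][0]["text"])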

Best practices and considerations

Note the following best practices when using this solution:

  • Data and model versioning – You can version your datasets and models by using version identifiers in your S3 bucket prefixes. This allows you to compare model efficacy and outputs. You could even operate a new model in a shadow deployment so that your team can evaluate the output relative to your models being used in production.
  • Data privacy and network security – With Amazon Bedrock, you are in control of your data, and all your inputs and customizations remain private to your AWS account. Your data, such as prompts, completions, custom models, and data used for fine-tuning or continued pre-training, is not used for service improvement and is never shared with third-party model providers. Your data remains in the Region where the API call is processed. All data is encrypted in transit and at rest. You can use AWS PrivateLink to create a private connection between your VPC and Amazon Bedrock.
  • Billing – Amazon Bedrock charges for model customization, storage, and inference. Model customization is charged per tokens processed. This is the number of tokens in the training dataset multiplied by the number of training epochs. An epoch is one full pass through the training data during customization. Model storage is charged per month, per model. Inference is charged hourly per model unit using Provisioned Throughput. For detailed pricing information, see Amazon Bedrock Pricing.
  • Custom models and Provisioned Throughput – Amazon Bedrock allows you to run inference on custom models by purchasing Provisioned Throughput. This guarantees a consistent level of throughput in exchange for a term commitment. You specify the number of model units needed to meet your application’s performance needs. For evaluating custom models initially, you can purchase Provisioned Throughput hourly with no long-term commitment. With no commitment, a quota of one model unit is available per Provisioned Throughput. You can create up to two Provisioned Throughputs per account.
  • Availability – Fine-tuning support on Meta Llama 2, Cohere Command Light, and Amazon Titan Text FMs is available today in Regions US East (N. Virginia) and US West (Oregon). Continued pre-training is available today in public preview in Regions US East (N. Virginia) and US West (Oregon). To learn more, visit the Amazon Bedrock Developer Experience and check out Custom models.

Clean up

When you no longer need the resources created as part of this post, clean up those resources to save associated costs. You can clean up the AWS resources created in this post using Terraform with the terraform destroy command.

First, you need to modify the configuration of the S3 bucket in the main.tf file to enable force destroy, so that the contents of the bucket are deleted and the bucket itself can then be deleted. This will remove all of the sample data contained in the S3 bucket as well as the bucket itself. Make sure there is no data you want to retain in the bucket before proceeding.

  1. Modify the declaration of your S3 bucket to set the force_destroy attribute of the S3 bucket:
# Create an S3 bucket
resource "aws_s3_bucket" "model_training" {
  bucket = "model-training-${data.aws_caller_identity.current.account_id}"
  force_destroy = true
}
  1. Run the terraform apply command to update the S3 bucket with this new configuration:
terraform apply
  1. Run the terraform destroy command to delete all resources created as part of this post:
terraform destroy

Conclusion

In this post, we demonstrated how to create Amazon Bedrock custom models using Terraform. We introduced GitOps to manage model configuration and data associated with your custom models.

We recommend testing the code and examples in your development environment, and making appropriate changes as required to use them in production. Consider your model consumption requirements when defining your Provisioned Throughput.

We welcome your feedback! If you have questions or suggestions, leave them in the comments section.


About the Authors

Josh Famestad is a Solutions Architect at AWS helping public sector customers accelerate growth, add agility, and reduce risk with cloud-based solutions.

Kevon Mayers is a Solutions Architect at AWS. Kevon is a Core Contributor for Terraform and has led multiple Terraform initiatives within AWS. Prior to joining AWS, he was working as a DevOps engineer and developer, and before that was working with the GRAMMYs/The Recording Academy as a studio manager, music producer, and audio engineer.

Tyler Lynch is a Principal Solution Architect at AWS. Tyler leads Terraform provider engineering at AWS and is a Core Contributor for Terraform.

Read More

Boost productivity with video conferencing transcripts and summaries with the Amazon Chime SDK Meeting Summarizer solution

Boost productivity with video conferencing transcripts and summaries with the Amazon Chime SDK Meeting Summarizer solution

Businesses today heavily rely on video conferencing platforms for effective communication, collaboration, and decision-making. However, despite the convenience these platforms offer, there are persistent challenges in seamlessly integrating them into existing workflows. One of the major pain points is the lack of comprehensive tools to automate the process of joining meetings, recording discussions, and extracting actionable insights from them. This gap results in inefficiencies, missed opportunities, and limited productivity, hindering the seamless flow of information and decision-making processes within organizations.

To address this challenge, we’ve developed the Amazon Chime SDK Meeting Summarizer application deployed with the AWS Cloud Development Kit (AWS CDK). This application uses an Amazon Chime SDK SIP media application, Amazon Transcribe, and Amazon Bedrock to seamlessly join meetings, record meeting audio, and process recordings for transcription and summarization. By integrating these services programmatically through the AWS CDK, we aim to streamline the meeting workflow, empower users with actionable insights, and drive better decision-making outcomes. Our solution currently integrates with popular platforms such as Amazon Chime, Zoom, Cisco Webex, Microsoft Teams, and Google Meet.

In addition to deploying the solution, we’ll also teach you the intricacies of prompt engineering in this post. We guide you through addressing parsing and information extraction challenges, including speaker diarization, call scheduling, summarization, and transcript cleaning. Through detailed instructions and structured approaches tailored to each use case, we illustrate the effectiveness of Amazon Bedrock, powered by Anthropic Claude models.

Solution overview

The following infrastructure diagram provides an overview of the AWS services that are used to create this meeting summarization bot. The core services used in this solution are:

  • An Amazon Chime SDK SIP Media Application is used to dial into the meeting and record meeting audio
  • Amazon Transcribe is used to perform speech-to-text processing of the recorded audio, including speaker diarization
  • Anthropic Claude models in Amazon Bedrock are used to identify names, improve the quality of the transcript, and provide a detailed summary of the meeting

For a detailed explanation of the solution, refer to the Amazon Chime Meeting Summarizer documentation.

Prerequisites

Before diving into the project setup, make sure you have the following requirements in place:

  • Yarn – Yarn must be installed on your machine.
  • AWS account – You’ll need an active AWS account.
  • Enable Claude Anthropic models – These models should be enabled in your AWS account. For more information, see Model access.
  • Enable Amazon Titan – Amazon Titan should be activated in your AWS account. For more information, see Amazon Titan Models.

Refer to our GitHub repository for a step-by-step guide on deploying this solution.

Access the meeting with Amazon Chime SDK

To capture the audio of a meeting, an Amazon Chime SDK SIP media application will dial into the meeting using the meeting provider’s dial-in number. The Amazon Chime SDK SIP media application (SMA) is a programmable telephony service that will make a phone call over the public switched telephone network (PSTN) and capture the audio. SMA uses a request/response model with an AWS Lambda function to process actions. In this demo, an outbound call is made using the CreateSipMediaApplicationCall API. This will cause the Lambda function to be invoked with a NEW_OUTBOUND_CALL event.

Because most meeting dial-in mechanisms require a PIN or other identification, the SMA will use the SendDigits action to send the required dual tone multi-frequency (DTMF) digits to the meeting provider. When the application has joined the meeting, it will introduce itself using the Speak action and then record the audio using the RecordAudio action. This audio is saved in MP3 format to an Amazon Simple Storage Service (Amazon S3) bucket.

Speaker diarization with Amazon Transcribe

Because the SMA is joined to the meeting as a participant, the audio will be a single channel of all the participants. To process this audio file, Amazon Transcribe will be used with the ShowSpeakerLabels setting:

import {
    TranscribeClient,
    StartTranscriptionJobCommand,
} from '@aws-sdk/client-transcribe';

const transcribeClient = new TranscribeClient({});

// Start an asynchronous transcription job with speaker diarization enabled
const response = await transcribeClient.send(
    new StartTranscriptionJobCommand({
        TranscriptionJobName: jobName,
        IdentifyLanguage: true,
        MediaFormat: 'wav',
        Media: {
            MediaFileUri: audioSource,
        },
        Settings: {
            ShowSpeakerLabels: true,
            MaxSpeakerLabels: 10,
        },
        OutputBucketName: `${BUCKET}`,
        OutputKey: `${PREFIX_TRANSCRIBE_S3}/`,
    }),
);

With speaker diarization, Amazon Transcribe will distinguish different speakers in the transcription output. The JSON file that is produced will include the transcripts and items along with speaker labels grouped by speaker with start and end timestamps. With this information provided by Amazon Transcribe, a turn-by-turn transcription can be generated by parsing through the JSON. The result will be a more readable transcription. See the following example:

spk_0: Hey Court , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
spk_1: Awesome . Yeah , thank you for inviting me . So ,
spk_0: uh can you tell me a little bit about uh the servers you’re currently using on premises ?
spk_1: Yeah . So for our servers , we currently have four web servers running with four gigabyte of RA M and two CP US and we’re currently running Linux as our operating system .
spk_0: Ok . And what are you using for your database ?
spk_1: Oh , yeah , for a database , we currently have a 200 gigabyte database running my as to will and I can’t remember the version . But um the thing about our database is sometimes it lags . So we’re really looking for a faster option for
spk_0: that . So , um when you’re , you’re noticing lags with reads or with rights to the database
spk_1: with , with reads .
spk_0: Yeah . Ok . Have you heard of uh read replicas ?
spk_1: I have not .
spk_0: Ok . That could be a potential uh solution to your problem . Um If you don’t mind , I’ll go ahead and just uh send you some information on that later for you and your team to review .
spk_1: Oh , yeah , thank you , Adam . Yeah . Anything that could help . Yeah , be really helpful .
spk_0: Ok , last question before I let you go . Um what are you doing uh to improve the security of your on premises ? Uh data ?
spk_1: Yeah , so , so number one , we have been experiencing some esto injection attacks . We do have a Palo Alto firewall , but we’re not able to fully achieve layer server protection . So we do need a better option for that .
spk_0: Have you ex have you experienced uh my sequel attacks in the past or sequel injections ?
spk_1: Yes.
spk_0: Ok , great . Uh All right . And then are you guys bound by any compliance standards like PC I DS S or um you know GDR ? Uh what’s another one ? Any of those C just uh
spk_1: we are bound by fate , moderate complaints . So .
spk_0: Ok . Well , you have to transition to fed ramp high at any time .
spk_1: Uh Not in the near future . No .
spk_0: Ok . All right , Court . Well , thanks for that uh context . I will be reaching out to you again here soon with a follow up email and uh yeah , I’m looking forward to talking to you again uh next week .
spk_1: Ok . Sounds good . Thank you , Adam for your help .
spk_0: All right . Ok , bye .
spk_1: All right . Take care .

Here, speakers have been identified based on the order they spoke. Next, we show you how to further enhance this transcription by identifying speakers using their names, rather than spk_0, spk_1, and so on.
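As a rough sketch (shown in Python for brevity), the turn-by-turn view above can be assembled from the Transcribe output JSON as follows. It assumes each transcript item carries a speaker_label field, which is the case when ShowSpeakerLabels is enabled; older output layouts may instead require joining items to results.speaker_labels.segments by timestamp.

# Group consecutive items by speaker to produce a turn-by-turn transcript
import json

def to_turns(transcribe_json_path: str) -> str:
    with open(transcribe_json_path) as f:
        results = json.load(f)["results"]

    turns, current_speaker, words = [], None, []
    for item in results["items"]:
        speaker = item.get("speaker_label", current_speaker)
        if speaker != current_speaker and words:
            turns.append(f"{current_speaker}: {' '.join(words)}")
            words = []
        current_speaker = speaker
        words.append(item["alternatives"][0]["content"])
    if words:
        turns.append(f"{current_speaker}: {' '.join(words)}")
    return "\n".join(turns)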

Use Anthropic Claude and Amazon Bedrock to enhance the solution

This application uses a large language model (LLM) to complete the following tasks:

  • Speaker name identification
  • Transcript cleaning
  • Call summarization
  • Meeting invitation parsing

Prompt engineering for speaker name identification

The first task is to enhance the transcription by assigning names to speaker labels. These names are extracted from the transcript itself when a person introduces themselves and then are returned as output in JSON format by the LLM.

Special instructions are provided for cases where only one speaker is identified to provide consistency in the response structure. By following these instructions, the LLM will process the meeting transcripts and accurately extract the names of the speakers without additional words or spacing in the response. If no names are identified by the LLM, we prompted the model to return an Unknown tag.

In this demo, the prompts are designed using Anthropic Claude Sonnet as the LLM. You may need to tune the prompts to modify the solution to use another available model on Amazon Bedrock.

Human: You are a meeting transcript names extractor. Go over the transcript and extract the names from it. Use the following instructions in the <instructions></instructions> xml tags
<transcript> ${transcript} </transcript>
<instructions>
– Extract the names like this example – spk_0: “name1”, spk_1: “name2”.
– Only extract the names like the example above and do not add any other words to your response
– Your response should only have a list of “speakers” and their associated name separated by a “:” surrounded by {}
– if there is only one speaker identified then surround your answer with {}
– the format should look like this {“spk_0”: “Name”, “spk_1”: “Name2”, etc.}, no unnecessary spacing should be added
</instructions>

Assistant: Should I add anything else in my answer?

Human: Only return a JSON formatted response with the Name and the speaker label associated to it. Do not add any other words to your answer. Do NOT EVER add any introductory sentences in your answer. Only give the names of the speakers actively speaking in the meeting. Only give the names of the speakers actively speaking in the meeting in the format shown above.

Assistant:

After the speakers are identified and returned in JSON format, we can replace the generic speaker labels with the name attributed speaker labels in the transcript. The result will be a more enhanced transcription:

Adam: Hey Court , how’s it going ?
Court: Hey Adam , it’s going good . How are you
Adam: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
Court: Awesome . Yeah , thank you for inviting me . So ,
Adam: uh can you tell me a little bit about uh the servers you’re currently using on premises ?
Court: Yeah . So for our servers , we currently have four web servers running with four gigabyte of RA M and two CP US and we’re currently running Linux as our operating system .

… transcript continues….
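The substitution itself is straightforward once the LLM’s JSON response is parsed; the following minimal sketch (in Python, for brevity) shows the idea:

# Swap generic speaker labels for the names returned by the LLM
import json

def apply_speaker_names(transcript: str, llm_json_response: str) -> str:
    names = json.loads(llm_json_response)  # e.g. {"spk_0": "Adam", "spk_1": "Court"}
    for label, name in names.items():
        transcript = transcript.replace(f"{label}:", f"{name}:")
    return transcript

print(apply_speaker_names("spk_0: Hey Court, how's it going?",
                          '{"spk_0": "Adam", "spk_1": "Court"}'))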

But what if a speaker can’t be identified because they never introduced themselves? In that case, we want the LLM to leave them as unknown, rather than force or hallucinate a label.

We can add the following to the instructions:

If no name is found for a speaker, use UNKNOWN_X where X is the speaker label number

The following transcript has three speakers, but only two of them are identified by name. The LLM must label the remaining speaker as UNKNOWN rather than forcing a name or other response onto them.

spk_0: Yeah .
spk_1: Uh Thank you for joining us Court . I am your account executive here at Aws . Uh Joining us on the call is Adam , one of our solutions architect at Aws . Adam would like to introduce yourself .
spk_2: Hey , everybody . High Court . Good to meet you . Uh As your dedicated solutions architect here at Aws . Uh my job is to help uh you at every step of your cloud journey . Uh My , my role involves helping you understand architecting on Aws , including best practices for the cloud . Uh So with that , I’d love it if you could just take a minute to introduce yourself and maybe tell me a little about what you’re currently running on premises .
spk_0: Yeah , great . It’s uh great to meet you , Adam . Um My name is Court . I am the V P of engineering here at any company . And uh yeah , really excited to hear what you can do for us .
spk_1: Thanks , work . Well , we , we talked a little bit of last time about your goals for migrating to Aws . I invited Adam to , to join us today to get a better understanding of your current infrastructure and other technical requirements .
spk_2: Yeah . So co could you tell me a little bit about what you’re currently running on premises ?
spk_1: Sure . Yeah , we’re , uh ,
spk_0: we’re running a three tier web app on premise .

When we give Claude Sonnet the option not to force a name, we see results like the following:

{“spk_0”: “Court”, “spk_1”: “UNKNOWN_1”, “spk_2”: “Adam”}

Prompt engineering for cleaning the transcript

Now that the transcript has been diarized with speaker attributed names, we can use Amazon Bedrock to clean the transcript. Cleaning tasks include eliminating distracting filler words, identifying and rectifying misattributed homophones, and addressing any diarization errors stemming from subpar audio quality. For guidance on accomplishing these tasks using Anthropic Claude Sonnet models, see the provided prompt:

Human: You are a transcript editor, please follow the <instructions> tags.
<transcript> ${transcript} </transcript>
<instructions>
– The <transcript> contains a speaker diarized transcript
– Go over the transcript and remove all filler words. For example “um, uh, er, well, like, you know, okay, so, actually, basically, honestly, anyway, literally, right, I mean.”
– Fix any errors in transcription that may be caused by homophones based on the context of the sentence. For example, “one instead of won” or “high instead of hi”
– In addition, please fix the transcript in cases where diarization is improperly performed. For example, in some cases you will see that sentences are split between two speakers. In this case infer who the actual speaker is and attribute it to them.

– Please review the following example of this,

Input Example
Court: Adam you are saying the wrong thing. What
Adam: um do you mean, Court?

Output:
Court: Adam you are saying the wrong thing.
Adam: What do you mean, Court?

– In your response, return the entire cleaned transcript, including all of the filler word removal and the improved diarization. Only return the transcript, do not include any leading or trailing sentences. You are not summarizing. You are cleaning the transcript. Do not include any xml tags <>
</instructions>
Assistant:
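As a hedged illustration (shown in Python for brevity, although the solution itself is written in TypeScript), a prompt like this can be sent to Anthropic Claude 3 Sonnet on Amazon Bedrock as a single user message through the Messages API; the model ID and max_tokens value below are assumptions.

# Send the cleaning prompt to Claude 3 Sonnet on Amazon Bedrock
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def clean_transcript(transcript: str, prompt_template: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,  # illustrative
        "messages": [
            {"role": "user",
             "content": prompt_template.replace("${transcript}", transcript)}
        ],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]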

Transcript Processing

After the initial transcript is passed into the LLM, it returns a polished transcript, free from errors. The following are excerpts from the transcript:

Speaker Identification

Input:

spk_0: Hey Court , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .

Output:

Adam: Hey Court, how’s it going?
Court: Hey Adam, it’s going good. How are you?
Adam: Thanks for joining me today. I’m excited to talk to you about architecting on AWS

Homophone Replacement

Input:

Adam: Have you ex have you experienced uh my sequel attacks in the past or sequel injections ?
Court: Yes .

Output:

Adam: Have you experienced SQL injections in the past?
Court: Yes.

Filler Word Removal

Input:

Adam: Ok , great . Uh All right . And then are you guys bound by any compliance standards like PC I DS S or um you know GDR ? Uh what’s another one ? Any of those C just uh
Court: we are bound by fate , moderate complaints . So .
Adam: Ok . Well , you have to transition to fed ramp high at any time

Output:

Adam: Ok, great. And then are you guys bound by any compliance standards like PCI DSS or GDPR? What’s another one? Any of those? CJIS?
Court: We are bound by FedRAMP moderate compliance.
Adam: Ok. Will you have to transition to FedRAMP high at any time?

Prompt engineering for summarization

Now that the transcript has been created by Amazon Transcribe, diarized, and enhanced with Amazon Bedrock, the transcript can be summarized with Amazon Bedrock and an LLM.  A simple version might look like the following:

Human:
You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the transcript.
Transcript: ${transcript}

Assistant:

Although this will work, it can be improved with additional instructions. You can use XML tags to provide structure and clarity to the task requirements:

Human: You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the content within the <instructions> tags.
<transcript> ${transcript} </transcript>

<instructions>
– Go over the conversation that was had in the transcript.
– Create a summary based on what occurred in the meeting.
– Highlight specific action items that came up in the meeting, including follow-up tasks for each person.
</instructions>

Assistant:

Because meetings often involve action items and date/time information, instructions are added to explicitly request the LLM to include this information. Because the LLM knows the speaker names, each person is assigned specific action items for them if any are found. To prevent hallucinations, an instruction is included that allows the LLM to fail gracefully.

Human: You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the content within the <instructions> tags.

<transcript> ${transcript} </transcript>

<instructions>
– Go over the conversation that was had in the transcript.
– Create a summary based on what occurred in the meeting.
– Highlight specific action items that came up in the meeting, including follow-up tasks for each person.
– If relevant, focus on what specific AWS services were mentioned during the conversation.
– If there’s sufficient context, infer the speaker’s role and mention it in the summary. For instance, “Bob, the customer/designer/sales rep/…”
– Include important date/time, and indicate urgency if necessary (e.g., deadline/ETAs for outcomes and next steps)
</instructions>

Assistant: Should I add anything else in my answer?

Human: If there is not enough context to generate a proper summary, then just return a string that says “Meeting not long enough to generate a transcript.”

Assistant:

Alternatively, we invite you to explore Amazon Transcribe Call Analytics generative call summarization for an out-of-the-box solution that integrates directly with Amazon Transcribe.

Prompt engineering for meeting invitation parsing

Included in this demo is a React-based UI that will start the process of the SMA joining the meeting. Because this demo supports multiple meeting types, the invitation must be parsed. Rather than parse this with a complex regular expression (regex), it will be processed with an LLM. This prompt will first identify the meeting type: Amazon Chime, Zoom, Google Meet, Microsoft Teams, or Cisco Webex. Based on the meeting type, the LLM will extract the meeting ID and dial-in information. Simply copy/paste the meeting invitation to the UI, and the invitation will be processed by the LLM to determine how to call the meeting provider. This can be done for a meeting that is currently happening or scheduled for a future meeting. See the following example:

Human: You are an information extracting bot. Go over the meeting invitation below and determine what the meeting id and meeting type are. Use the following instructions in the <instructions></instructions> xml tags

<meeting_invitation>
${meetingInvitation}
</meeting_invitation>

<instructions>
1. Identify Meeting Type:
Determine if the meeting invitation is for Chime, Zoom, Google, Microsoft Teams, or Webex meetings.

2. Chime, Zoom, and Webex
– Find the meetingID
– Remove all spaces from the meeting ID (e.g., #### ## #### -> ##########).

3. If Google – Instructions Extract Meeting ID and Dial in
– For Google only, the meeting invitation will call a meetingID a ‘pin’, so treat it as a meetingID
– Remove all spaces from the meeting ID (e.g., #### ## #### -> ##########).
– Extract Google and Microsoft Dial-In Number (if applicable):
– If the meeting is a Google meeting, extract the unique dial-in number.
– Locate the dial-in number following the text “to join by phone dial.”
– Format the extracted Google dial-in number as (+1 ###-###-####), removing dashes and spaces. For example +1 111-111-1111 would become +11111111111)

4. If Microsoft Teams – Instructions if meeting type is Microsoft Teams.
– Pay attention to these instructions carefully
– The meetingId we want to store in the generated response is the ‘Phone Conference ID’ : ### ### ###
– in the meeting invitation, there are two IDs a ‘Meeting ID’ (### ### ### ##) and a ‘Phone Conference ID’ (### ### ###), ignore the ‘Meeting ID’ use the ‘Phone Conference ID’
– The meetingId we want to store in the generated response is the ‘Phone Conference ID’ : ### ### ###
– The meetingID that we want is referenced as the ‘Phone Conference ID’ store that one as the meeting ID.
– Find the phone number, extract it and store it as the dialIn number (format (+1 ###-###-####), removing dashes and spaces. For example +1 111-111-1111 would become +11111111111)

5. meetingType rules
– The only valid responses for meetingType are ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Teams’

6. Generate Response:
– Create a response object with the following format:
{
meetingId: “meeting id goes here with spaces removed”,
meetingType: “meeting type goes here (options: ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Teams’)”,
dialIn: “Insert Google/Microsoft Teams Dial-In number with no dashes or spaces, or N/A if not a Google/Microsoft Teams Meeting”
}

Meeting ID Formats:
Zoom: ### #### ####
Webex: #### ### ####
Chime: #### ## ####
Google: ### ### ####
Teams: ### ### ###

Ensure that the program does not create fake phone numbers and only includes the Microsoft or Google dial-in number if the meeting type is Google or Teams.

</instructions>

Assistant: Should I add anything else in my answer?

Human: Only return a JSON formatted response with the meetingid and meeting type associated to it. Do not add any other words to your answer. Do not add any introductory sentences in your answer.

Assistant:

With this information extracted from the invitation, a call is placed to the meeting provider so that the SMA can join the meeting as a participant.
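The following boto3 sketch (Python for brevity) illustrates what that outbound call might look like. The phone numbers, SIP media application ID, and argument keys are placeholders; the actual solution places the call from its AWS CDK-deployed backend.

# Place the outbound call that lets the SIP media application join the meeting
import boto3

chime = boto3.client("chime-sdk-voice")

parsed = {"meetingId": "1234567890", "meetingType": "Chime", "dialIn": "N/A"}

response = chime.create_sip_media_application_call(
    FromPhoneNumber="+15550001111",      # placeholder: a number provisioned in your account
    ToPhoneNumber="+15552223333",        # placeholder: the meeting provider's dial-in number
    SipMediaApplicationId="example-sma-id",
    ArgumentsMap={
        "meetingId": parsed["meetingId"],
        "meetingType": parsed["meetingType"],
    },
)
print(response["SipMediaApplicationCall"]["TransactionId"])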

Clean up

If you deployed this sample solution, clean up your resources by destroying the AWS CDK application from the AWS Command Line Interface (AWS CLI). This can be done using the following command:

yarn cdk destroy

Conclusion

In this post, we showed how to enhance Amazon Transcribe with an LLM using Amazon Bedrock by extracting information that would otherwise be difficult for a regex to extract. We also used this method to extract information from a meeting invitation sent from an unknown source. Finally, we showed how to use an LLM to provide a summarization of the meeting using detailed instructions to produce action items and include date/time information in the response.

We invite you to deploy this demo into your own account. We’d love to hear from you. Let us know what you think in the issues forum of the Amazon Chime SDK Meeting Summarizer GitHub repository. Alternatively, we invite you to explore other methods for meeting transcription and summarization, such as Amazon Live Meeting Assistant, which uses a browser extension to collect call audio.


About the authors

Adam Neumiller is a Solutions Architect for AWS. He is focused on helping public sector customers drive cloud adoption through the use of infrastructure as code. Outside of work, he enjoys spending time with his family and exploring the great outdoors.

Court Schuett is a Principal Specialist SA – GenAI focused on third party models and how they can be used to help solve customer problems.  When he’s not coding, he spends his time exploring parks, traveling with his family, and listening to music.

Christopher Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California, and enjoys gardening, cooking, aerospace/general aviation, and traveling the world.

Hang Su is a Senior Applied Scientist at AWS AI. He has been leading AWS Transcribe Contact Lens Science team. His interest lies in call-center analytics, LLM-based abstractive summarization, and general conversational AI.

Jason Cai is an Applied Scientist at AWS AI. He has made contributions to AWS Bedrock, Contact Lens, Lex and Transcribe. His interests include LLM agents, dialogue summarization, LLM prediction refinement, and knowledge graph.

Edgar Costa Filho is a Senior Cloud Infrastructure Architect with a focus on Foundations and Containers, including expertise in integrating Amazon EKS with open-source tooling like Crossplane, Terraform, and GitOps. In his role, Edgar is dedicated to assisting customers in achieving their business objectives by implementing best practices in cloud infrastructure design and management.

Read More