Introducing automatic training for solutions in Amazon Personalize

Introducing automatic training for solutions in Amazon Personalize

Amazon Personalize is excited to announce automatic training for solutions. Solution training is fundamental to maintain the effectiveness of a model and make sure recommendations align with users’ evolving behaviors and preferences. As data patterns and trends change over time, retraining the solution with the latest relevant data enables the model to learn and adapt, enhancing its predictive accuracy. Automatic training generates a new solution version, mitigating model drift and keeping recommendations relevant and tailored to end-users’ current behaviors while including the newest items. Ultimately, automatic training provides a more personalized and engaging experience that adapts to changing preferences.

Amazon Personalize accelerates your digital transformation with machine learning (ML), making it effortless to integrate personalized recommendations into existing websites, applications, email marketing systems, and more. Amazon Personalize enables developers to quickly implement a customized personalization engine, without requiring ML expertise. Amazon Personalize provisions the necessary infrastructure and manages the entire ML pipeline, including processing the data, identifying features, using the appropriate algorithms, and training, optimizing, and hosting the customized models based on your data. All your data is encrypted to be private and secure.

In this post, we guide you through the process of configuring automatic training, so your solutions and recommendations maintain their accuracy and relevance.

Solution overview

A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions (trained models). When you create a custom solution, you specify a recipe matching your use case and configure training parameters. For this post, you configure automatic training in the training parameters.

Prerequisites

To enable automatic training for your solutions, you first need to set up Amazon Personalize resources. Start by creating a dataset group, schemas, and datasets representing your items, interactions, and user data. For instructions, refer to Getting Started (console) or Getting Started (AWS CLI).

After you finish importing your data, you are ready to create a solution.

Create a solution

To set up automatic training, complete the following steps:

  1. On the Amazon Personalize console, create a new solution.
  2. Specify a name for your solution, choose the type of solution you want to create, and choose your recipe.
  3. Optionally, add any tags. For more information about tagging Amazon Personalize resources, see Tagging Amazon Personalize resources.
  4. To use automatic training, in the Automatic training section, select Turn on and specify your training frequency.

Automatic training is enabled by default to train one time every 7 days. You can configure the training cadence to suit your business needs, ranging from one time every 1–30 days.

  1. If your recipe generates item recommendations or user segments, optionally use the Columns for training section to choose the columns Amazon Personalize considers when training solution versions.
  2. In the Hyperparameter configuration section, optionally configure any hyperparameter options based on your recipe and business needs.
  3. Provide any additional configurations, then choose Next.
  4. Review the solution details and confirm that your automatic training is configured as expected.
  5. Choose Create solution.

Amazon Personalize will automatically create your first solution version. A solution version refers to a trained ML model. When a solution version is created for the solution, Amazon Personalize trains the model backing the solution version based on the recipe and training configuration. It can take up to 1 hour for the solution version creation to start.

The following is sample code for creating a solution with automatic training using the AWS SDK:

import boto3 
personalize = boto3.client('personalize')

solution_config = {
    "autoTrainingConfig": {
        "schedulingExpression": "rate(3 days)"
    }
}

recipe = "arn:aws:personalize:::recipe/aws-similar-items"
name = "test_automatic_training"
response = personalize.create_solution(name=name, recipeArn=recipe_arn, datasetGroupArn=dataset_group_arn, 
                            performAutoTraining=True, solutionConfig=solution_config)

print(response['solutionArn'])
solution_arn = response['solutionArn'])

After a solution is created, you can confirm whether automatic training is enabled on the solution details page.

You can also use the following sample code to confirm via the AWS SDK that automatic training is enabled:

response = personalize.describe_solution(solutionArn=solution_arn)
print(response)

Your response will contain the fields performAutoTraining and autoTrainingConfig, displaying the values you set in the CreateSolution call.

On the solution details page, you will also see the solution versions that are created automatically. The Training type column specifies whether the solution version was created manually or automatically.

You can also use the following sample code to return a list of solution versions for the given solution:

response = personalize.list_solution_versions(solutionArn=solution_arn)['solutionVersions']
print("List Solution Version responsen")
for val in response:
    print(f"SolutionVersion: {val}")
    print("n")

Your response will contain the field trainingType, which specifies whether the solution version was created manually or automatically.

When your solution version is ready, you can create a campaign for your solution version.

Create a campaign

A campaign deploys a solution version (trained model) to generate real-time recommendations. With Amazon Personalize, you can streamline your workflow and automate the deployment of the latest solution version to campaigns via automatic syncing. To set up auto sync, complete the following steps:

  1. On the Amazon Personalize console, create a new campaign.
  2. Specify a name for your campaign.
  3. Choose the solution you just created.
  4. Select Automatically use the latest solution version.
  5. Set the minimum provisioned transactions per second.
  6. Create your campaign.

The campaign is ready when its status is ACTIVE.

The following is sample code for creating a campaign with syncWithLatestSolutionVersion set to true using the AWS SDK. You must also append the suffix $LATEST to the solutionArn in solutionVersionArn when you set syncWithLatestSolutionVersion to true.

campaign_config = {
    "syncWithLatestSolutionVersion": True
}
resource_name = "test_campaign_sync"
solution_version_arn = "arn:aws:personalize:<region>:<accountId>:solution/<solution_name>/$LATEST"
response = personalize.create_campaign(name=resource_name, solutionVersionArn=solution_version_arn, campaignConfig=campaign_config)
campaign_arn = response['campaignArn']
print(campaign_arn)

On the campaign details page, you can see whether the campaign selected has auto sync enabled. When enabled, your campaign will automatically update to use the most recent solution version, whether it was automatically or manually created.

Use the following sample code to confirm via the AWS SDK that syncWithLatestSolutionVersion is enabled:

response = personalize.describe_campaign(campaignArn=campaign_arn)
Print(response)

Your response will contain the field syncWithLatestSolutionVersion under campaignConfig, displaying the value you set in the CreateCampaign call.

You can enable or disable the option to automatically use the latest solution version on the Amazon Personalize console after a campaign is created by updating your campaign. Similarly, you can enable or disable syncWithLatestSolutionVersion with UpdateCampaign using the AWS SDK.

Conclusion

With automatic training, you can mitigate model drift and maintain recommendation relevance by streamlining your workflow and automating the deployment of the latest solution version in Amazon Personalize.

For more information about optimizing your user experience with Amazon Personalize, see the Amazon Personalize Developer Guide.


About the authors

Ba’Carri Johnson is a Sr. Technical Product Manager working with AWS AI/ML on the Amazon Personalize team. With a background in computer science and strategy, she is passionate about product innovation. In her spare time, she enjoys traveling and exploring the great outdoors.

Ajay Venkatakrishnan is a Software Development Engineer on the Amazon Personalize team. In his spare time, he enjoys writing and playing soccer.

Pranesh Anubhav is a Senior Software Engineer for Amazon Personalize. He is passionate about designing machine learning systems to serve customers at scale. Outside of his work, he loves playing soccer and is an avid follower of Real Madrid.

Read More

Use Kubernetes Operators for new inference capabilities in Amazon SageMaker that reduce LLM deployment costs by 50% on average

Use Kubernetes Operators for new inference capabilities in Amazon SageMaker that reduce LLM deployment costs by 50% on average

We are excited to announce a new version of the Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like buckets, databases, or message queues simply by using the Kubernetes API.

Release v1.2.9 of the SageMaker ACK Operators adds support for inference components, which until now were only available through the SageMaker API and the AWS Software Development Kits (SDKs). Inference components can help you optimize deployment costs and reduce latency. With the new inference component capabilities, you can deploy one or more foundation models (FMs) on the same Amazon SageMaker endpoint and control how many accelerators and how much memory is reserved for each FM. This helps improve resource utilization, reduces model deployment costs on average by 50%, and lets you scale endpoints together with your use cases. For more details, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

The availability of inference components through the SageMaker controller enables customers who use Kubernetes as their control plane to take advantage of inference components while deploying their models on SageMaker.

In this post, we show how to use SageMaker ACK Operators to deploy SageMaker inference components.

How ACK works

To demonstrate how ACK works, let’s look at an example using Amazon Simple Storage Service (Amazon S3). In the following diagram, Alice is our Kubernetes user. Her application depends on the existence of an S3 bucket named my-bucket.

How ACK Works

The workflow consists of the following steps:

  1. Alice issues a call to kubectl apply, passing in a file that describes a Kubernetes custom resource describing her S3 bucket. kubectl apply passes this file, called a manifest, to the Kubernetes API server running in the Kubernetes controller node.
  2. The Kubernetes API server receives the manifest describing the S3 bucket and determines if Alice has permissions to create a custom resource of kind s3.services.k8s.aws/Bucket, and that the custom resource is properly formatted.
  3. If Alice is authorized and the custom resource is valid, the Kubernetes API server writes the custom resource to its etcd data store.
  4. It then responds to Alice that the custom resource has been created.
  5. At this point, the ACK service controller for Amazon S3, which is running on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified that a new custom resource of kind s3.services.k8s.aws/Bucket has been created.
  6. The ACK service controller for Amazon S3 then communicates with the Amazon S3 API, calling the S3 CreateBucket API to create the bucket in AWS.
  7. After communicating with the Amazon S3 API, the ACK service controller calls the Kubernetes API server to update the custom resource’s status with information it received from Amazon S3.

Key components

The new inference capabilities build upon SageMaker’s real-time inference endpoints. As before, you create the SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model is configured in a new construct, an inference component. Here, you specify the number of accelerators and amount of memory you want to allocate to each copy of a model, together with the model artifacts, container image, and number of model copies to deploy.

You can use the new inference capabilities from Amazon SageMaker Studio, the SageMaker Python SDK, AWS SDKs, and AWS Command Line Interface (AWS CLI). They are also supported by AWS CloudFormation. Now you also can use them with SageMaker Operators for Kubernetes.

Solution overview

For this demo, we use the SageMaker controller to deploy a copy of the Dolly v2 7B model and a copy of the FLAN-T5 XXL model from the Hugging Face Model Hub on a SageMaker real-time endpoint using the new inference capabilities.

Prerequisites

To follow along, you should have a Kubernetes cluster with the SageMaker ACK controller v1.2.9 or above installed. For instructions on how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon Elastic Compute Cloud (Amazon EC2) Linux managed nodes using eksctl, see Getting started with Amazon EKS – eksctl. For instructions on installing the SageMaker controller, refer to Machine Learning with the ACK SageMaker Controller.

You need access to accelerated instances (GPUs) for hosting the LLMs. This solution uses one instance of ml.g5.12xlarge; you can check the availability of these instances in your AWS account and request these instances as needed via a Service Quotas increase request, as shown in the following screenshot.

Service Quotas Increase Request

Create an inference component

To create your inference component, define the EndpointConfig, Endpoint, Model, and InferenceComponent YAML files, similar to the ones shown in this section. Use kubectl apply -f <yaml file> to create the Kubernetes resources.

You can list the status of the resource via kubectl describe <resource-type>; for example, kubectl describe inferencecomponent.

You can also create the inference component without a model resource. Refer to the guidance provided in the API documentation for more details.

EndpointConfig YAML

The following is the code for the EndpointConfig file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: inference-component-endpoint-config
spec:
  endpointConfigName: inference-component-endpoint-config
  executionRoleARN: <EXECUTION_ROLE_ARN>
  productionVariants:
  - variantName: AllTraffic
    instanceType: ml.g5.12xlarge
    initialInstanceCount: 1
    routingConfig:
      routingStrategy: LEAST_OUTSTANDING_REQUESTS

Endpoint YAML

The following is the code for the Endpoint file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: inference-component-endpoint
spec:
  endpointName: inference-component-endpoint
  endpointConfigName: inference-component-endpoint-config

Model YAML

The following is the code for the Model file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: dolly-v2-7b
spec:
  modelName: dolly-v2-7b
  executionRoleARN: <EXECUTION_ROLE_ARN>
  containers:
  - image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    environment:
      HF_MODEL_ID: databricks/dolly-v2-7b
      HF_TASK: text-generation
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: flan-t5-xxl
spec:
  modelName: flan-t5-xxl
  executionRoleARN: <EXECUTION_ROLE_ARN>
  containers:
  - image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    environment:
      HF_MODEL_ID: google/flan-t5-xxl
      HF_TASK: text-generation

InferenceComponent YAMLs

In the following YAML files, given that the ml.g5.12xlarge instance comes with 4 GPUs, we are allocating 2 GPUs, 2 CPUs and 1,024 MB of memory to each model:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-flan
spec:
  inferenceComponentName: inference-component-flan
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: flan-t5-xxl
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1

Invoke models

You can now invoke the models using the following code:

import boto3
import json

sm_runtime_client = boto3.client(service_name="sagemaker-runtime")
payload = {"inputs": "Why is California a great place to live?"}

response_dolly = sm_runtime_client.invoke_endpoint(
    EndpointName="inference-component-endpoint",
    InferenceComponentName="inference-component-dolly",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
result_dolly = json.loads(response_dolly['Body'].read().decode())
print(result_dolly)

response_flan = sm_runtime_client.invoke_endpoint(
    EndpointName="inference-component-endpoint",
    InferenceComponentName="inference-component-flan",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
result_flan = json.loads(response_flan['Body'].read().decode())
print(result_flan)

Update an inference component

To update an existing inference component, you can update the YAML files and then use kubectl apply -f <yaml file>. The following is an example of an updated file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 4 # Update the numberOfCPUCoresRequired.
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1

Delete an inference component

To delete an existing inference component, use the command kubectl delete -f <yaml file>.

Availability and pricing

The new SageMaker inference capabilities are available today in AWS Regions US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo). For pricing details, visit Amazon SageMaker Pricing.

Conclusion

In this post, we showed how to use SageMaker ACK Operators to deploy SageMaker inference components. Fire up your Kubernetes cluster and deploy your FMs using the new SageMaker inference capabilities today!


About the Authors

Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Suryansh Singh is a Software Development Engineer at AWS SageMaker and works on developing ML-distributed infrastructure solutions for AWS customers at scale.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Johna LiuJohna Liu is a Software Development Engineer in the Amazon SageMaker team. Her current work focuses on helping developers efficiently host machine learning models and improve inference performance. She is passionate about spatial data analysis and using AI to solve societal problems.

Read More

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 2

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 2

In Part 1 of this series, we presented a solution that used the Amazon Titan Multimodal Embeddings model to convert individual slides from a slide deck into embeddings. We stored the embeddings in a vector database and then used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) model to generate text responses to user questions based on the most similar slide retrieved from the vector database. We used AWS services including Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless in this solution.

In this post, we demonstrate a different approach. We use the Anthropic Claude 3 Sonnet model to generate text descriptions for each slide in the slide deck. These descriptions are then converted into text embeddings using the Amazon Titan Text Embeddings model and stored in a vector database. Then we use the Claude 3 Sonnet model to generate answers to user questions based on the most relevant text description retrieved from the vector database.

You can test both approaches for your dataset and evaluate the results to see which approach gives you the best results. In Part 3 of this series, we evaluate the results of both methods.

Solution overview

The solution provides an implementation for answering questions using information contained in text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by large language models (LLMs). In this series, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

This solution includes the following components:

  • Amazon Titan Text Embeddings is a text embeddings model that converts natural language text, including single words, phrases, or even large documents, into numerical representations that can be used to power use cases such as search, personalization, and clustering based on semantic similarity.
  • Claude 3 Sonnet is the next generation of state-of-the-art models from Anthropic. Sonnet is a versatile tool that can handle a wide range of tasks, from complex reasoning and analysis to rapid outputs, as well as efficient search and retrieval across vast amounts of information.
  • OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Text Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline API to deliver data to the OpenSearch Serverless vector store.

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generating descriptions and text embeddings for each image. We then populate the vector data store with the embeddings and text description for each slide. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into text embeddings. A similarity search is run on the vector database to find a text description corresponding to a slide that could potentially contain answers to the user question. We then provide the slide description and the user question to the Claude 3 Sonnet model to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

The workflow consists of the following steps:

  1. Slides are converted to image files (one per slide) in JPG format and passed to the Claude 3 Sonnet model to generate text description.
  2. The data is sent to the Amazon Titan Text Embeddings model to generate embeddings. In this series, we use the slide deck Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023 to demonstrate the solution. The sample deck has 31 slides, therefore we generate 31 sets of vector embeddings, each with 1536 dimensions. We add additional metadata fields to perform rich search queries using OpenSearch’s powerful search capabilities.
  3. The embeddings are ingested into an OSI pipeline using an API call.
  4. The OSI pipeline ingests the data as documents into an OpenSearch Serverless index. The index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.

The following diagram illustrates the user interaction architecture.

The workflow consists of the following steps:

  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Amazon Titan Text Embeddings model accessed using Amazon Bedrock. An OpenSearch Service vector search is performed using these embeddings. We perform a k-nearest neighbor (k-NN) search to retrieve the most relevant embeddings matching the user query.
  3. The metadata of the response from OpenSearch Serverless contains a path to the image and description corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image description. The prompt is provided to Claude 3 Sonnet hosted on Amazon Bedrock.
  5. The result of this inference is returned to the user.

We discuss the steps for both stages in the following sections, and include details about the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Claude 3 Sonnet and Amazon Titan Text Embeddings models hosted on Amazon Bedrock. Make sure that these models are enabled for use by navigating to the Model access page on the Amazon Bedrock console.

If models are enabled, the Access status will state Access granted.

If the models are not available, enable access by choosing Manage model access, selecting the models, and choosing Request model access. The models are enabled for use immediately.

Use AWS CloudFormation to create the solution stack

You can use AWS CloudFormation to create the solution stack. If you have created the solution for Part 1 in the same AWS account, be sure to delete that before creating this stack.

AWS Region Link
us-east-1
us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the values for MultimodalCollectionEndpoint and OpenSearchPipelineEndpoint. You use these in the subsequent steps.

The CloudFormation template creates the following resources:

  • IAM roles – The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions, as discussed in Security best practices.
    • SMExecutionRole with Amazon Simple Storage Service (Amazon S3), SageMaker, OpenSearch Service, and Amazon Bedrock full access.
    • OSPipelineExecutionRole with access to the S3 bucket and OSI actions.
  • SageMaker notebook – All code for this post is run using this notebook.
  • OpenSearch Serverless collection – This is the vector database for storing and retrieving embeddings.
  • OSI pipeline – This is the pipeline for ingesting data into OpenSearch Serverless.
  • S3 bucket – All data for this post is stored in this bucket.

The CloudFormation template sets up the pipeline configuration required to configure the OSI pipeline with HTTP as source and the OpenSearch Serverless index as sink. The SageMaker notebook 2_data_ingestion.ipynb displays how to ingest data into the pipeline using the Requests HTTP library.

The CloudFormation template also creates network, encryption and data access policies required for your OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

The CloudFormation template name and OpenSearch Service index name are referenced in the SageMaker notebook 3_rag_inference.ipynb. If you change the default names, make sure you update them in the notebook.

Test the solution

After you have created the CloudFormation stack, you can test the solution. Complete the following steps:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select MultimodalNotebookInstance and choose Open JupyterLab.
  3. In File Browser, traverse to the notebooks folder to see notebooks and supporting files.

The notebooks are numbered in the sequence in which they run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  1. Choose 1_data_prep.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook will download a publicly available slide deck, convert each slide into the JPG file format, and upload these to the S3 bucket.

  1. Choose 2_data_ingestion.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

In this notebook, you create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:

session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "type": "text"
      },
      "slide_text": {
        "type": "text"
      },
      "slide_number": {
        "type": "text"
      },
      "metadata": { 
        "properties" :
          {
            "filename" : {
              "type" : "text"
            },
            "desc":{
              "type": "text"
            }
          }
      }
    }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")

You use the Claude 3 Sonnet and Amazon Titan Text Embeddings models to convert the JPG images created in the previous notebook into vector embeddings. These embeddings and additional metadata (such as the S3 path and description of the image file) are stored in the index along with the embeddings. The following code snippet shows how Claude 3 Sonnet generates image descriptions:

def get_img_desc(image_file_path: str, prompt: str):
    # read the file, MAX image size supported is 2048 * 2048 pixels
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = image_file.read().decode('utf-8')
  
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": input_image_b64
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )
    
    response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        body=body
    )

    resp_body = json.loads(response['body'].read().decode("utf-8"))
    resp_text = resp_body['content'][0]['text'].replace('"', "'")

    return resp_text

The image descriptions are passed to the Amazon Titan Text Embeddings model to generate vector embeddings. These embeddings and additional metadata (such as the S3 path and description of the image file) are stored in the index along with the embeddings. The following code snippet shows the call to the Amazon Titan Text Embeddings model:

def get_text_embedding(bedrock: botocore.client, prompt_data: str) -> np.ndarray:
    body = json.dumps({
        "inputText": prompt_data,
    })    
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.TITAN_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response['body'].read())
        embedding = response_body.get('embedding')
    except Exception as e:
        logger.error(f"exception={e}")
        embedding = None

    return embedding

The data is ingested into the OpenSearch Serverless index by making an API call to the OSI pipeline. The following code snippet shows the call made using the Requests HTTP library:

data = json.dumps([{
    "image_path": input_image_s3, 
    "slide_text": resp_text, 
    "slide_number": slide_number, 
    "metadata": {
        "filename": obj_name, 
        "desc": "" 
    }, 
    "vector_embedding": embedding
}])

r = requests.request(
    method='POST', 
    url=osi_endpoint, 
    data=data,
    auth=AWSSigV4('osis'))
  1. Choose 3_rag_inference.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: you convert the user question into embeddings, find a similar image description from the vector database, and provide the retrieved description to Claude 3 Sonnet to generate an answer to the user question. You use the following prompt template:

  llm_prompt: str = """

  Human: Use the summary to provide a concise answer to the question to the best of your abilities. If you cannot answer the question from the context then say I do not know, do not make up an answer.
  <question>
  {question}
  </question>

  <summary>
  {summary}
  </summary>

  Assistant:"""

The following code snippet provides the RAG workflow:

def get_llm_response(bedrock: botocore.client, question: str, summary: str) -> str:
    prompt = llm_prompt.format(question=question, summary=summary)
    
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })
        
    try:
        response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        body=body)

        response_body = json.loads(response['body'].read().decode("utf-8"))
        llm_response = response_body['content'][0]['text'].replace('"', "'")
        
    except Exception as e:
        logger.error(f"exception while slide_text={summary[:10]}, exception={e}")
        llm_response = None

    return llm_response


# create prompt and convert to embeddings question: str = "How does Inf2 compare in performance to comparable EC2 instances? I need numbers." text_embedding = get_text_embedding(bedrock, question) # vector db search vector_db_response: Dict = find_similar_data(text_embeddings) # download image for local notebook display s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path') logger.info(f"going to answer the question="{question}" using the image "{s3_img_path}"") !aws s3 cp {s3_img_path} . local_img_path = os.path.basename(s3_img_path) display(filename=local_img_path) # Ask Claude 3 Sonnet slide_text = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('slide_text') llm_response = get_llm_response(bedrock, question, slide_text) print(llm_response)

Results

The following table contains some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by Claude 3 Sonnet. The Image column shows the k-NN slide match returned by the OpenSearch Serverless vector search.

Multimodal RAG results

Question Answer Image
How does Inf2 compare in performance to comparable EC2 instances? I need numbers. Based on the information provided in the summary, the key performance comparison numbers given for the Amazon EC2 Inf2 instances compared to other instances are: 1. Up to 4x higher throughput. 2. Up to 10x lower latency. The summary does not provide specific numbers comparing the Inf2 instances to other EC2 instance types. However, it highlights the throughput and latency improvements as the main performance advantages of the Inf2 instances over comparable instances for running generative AI models.
As per the AI/ML flywheel, what do the AWS AI/ML services provide? Based on the summary, the AWS AI/ML services provide a range of capabilities that fuel an AI/ML flywheel. This flywheel consists of innovation in developing new and better algorithms, enabling more use-cases and applications, driving wider adoption and demand, which in turn leads to further investment in research and optimization to drive more innovation. The AWS AI/ML services seem to offer the tools, resources, and infrastructure to support this continuous cycle of innovation, application development, adoption, and reinvestment in the field of artificial intelligence and machine learning.
Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3? According to the information provided in the summary, GPT-3 from 2020 had 175B (175 billion) parameters, while GPT-2 from 2019 had 1.5B (1.5 billion) parameters. Therefore, the numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion parameters. GPT-3 has approximately 115 times more parameters than GPT-2.
What are quarks in particle physics? Unfortunately, the given summary does not contain any information about quarks in particle physics. The summary describes an image related to the progression of natural language processing and generative AI technologies, but it does not mention anything about particle physics or the concept of quarks.

Query your index

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data.

Cleanup

To avoid incurring future charges, delete the resources. You can do this by deleting the stack using the AWS CloudFormation console.

Conclusion

Enterprises generate new content all the time, and slide decks are a common way to share and disseminate information internally within the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks.

You can use this solution and the power of multimodal FMs such as the Amazon Titan Text Embeddings and Claude 3 Sonnet to discover new information or uncover new perspectives on content in slide decks. You can try different Claude models available on Amazon Bedrock by updating the CLAUDE_MODEL_ID in the globals.py file.

This is Part 2 of a three-part series. We used the Amazon Titan Multimodal Embeddings and the LLaVA model in Part 1. In Part 3, we will compare the approaches from Part 1 and Part 2.

Portions of this code are released under the Apache 2.0 License.


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect at Amazon Web Services. She focuses on providing technical guidance in a variety of technical domains, including AI/ML. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup. She is passionate about sharing knowledge and fostering interest in emerging talent.

Archana Inapudi is a Senior Solutions Architect at AWS, supporting a strategic customer. She has over a decade of cross-industry expertise leading strategic technical initiatives. Archana is an aspiring member of the AI/ML technical field community at AWS. Prior to joining AWS, Archana led a migration from traditional siloed data sources to Hadoop at a healthcare company. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services, supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital-centered customers.

Read More

Scale AI training and inference for drug discovery through Amazon EKS and Karpenter

Scale AI training and inference for drug discovery through Amazon EKS and Karpenter

This is a guest post co-written with the leadership team of Iambic Therapeutics.

Iambic Therapeutics is a drug discovery startup with a mission to create innovative AI-driven technologies to bring better medicines to cancer patients, faster.

Our advanced generative and predictive artificial intelligence (AI) tools enable us to search the vast space of possible drug molecules faster and more effectively. Our technologies are versatile and applicable across therapeutic areas, protein classes, and mechanisms of action. Beyond creating differentiated AI tools, we have established an integrated platform that merges AI software, cloud-based data, scalable computation infrastructure, and high-throughput chemistry and biology capabilities. The platform both enables our AI—by supplying data to refine our models—and is enabled by it, capitalizing on opportunities for automated decision-making and data processing.

We measure success by our ability to produce superior clinical candidates to address urgent patient need, at unprecedented speed: we advanced from program launch to clinical candidates in just 24 months, significantly faster than our competitors.

In this post, we focus on how we used Karpenter on Amazon Elastic Kubernetes Service (Amazon EKS) to scale AI training and inference, which are core elements of the Iambic discovery platform.

The need for scalable AI training and inference

Every week, Iambic performs AI inference across dozens of models and millions of molecules, serving two primary use cases:

  • Medicinal chemists and other scientists use our web application, Insight, to explore chemical space, access and interpret experimental data, and predict properties of newly designed molecules. All of this work is done interactively in real time, creating a need for inference with low latency and medium throughput.
  • At the same time, our generative AI models automatically design molecules targeting improvement across numerous properties, searching millions of candidates, and requiring enormous throughput and medium latency.

Guided by AI technologies and expert drug hunters, our experimental platform generates thousands of unique molecules each week, and each is subjected to multiple biological assays. The generated data points are automatically processed and used to fine-tune our AI models every week. Initially, our model fine-tuning took hours of CPU time, so a framework for scaling model fine-tuning on GPUs was imperative.

Our deep learning models have non-trivial requirements: they are gigabytes in size, are numerous and heterogeneous, and require GPUs for fast inference and fine-tuning. Looking to cloud infrastructure, we needed a system that allows us to access GPUs, scale up and down quickly to handle spiky, heterogeneous workloads, and run large Docker images.

We wanted to build a scalable system to support AI training and inference. We use Amazon EKS and were looking for the best solution to auto scale our worker nodes. We chose Karpenter for Kubernetes node auto scaling for a number of reasons:

  • Ease of integration with Kubernetes, using Kubernetes semantics to define node requirements and pod specs for scaling
  • Low-latency scale-out of nodes
  • Ease of integration with our infrastructure as code tooling (Terraform)

The node provisioners support effortless integration with Amazon EKS and other AWS resources such as Amazon Elastic Compute Cloud (Amazon EC2) instances and Amazon Elastic Block Store volumes. The Kubernetes semantics used by the provisioners support directed scheduling using Kubernetes constructs such as taints or tolerations and affinity or anti-affinity specifications; they also facilitate control over the number and types of GPU instances that may be scheduled by Karpenter.

Solution overview

In this section, we present a generic architecture that is similar to the one we use for our own workloads, which allows elastic deployment of models using efficient auto scaling based on custom metrics.

The following diagram illustrates the solution architecture.

The architecture deploys a simple service in a Kubernetes pod within an EKS cluster. This could be a model inference, data simulation, or any other containerized service, accessible by HTTP request. The service is exposed behind a reverse-proxy using Traefik. The reverse proxy collects metrics about calls to the service and exposes them via a standard metrics API to Prometheus. The Kubernetes Event Driven Autoscaler (KEDA) is configured to automatically scale the number of service pods, based on the custom metrics available in Prometheus. Here we use the number of requests per second as a custom metric. The same architectural approach applies if you choose a different metric for your workload.

Karpenter monitors for any pending pods that can’t run due to lack of sufficient resources in the cluster. If such pods are detected, Karpenter adds more nodes to the cluster to provide the necessary resources. Conversely, if there are more nodes in the cluster than what is needed by the scheduled pods, Karpenter removes some of the worker nodes and the pods get rescheduled, consolidating them on fewer instances. The number of HTTP requests per second and number of nodes can be visualized using a Grafana dashboard. To demonstrate auto scaling, we run one or more simple load-generating pods, which send HTTP requests to the service using curl.

Solution deployment

In the step-by-step walkthrough, we use AWS Cloud9 as an environment to deploy the architecture. This enables all steps to be completed from a web browser. You can also deploy the solution from a local computer or EC2 instance.

To simplify deployment and improve reproducibility, we follow the principles of the do-framework and the structure of the depend-on-docker template. We clone the aws-do-eks project and, using Docker, we build a container image that is equipped with the necessary tooling and scripts. Within the container, we run through all the steps of the end-to-end walkthrough, from creating an EKS cluster with Karpenter to scaling EC2 instances.

For the example in this post, we use the following EKS cluster manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-karpenter
version: '1.28'
region: us-west-2
tags:
  karpenter.sh/discovery: do-eks-yaml-karpenter
iam:
  withOIDC: true
addons:
  - name: aws-ebs-csi-driver
    version: v1.26.0-eksbuild.1
wellKnownPolicies:
  ebsCSIController: true
managedNodeGroups:
  - name: c5-xl-do-eks-karpenter-ng
    instanceType: c5.xlarge
    instancePrefix: c5-xl
    privateNetworking: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 10
    volumeSize: 300
    iam:
      withAddonPolicies:
        cloudWatch: true
        ebs: true

This manifest defines a cluster named do-eks-yaml-karpenter with the EBS CSI driver installed as an add-on. A managed node group with two c5.xlarge nodes is included to run system pods that are needed by the cluster. The worker nodes are hosted in private subnets, and the cluster API endpoint is public by default.

You could also use an existing EKS cluster instead of creating one. We deploy Karpenter by following the instructions in the Karpenter documentation, or by running the following script, which automates the deployment instructions.

The following code shows the Karpenter configuration we use in this example:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata: null
    labels:
      cluster-name: do-eks-yaml-karpenter
    annotations:
      purpose: karpenter-example
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
        requirements:
          - key: karpenter.sh/capacity-type
            operator: In
            values:
              - spot
              - on-demand
          - key: karpenter.k8s.aws/instance-category
            operator: In
            values:
              - c
              - m
              - r
              - g
              - p
          - key: karpenter.k8s.aws/instance-generation
            operator: Gt
            values:
              - '2'
  disruption:
    consolidationPolicy: WhenUnderutilized
    #consolidationPolicy: WhenEmpty
    #consolidateAfter: 30s
    expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
      karpenter.sh/discovery: "do-eks-yaml-karpenter"
  securityGroupSelectorTerms:
    - tags:
      karpenter.sh/discovery: "do-eks-yaml-karpenter"
  role: "KarpenterNodeRole-do-eks-yaml-karpenter"
  tags:
    app: autoscaling-test
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 80Gi
        volumeType: gp3
        iops: 10000
        deleteOnTermination: true
        throughput: 125
  detailedMonitoring: true

We define a default Karpenter NodePool with the following requirements:

  • Karpenter can launch instances from both spot and on-demand capacity pools
  • Instances must be from the “c” (compute optimized), “m” (general purpose), “r” (memory optimized), or “g” and “p” (GPU accelerated) computing families
  • Instance generation must be greater than 2; for example, g3 is acceptable, but g2 is not

The default NodePool also defines disruption policies. Underutilized nodes will be removed so pods can be consolidated to run on fewer or smaller nodes. Alternatively, we can configure empty nodes to be removed after the specified time period. The expireAfter setting specifies the maximum lifetime of any node, before it is stopped and replaced if necessary. This helps reduce security vulnerabilities as well as avoid issues that are typical for nodes with long uptimes, such as file fragmentation or memory leaks.

By default, Karpenter provisions nodes with a small root volume, which can be insufficient for running AI or machine learning (ML) workloads. Some of the deep learning container images can be tens of GB in size, and we need to make sure there is enough storage space on the nodes to run pods using these images. To do that, we define EC2NodeClass with blockDeviceMappings, as shown in the preceding code.

Karpenter is responsible for auto scaling at the cluster level. To configure auto scaling at the pod level, we use KEDA to define a custom resource called ScaledObject, as shown in the following code:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-prometheus-hpa
  namespace: hpa-example
spec:
  scaleTargetRef:
    name: php-apache
  minReplicaCount: 1
  cooldownPeriod: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus- server.prometheus.svc.cluster.local:80
        metricName: http_requests_total
        threshold: '1'
        query: rate(traefik_service_requests_total{service="hpa-example-php-apache-80@kubernetes",code="200"}[2m])

The preceding manifest defines a ScaledObject named keda-prometheus-hpa, which is responsible for scaling the php-apache deployment and always keeps at least one replica running. It scales the pods of this deployment based on the metric http_requests_total available in Prometheus obtained by the specified query, and targets to scale up the pods so that each pod serves no more than one request per second. It scales down the replicas after the request load has been below the threshold for longer than 30 seconds.

The deployment spec for our example service contains the following resource requests and limits:

resources:
  limits:
    cpu: 500m
    nvidia.com/gpu: 1
  requests:
    cpu: 200m
    nvidia.com/gpu: 1

With this configuration, each of the service pods will use exactly one NVIDIA GPU. When new pods are created, they will be in Pending state until a GPU is available. Karpenter adds GPU nodes to the cluster as needed to accommodate the pending pods.

A load-generating pod sends HTTP requests to the service with a pre-set frequency. We increase the number of requests by increasing the number of replicas in the load-generator deployment.

A full scaling cycle with utilization-based node consolidation is visualized in a Grafana dashboard. The following dashboard shows the number of nodes in the cluster by instance type (top), the number of requests per second (bottom left), and the number of pods (bottom right).

Scaling Dashboard 1

We start with just the two c5.xlarge CPU instances that the cluster was created with. Then we deploy one service instance, which requires a single GPU. Karpenter adds a g4dn.xlarge instance to accommodate this need. We then deploy the load generator, which causes KEDA to add more service pods and Karpenter adds more GPU instances. After optimization, the state settles on one p3.8xlarge instance with 8 GPUs and one g5.12xlarge instance with 4 GPUs.

When we scale the load-generating deployment to 40 replicas, KEDA creates additional service pods to maintain the required request load per pod. Karpenter adds g4dn.metal and g4dn.12xlarge nodes to the cluster to provide the needed GPUs for the additional pods. In the scaled state, the cluster contains 16 GPU nodes and serves about 300 requests per second. When we scale down the load generator to 1 replica, the reverse process takes place. After the cooldown period, KEDA reduces the number of service pods. Then as fewer pods run, Karpenter removes the underutilized nodes from the cluster and the service pods get consolidated to run on fewer nodes. When the load generator pod is removed, a single service pod on a single g4dn.xlarge instance with 1 GPU remains running. When we remove the service pod as well, the cluster is left in the initial state with only two CPU nodes.

We can observe this behavior when the NodePool has the setting consolidationPolicy: WhenUnderutilized.

With this setting, Karpenter dynamically configures the cluster with as few nodes as possible, while providing sufficient resources for all pods to run and also minimizing cost.

The scaling behavior shown in the following dashboard is observed when the NodePool consolidation policy is set to WhenEmpty, along with consolidateAfter: 30s.

Scaling Dashboard 2

In this scenario, nodes are stopped only when there are no pods running on them after the cool-off period. The scaling curve appears smooth, compared to the utilization-based consolidation policy; however, it can be seen that more nodes are used in the scaled state (22 vs. 16).

Overall, combining pod and cluster auto scaling makes sure that the cluster scales dynamically with the workload, allocating resources when needed and removing them when not in use, thereby maximizing utilization and minimizing cost.

Outcomes

Iambic used this architecture to enable efficient use of GPUs on AWS and migrate workloads from CPU to GPU. By using EC2 GPU powered instances, Amazon EKS, and Karpenter, we were able to enable faster inference for our physics-based models and fast experiment iteration times for applied scientists who rely on training as a service.

The following table summarizes some of the time metrics of this migration.

Task CPUs GPUs
Inference using diffusion models for physics-based ML models 3,600 seconds

100 seconds

(due to inherent batching of GPUs)

ML model training as a service 180 minutes 4 minutes

The following table summarizes some of our time and cost metrics.

Task Performance/Cost
CPUs GPUs
ML model training

240 minutes

average $0.70 per training task

20 minutes

average $0.38 per training task

Summary

In this post, we showcased how Iambic used Karpenter and KEDA to scale our Amazon EKS infrastructure to meet the latency requirements of our AI inference and training workloads. Karpenter and KEDA are powerful open source tools that help auto scale EKS clusters and workloads running on them. This helps optimize compute costs while meeting performance requirements. You can check out the code and deploy the same architecture in your own environment by following the complete walkthrough in this GitHub repo.


About the Authors

Matthew Welborn is the director of Machine Learning at Iambic Therapeutics. He and his team leverage AI to accelerate the identification and development of novel therapeutics, bringing life-saving medicines to patients faster.

Paul Whittemore is a Principal Engineer at Iambic Therapeutics. He supports delivery of the infrastructure for the Iambic AI-driven drug discovery platform.

Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS.

Read More

Generate customized, compliant application IaC scripts for AWS Landing Zone using Amazon Bedrock

Generate customized, compliant application IaC scripts for AWS Landing Zone using Amazon Bedrock

Migrating to the cloud is an essential step for modern organizations aiming to capitalize on the flexibility and scale of cloud resources. Tools like Terraform and AWS CloudFormation are pivotal for such transitions, offering infrastructure as code (IaC) capabilities that define and manage complex cloud environments with precision. However, despite its benefits, IaC’s learning curve, and the complexity of adhering to your organization’s and industry-specific compliance and security standards, could slow down your cloud adoption journey. Organizations typically counter these hurdles by investing in extensive training programs or hiring specialized personnel, which often leads to increased costs and delayed migration timelines.

Generative artificial intelligence (AI) with Amazon Bedrock directly addresses these challenges. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock empowers teams to generate Terraform and CloudFormation scripts that are custom fitted to organizational needs while seamlessly integrating compliance and security best practices. Traditionally, cloud engineers learning IaC would manually sift through documentation and best practices to write compliant IaC scripts. With Amazon Bedrock, teams can input high-level architectural descriptions and use generative AI to generate a baseline configuration of Terraform scripts. These generated scripts are tailored to meet your organization’s unique requirements while conforming to industry standards for security and compliance. These scripts serve as a foundational starting point, requiring further refinement and validation to make sure they meet production-level standards.

This solution not only accelerates the migration process but also provides a standardized and secure cloud infrastructure. Additionally, it offers beginner cloud engineers initial script drafts as standard templates to build upon, facilitating their IaC learning journey.

As you navigate the complexities of cloud migration, the need for a structured, secure, and compliant environment is paramount. AWS Landing Zone addresses this need by offering a standardized approach to deploying AWS resources. This makes sure your cloud foundation is built according to AWS best practices from the start. With AWS Landing Zone, you eliminate the guesswork in security configurations, resource provisioning, and account management. It’s particularly beneficial for organizations looking to scale without compromising on governance or control, providing a clear path to a robust and efficient cloud setup.

In this post, we show you how to generate customized, compliant IaC scripts for AWS Landing Zone using Amazon Bedrock.

AWS Landing Zone architecture in the context of cloud migration

AWS Landing Zone can help you set up a secure, multi-account AWS environment based on AWS best practices. It provides a baseline environment to get started with a multi-account architecture, automate the setup of new accounts, and centralize compliance, security, and identity management. The following is an example of a customized Terraform-based AWS Landing Zone solution, in which each application resides in its own AWS account.

The high-level workflow includes the following components:

  • Module provisioning – Different platform teams across various domains, such as databases, containers, data management, networking, and security, develop and publish certified or custom modules. These are delivered through pipelines to a Terraform private module registry, which is maintained by the organization for consistency and standardization.
  • Account vending machine layer – The account vending machine (AVM) layer uses either AWS Control Tower, AWS Account Factory for Terraform (AFT), or a custom landing zone solution to vend accounts. In this post, we refer to these solutions collectively as the AVM layer. When application owners submit a request to the AVM layer, it processes the input parameters from the request to provision a target AWS account. This account is then provisioned with tailored infrastructure components through AVM customizations, which include AWS Control Tower customizations or AFT customizations.
  • Application infrastructure layer – In this layer, application teams deploy their infrastructure components into the provisioned AWS accounts. This is achieved by writing Terraform code within an application-specific repository. The Terraform code calls upon the modules previously published to the Terraform private registry by the platform teams.

Overcoming on-premises IaC migration challenges with generative AI

Teams maintaining on-premises applications often encounter a learning curve with Terraform, a key tool for IaC in AWS environments. This skill gap can be a significant hurdle in cloud migration efforts. Amazon Bedrock, with its generative AI capabilities, plays an essential role in mitigating this challenge. It facilitates the automation of Terraform code creation for the application infrastructure layer, empowering teams with limited Terraform experience to make an efficient transition to AWS.

Amazon Bedrock generates Terraform code from architectural descriptions. The generated code is custom and standardized based on organizational best practices, security, and regulatory guidelines. This standardization is made possible by using advanced prompts in conjunction with Knowledge Bases for Amazon Bedrock, which stores information on organization-specific Terraform modules. This solution uses Retrieval Augmented Generation (RAG) to enrich the input prompt to Amazon Bedrock with details from the knowledge base, making sure the output Terraform configuration and README contents are compliant with your organization’s Terraform best practices and guidelines.

The following diagram illustrates this architecture.

The workflow consists of the following steps:

  1. The process begins with account vending, where application owners submit a request for a new AWS account. This invokes the AVM, which processes the request parameters to provision the target AWS account.
  2. An architecture description for an application slated for migration is passed as one of the inputs to the AVM layer.
  3. After the account is provisioned, AVM customizations are applied. This can include AWS Control Tower customizations or AFT customizations that set up the account with the necessary infrastructure components and configurations in line with organizational policies.
  4. In parallel, the AVM layer invokes a Lambda function to generate Terraform code. This function enriches the architecture description with a customized prompt, and utilizes RAG to further enhance the prompt with organization-specific coding guidelines from the Knowledge Base for Bedrock. This Knowledge Base includes tailored best practices, security guardrails, and guidelines specific to the organization. See an illustrative example of organization specific Terraform module specifications and  guidelines uploaded to the Knowledge Base.
  5. Before deployment, the initial draft of the Terraform code is thoroughly reviewed by cloud engineers or an automated code review system to confirm that it meets all technical and compliance standards.
  6. The reviewed and updated Terraform scripts are then used to deploy infrastructure components into the newly provisioned AWS account, setting up compute, storage, and networking resources required for the application.

Solution overview

The AWS Landing Zone deployment uses a Lambda function for generating Terraform scripts from architectural inputs. This function, which is central to the operation, translates these inputs into compliant code, using Amazon Bedrock and Knowledge Bases for Amazon Bedrock. The output is then stored in a GitHub repository, corresponding to the specific application in migration. The following sections detail the prerequisites and specific steps needed to implement this solution.

Prerequisites

You should have the following:

Configure the Lambda function to generate custom code

This Lambda function is a key component in automating the creation of customized, compliant Terraform configurations for AWS services. It commits the generated configurations directly to a designated GitHub repository, aligning with organizational best practices. For the function code, refer to the following GitHub repo. For creating lambda function, please follow instructions.

The following diagram illustrates the workflow of the function.

The workflow includes the following steps:

  1. The function is invoked by an event from the AVM layer, containing the architecture description.
  2. The function retrieves and uses Terraform module definitions from the knowledge base.
  3. The function invokes the Amazon Bedrock model twice, following recommended prompt engineering guidelines. The function applies RAG to enrich the input prompt with the Terraform module information, making sure the output code meets organizational best practices.
    • First, generate Terraform configurations following organizational coding guidelines and include Terraform module details from the knowledge base. For example, the prompt could be: “Generate Terraform configurations for AWS services. Follow security best practices by using IAM roles and least privilege permissions. Include all necessary parameters, with default values. Add comments explaining the overall architecture and the purpose of each resource.”
    • Second, create a detailed README file. For example: “Generate a detailed README for the Terraform configuration based on AWS services. Include sections on security improvements, cost optimization tips following the AWS Well-Architected Framework. Also, include detailed Cost Breakdown for each AWS service used with hourly rates and total daily and monthly costs.”
  4. It commits the generated Terraform configuration and the README to the GitHub repository, providing traceability and transparency.
  5. Lastly, it responds with success, including URLs to the committed GitHub files, or returns detailed error information for troubleshooting.

Configure Knowledge Bases for Amazon Bedrock

Follow these steps to set up your knowledge base in Amazon Bedrock:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. Enter a clear and descriptive name that reflects the purpose of your knowledge base, such as AWS Account Setup Knowledge Base For Amazon Bedrock.
  4. Assign a pre-configured IAM role with the necessary permissions. It’s typically best to let Amazon Bedrock create this role for you to make sure it has the correct permissions.
  5. Upload a JSON file to an S3 bucket with encryption enabled for security. This file should contain a structured list of AWS services and Terraform modules. For the JSON structure, use the following example from the GitHub repository.
  6. Choose the default embeddings model.
  7. Allow Amazon Bedrock to create and manage the vector store for you in Amazon OpenSearch Service.
  8. Review the information for accuracy. Pay special attention to the S3 bucket URI and IAM role details.
  9. Create your knowledge base.

After you deploy and configure these components, when your AWS Landing Zone solution invokes the Lambda function, the following files are generated:

  • A Terraform configuration file – This file specifies the infrastructure setup.
  • A comprehensive README file – This file documents the security standards embedded within the code, confirming that they align with the security practices outlined in the initial sections. Additionally, this README includes an architectural summary, cost optimization tips, and a detailed cost breakdown for the resources described in the Terraform configuration.

The following screenshot shows an example of the Terraform configuration file.

The following screenshot shows an example of the README file.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Lambda function if it’s no longer required.
  2. Empty and delete the S3 bucket used for Terraform state storage.
  3. Remove the generated Terraform scripts and README file from the GitHub repo.
  4. Delete the knowledge base if it’s no longer needed.

Conclusion

The generative AI capabilities of Amazon Bedrock not only streamline the creation of compliant Terraform scripts for AWS deployments, but also act as a pivotal learning aid for beginner cloud engineers transitioning on-premises applications to AWS. This approach accelerates the cloud migration process and helps you adhere to best practices. You can also use the solution to provide value after the migration, enhancing daily operations such as ongoing infrastructure and cost optimization. Although we primarily focused on Terraform in this post, these principles can also enhance your AWS CloudFormation deployments, providing a versatile solution for your infrastructure needs.

Ready to simplify your cloud migration process with generative AI in Amazon Bedrock? Begin by exploring the Amazon Bedrock User Guide to understand how it can streamline your organization’s cloud journey. For further assistance and expertise, consider using AWS Professional Services to help you streamline your cloud migration journey and maximize the benefits of Amazon Bedrock.

Unlock the potential for rapid, secure, and efficient cloud adoption with Amazon Bedrock. Take the first step today and discover how it can enhance your organization’s cloud transformation endeavors.


About the Author

Ebbey Thomas specializes in strategizing and developing custom AWS Landing Zone resources with a focus on using generative AI to enhance cloud infrastructure automation. In his role at AWS Professional Services, Ebbey’s expertise is central to architecting solutions that streamline cloud adoption, providing a secure and efficient operational framework for AWS users. He is known for his innovative approach to cloud challenges and his commitment to driving forward the capabilities of cloud services.

Read More

Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock

Live Meeting Assistant with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock

See CHANGELOG for latest features and fixes.

You’ve likely experienced the challenge of taking notes during a meeting while trying to pay attention to the conversation. You’ve probably also experienced the need to quickly fact-check something that’s been said, or look up information to answer a question that’s just been asked in the call. Or maybe you have a team member that always joins meetings late, and expects you to send them a quick summary over chat to catch them up.

Then there are the times that others are talking in a language that’s not your first language, and you’d love to have a live translation of what people are saying to make sure you understand correctly.

And after the call is over, you usually want to capture a summary for your records, or to send to the participants, with a list of all the action items, owners, and due dates.

All of this, and more, is now possible with our newest sample solution, Live Meeting Assistant (LMA).

Check out the following demo to see how it works.

In this post, we show you how to use LMA with Amazon Transcribe, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock.

Solution overview

The LMA sample solution captures speaker audio and metadata from your browser-based meeting app (as of this writing, Zoom and Chime are supported), or audio only from any other browser-based meeting app, softphone, or audio source. It uses Amazon Transcribe for speech to text, Knowledge Bases for Amazon Bedrock for contextual queries against your company’s documents and knowledge sources, and Amazon Bedrock models for customizable transcription insights and summaries.

Everything you need is provided as open source in our GitHub repo. It’s straightforward to deploy in your AWS account. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things LMA can do:

  • Live transcription with speaker attribution – LMA is powered by Amazon Transcribe ASR models for low-latency, high-accuracy speech to text. You can teach it brand names and domain-specific terminology if needed, using custom vocabulary and custom language model features in Amazon Transcribe.
  • Live translation – It uses Amazon Translate to optionally show each segment of the conversation translated into your language of choice, from a selection of 75 languages.
  • Context-aware meeting assistant – It uses Knowledge Bases for Amazon Bedrock to provide answers from your trusted sources, using the live transcript as context for fact-checking and follow-up questions. To activate the assistant, just say “Okay, Assistant,” choose the ASK ASSISTANT! button, or enter your own question in the UI.
  • On-demand summaries of the meeting – With the click of a button on the UI, you can generate a summary, which is useful when someone joins late and needs to get caught up. The summaries are generated from the transcript by Amazon Bedrock. LMA also provides options for identifying the current meeting topic, and for generating a list of action items with owners and due dates. You can also create your own custom prompts and corresponding options.
  • Automated summary and insights – When the meeting has ended, LMA automatically runs a set of large language model (LLM) prompts on Amazon Bedrock to summarize the meeting transcript and extract insights. You can customize these prompts as well.
  • Meeting recording – The audio is (optionally) stored for you, so you can replay important sections on the meeting later.
  • Inventory list of meetings – LMA keeps track of all your meetings in a searchable list.
  • Browser extension captures audio and meeting metadata from popular meeting apps – The browser extension captures meeting metadata—the meeting title and names of active speakers—and audio from you (your microphone) and others (from the meeting browser tab). As of this writing, LMA supports Chrome for the browser extension, and Zoom and Chime for meeting apps (with Teams and WebEx coming soon). Standalone meeting apps don’t work with LMA —instead, launch your meetings in the browser.

You are responsible for complying with legal, corporate, and ethical restrictions that apply to recording meetings and calls. Do not use this solution to stream, record, or transcribe calls if otherwise prohibited.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need an existing knowledge base in Amazon Bedrock. If you haven’t set one up yet, see Create a knowledge base. Populate your knowledge base with content to power LMA’s context-aware meeting assistant.

Finally, LMA uses Amazon Bedrock LLMs for its meeting summarization features. Before proceeding, if you have not previously done so, you must request access to the following Amazon Bedrock models:

  • Titan Embeddings G1 – Text
  • Anthropic: All Claude models

Deploy the solution using AWS CloudFormation

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose Launch Stack for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Region Launch Stack
US East (N. Virginia)
US West (Oregon)
  1. For Stack name, use the default value, LMA.
  2. For Admin Email Address, use a valid email address—your temporary password is emailed to this address during the deployment.
  3. For Authorized Account Email Domain, use the domain name part of your corporate email address to allow users with email addresses in the same domain to create their own new UI accounts, or leave blank to prevent users from directly creating their own accounts. You can enter multiple domains as a comma-separated list.
  4. For MeetingAssistService, choose BEDROCK_KNOWLEDGE_BASE (the only available option as of this writing).
  5. For Meeting Assist Bedrock Knowledge Base Id (existing), enter your existing knowledge base ID (for example, JSXXXXX3D8). You can copy it from the Amazon Bedrock console.
  6. For all other parameters, use the default values.

If you want to customize the settings later, for example to add your own AWS Lambda functions, use custom vocabularies and language models to improve accuracy, enable personally identifiable information (PII) redaction, and more, you can update the stack for these parameters.

  1. Select the acknowledgement check boxes, then choose Create stack.

The main CloudFormation stack uses nested stacks to create the following resources in your AWS account:

The stacks take about 35–40 minutes to deploy. The main stack status shows CREATE_COMPLETE when everything is deployed.

Set your password

After you deploy the stack, open the LMA web user interface and set your password by completing the following steps:

  1. Open the email you received, at the email address you provided, with the subject “Welcome to Live Meeting Assistant!”
  2. Open your web browser to the URL shown in the email. You’re directed to the login page.
  3. The email contains a generated temporary password that you use to log in and create your own password. Your user name is your email address.
  4. Set a new password.

Your new password must have a length of at least eight characters, and contain uppercase and lowercase characters, plus numbers and special characters.

  1. Follow the directions to verify your email address, or choose Skip to do it later.

You’re now logged in to LMA.

You also received a similar email with the subject “QnABot Signup Verification Code.” This email contains a generated temporary password that you use to log in and create your own password in the QnABot designer. You use QnABot designer only if you want to customize LMA options and prompts. Your username for QnABot is Admin. You can set your permanent QnABot Admin password now, or keep this email safe in case you want to customize things later.

Download and install the Chrome browser extension

For the best meeting streaming experience, install the LMA browser plugin (currently available for Chrome):

  1. Choose Download Chrome Extension to download the browser extension .zip file (lma-chrome-extension.zip).
  2. Choose (right-click) and expand the .zip file (lma-chrome-extension.zip) to create a local folder named lma-chrome-extension.
  3. Open Chrome and enter the link chrome://extensions into the address bar.
  4. Enable Developer mode.
  5. Choose Load unpacked, navigate to the lma-chrome-extension folder (which you unzipped from the download), and choose Select. This loads your extension.
  6. Pin the new LMA extension to the browser tool bar for easy access—you will use it often to stream your meetings!

Start using LMA

LMA provides two streaming options:

  • Chrome browser extension – Use this to stream audio and speaker metadata from your meeting browser app. It currently works with Zoom and Chime, but we hope to add more meeting apps.
  • LMA Stream Audio tab – Use this to stream audio from your microphone and any Chrome browser-based meeting app, softphone, or audio application.

We show you how to use both options in the following sections.

Use the Chrome browser extension to stream a Zoom call

Complete the following steps to use the browser extension:

  1. Open the LMA extension and log in with your LMA credentials.
  2. Join or start a Zoom meeting in your web browser (do not use the separate Zoom client).

If you already have the Zoom meeting page loaded, reload it.

The LMA extension automatically detects that Zoom is running in the browser tab, and populates your name and the meeting name.

  1. Tell others on the call that you are about to start recording the call using LMA and obtain their permission. Do not proceed if participants object.
  2. Choose Start Listening.
  3. Read and accept the disclaimer, and choose Allow to share the browser tab.

The LMA extension automatically detects and displays the active speaker on the call. If you are alone in the meeting, invite some friends to join, and observe that the names they used to join the call are displayed in the extension when they speak, and are attributed to their words in the LMA transcript.

  1. Choose Open in LMA to see your live transcript in a new tab.
  2. Choose your preferred transcript language, and interact with the meeting assistant using the wake phrase “OK Assistant!” or the Meeting Assist Bot pane.

The ASK ASSISTANT button asks the meeting assistant service (Amazon Bedrock knowledge base) to suggest a good response based on the transcript of the recent interactions in the meeting. Your mileage may vary, so experiment!

  1. When you are done, choose Stop Streaming to end the meeting in LMA.

Within a few seconds, the automated end-of-meeting summaries appear, and the audio recording becomes available. You can continue to use the bot after the call has ended.

Use the LMA UI Stream Audio tab to stream from your microphone and any browser-based audio application

The browser extension is the most convenient way to stream metadata and audio from supported meeting web apps. However, you can also use LMA to stream just the audio from any browser-based softphone, meeting app, or other audio source playing in your Chrome browser, using the convenient Stream Audio tab that is built into the LMA UI.

  1. Open any audio source in a browser tab.

For example, this could be a softphone (such as Google Voice), another meeting app, or for demo purposes, you can simply play a local audio recording or a YouTube video in your browser to emulate another meeting participant. If you just want to try it, open the following YouTube video in a new tab.

  1. In the LMA App UI, choose Stream Audio (no extension) to open the Stream Audio tab.
  2. For Meeting ID, enter a meeting ID.
  3. For Name, enter a name for yourself (applied to audio from your microphone).
  4. For Participant Name(s), enter the names of the participants (applied to the incoming audio source).
  5. Choose Start Streaming.
  6. Choose the browser tab you opened earlier, and choose Allow to share.
  7. Choose the LMA UI tab again to view your new meeting ID listed, showing the meeting as In Progress.
  8. Choose the meeting ID to open the details page, and watch the transcript of the incoming audio, attributed to the participant names that you entered. If you speak, you’ll see the transcription of your own voice.

Use the Stream Audio feature to stream from any softphone app, meeting app, or any other streaming audio playing in the browser, along with your own audio captured from your selected microphone. Always obtain permission from others before recording them using LMA, or any other recording application.

Processing flow overview

How did LMA transcribe and analyze your meeting? Let’s look at how it works. The following diagram shows the main architectural components and how they fit together at a high level.

The LMA user joins a meeting in their browser, enables the LMA browser extension, and authenticates using their LMA credentials. If the meeting app (for example, Zoom.us) is supported by the LMA extension, the user’s name, meeting name, and active speaker names are automatically detected by the extension. If the meeting app is not supported by the extension, then the LMA user can manually enter their name and the meeting topic—active speakers’ names will not be detected.

After getting permission from other participants, the LMA user chooses Start Listening on the LMA extension pane. A secure WebSocket connection is established to the preconfigured LMA stack WebSocket URL, and the user’s authentication token is validated. The LMA browser extension sends a START message to the WebSocket containing the meeting metadata (name, topic, and so on), and starts streaming two-channel audio from the user’s microphone and the incoming audio channel containing the voices of the other meeting participants. The extension monitors the meeting app to detect active speaker changes during the call, and sends that metadata to the WebSocket, enabling LMA to label speech segments with the speaker’s name.

The WebSocket server running in Fargate consumes the real-time two-channel audio fragments from the incoming WebSocket stream. The audio is streamed to Amazon Transcribe, and the transcription results are written in real time to Kinesis Data Streams.

Each meeting processing session runs until the user chooses Stop Listening in the LMA extension pane, or ends the meeting and closes the tab. At the end of the call, the function creates a stereo recording file in Amazon S3 (if recording was enabled when the stack was deployed).

A Lambda function called the Call Event Processor, fed by Kinesis Data Streams, processes and optionally enriches meeting metadata and transcription segments. The Call Event Processor integrates with the meeting assist services. LMA is powered by Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs using the open source QnABot on AWS solution for answers based on FAQs and as an orchestrator for request routing to the appropriate AI service. The Call Event Processor also invokes the Transcript Summarization Lambda function when the call ends, to generate a summary of the call from the full transcript.

The Call Event Processor function interfaces with AWS AppSync to persist changes (mutations) in Amazon DynamoDB and send real-time updates to the LMA user’s logged-in web clients (conveniently opened by choosing the Open in LMA option in the browser extension).

The LMA web UI assets are hosted on Amazon S3 and served via CloudFront. Authentication is provided by Amazon Cognito.

When the user is authenticated, the web application establishes a secure GraphQL connection to the AWS AppSync API, and subscribes to receive real-time events such as new calls and call status changes for the meetings list page, and new or updated transcription segments and computed analytics for the meeting details page. When translation is enabled, the web application also interacts securely with Amazon Translate to translate the meeting transcription into the selected language.

The entire processing flow, from ingested speech to live webpage updates, is event driven, and the end-to-end latency is short—typically just a few seconds.

Monitoring and troubleshooting

AWS CloudFormation reports deployment failures and causes on the relevant stack’s Events tab. See Troubleshooting CloudFormation for help with common deployment problems. Look out for deployment failures caused by limit exceeded errors; the LMA stacks create resources that are subject to default account and Region service quotas, such as elastic IP addresses and NAT gateways. When troubleshooting CloudFormation stack failures, always navigate into any failed nested stacks to find the first nested resource failure reported—this is almost always the root cause.

Amazon Transcribe has a default limit of 25 concurrent transcription streams, which limits LMA to 25 concurrent meetings in a given AWS account or Region. Request an increase for the number of concurrent HTTP/2 streams for streaming transcription if you have many users and need to handle a larger number of concurrent meetings in your account.

LMA provides runtime monitoring and logs for each component using CloudWatch:

  • WebSocket processing and transcribing Fargate task – On the Amazon Elastic Container Service (Amazon ECS) console, navigate to the Clusters page and open the LMA-WEBSOCKETSTACK-xxxx-TranscribingCluster function. Choose the Tasks tab and open the task page. Choose Logs and View in CloudWatch to inspect the WebSocket transcriber task logs.
  • Call Event Processor Lambda function – On the Lambda console, open the LMA-AISTACK-CallEventProcessor function. Choose the Monitor tab to see function metrics. Choose View logs in CloudWatch to inspect function logs.
  • AWS AppSync API – On the AWS AppSync console, open the CallAnalytics-LMA API. Choose Monitoring in the navigation pane to see API metrics. Choose View logs in CloudWatch to inspect AWS AppSync API logs.

For QnABot on AWS for Meeting Assist, refer to the Meeting Assist README, and the QnABot solution implementation guide for additional information.

Cost assessment

LMA provides a WebSocket server using Fargate (2vCPU) and VPC networking resources costing about $0.10/hour (approximately $72/month). For more details, see AWS Fargate Pricing.

LMA is enabled using QnABot and Knowledge Bases for Amazon Bedrock. You create your own knowledge base, which you use for LMA and potentially other use cases. For more details, see Amazon Bedrock Pricing. Additional AWS services used by the QnABot solution cost about $0.77/hour. For more details, refer to the list of QnABot on AWS solution costs.

The remaining solution costs are based on usage.

The usage costs add up to about $0.17 for a 5-minute call, although this can vary based on options selected (such as translation), number of LLM summarizations, and total usage because usage affects Free Tier eligibility and volume tiered pricing for many services. For more information about the services that incur usage costs, see the following:

To explore LMA costs for yourself, use AWS Cost Explorer or choose Bill Details on the AWS Billing Dashboard to see your month-to-date spend by service.

Customize your deployment

Use the following CloudFormation template parameters when creating or updating your stack to customize your LCA deployment:

  • To use your own S3 bucket for meeting recordings, use Call Audio Recordings Bucket Name and Audio File Prefix.
  • To redact PII from the transcriptions, set Enable Content Redaction for Transcripts to true, and adjust Transcription PII Redaction Entity Types as needed. For more information, see Redacting or identifying PII in a real-time stream.
  • To improve transcription accuracy for technical and domain-specific acronyms and jargon, set Transcription Custom Vocabulary Name to the name of a custom vocabulary that you already created in Amazon Transcribe or set Transcription Custom Language Model Name to the name of a previously created custom language model. For more information, see Improving Transcription Accuracy.
  • To transcribe meetings in a supported language other than US English, choose the desired value for Language for Transcription.
  • To customize transcript processing, optionally set Lambda Hook Function ARN for Custom Transcript Segment Processing to the ARN of your own Lambda function. For more information, see Using a Lambda function to optionally provide custom logic for transcript processing.
  • To customize the meeting assist capabilities based on the QnABot on AWS solution, Amazon Lex, Amazon Bedrock, and Knowledge Bases for Amazon Bedrock integration, see the Meeting Assist README.
  • To customize transcript summarization by configuring LMA to call your own Lambda function, see Transcript Summarization LAMBDA option.
  • To customize transcript summarization by modifying the default prompts or adding new ones, see Transcript Summarization.
  • To change the retention period, set Record Expiration In Days to the desired value. All call data is permanently deleted from the LMA DynamoDB storage after this period. Changes to this setting apply only to new calls received after the update.

LMA is an open source project. You can fork the LMA GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

Update an existing LMA stack

You can update your existing LMA stack to the latest release. For more details, see Update an existing stack.

Clean up

Congratulations! You have completed all the steps for setting up your live call analytics sample solution using AWS services.

When you’re finished experimenting with this sample solution, clean up your resources by using the AWS CloudFormation console to delete the LMA stacks that you deployed. This deletes resources that were created by deploying the solution. The recording S3 buckets, DynamoDB table, and CloudWatch log groups are retained after the stack is deleted to avoid deleting your data.

Live Call Analytics: Companion solution

Our companion solution, Live Call Analytics and Agent Assist (LCA), offers real-time transcription and analytics for contact centers (phone calls) rather than meetings. There are many similarities—in fact, LMA was built using an architecture and many components derived from LCA.

Conclusion

The Live Meeting Assistant sample solution offers a flexible, feature-rich, and customizable approach to provide live meeting assistance to improve your productivity during and after meetings. It uses Amazon AI/ML services like Amazon Transcribe, Amazon Lex, Knowledge Bases for Amazon Bedrock, and Amazon Bedrock LLMs to transcribe and extract real-time insights from your meeting audio.

The sample LMA application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Browse to the LMA GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.

For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the LMA GitHub repository.


About the authors

Bob Strahan Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Chris Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California and enjoys gardening, aerospace, and traveling the world.

Babu Srinivasan is a Sr. Specialist SA – Language AI services in the World Wide Specialist organization at AWS, with over 24 years of experience in IT and the last 6 years focused on the AWS Cloud. He is passionate about AI/ML. Outside of work, he enjoys woodworking and entertains friends and family (sometimes strangers) with sleight of hand card magic.

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Picture of Gillian ArmstrongGillian Armstrong is a Builder Solutions Architect. She is excited about how the Cloud is opening up opportunities for more people to use technology to solve problems, and especially excited about how cognitive technologies, like conversational AI, are allowing us to interact with computers in more human ways.

Read More

Meta Llama 3 models are now available in Amazon SageMaker JumpStart

Meta Llama 3 models are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Meta Llama 3 foundation models are available through Amazon SageMaker JumpStart to deploy and run inference. The Llama 3 models are a collection of pre-trained and fine-tuned generative text models.

In this post, we walk through how to discover and deploy Llama 3 models via SageMaker JumpStart.

What is Meta Llama 3

Llama 3 comes in two parameter sizes — 8B and 70B with 8k context length — that can support a broad range of use cases with improvements in reasoning, code generation, and instruction following. Llama 3 uses a decoder-only transformer architecture and new tokenizer that provides improved model performance with 128k size. In addition, Meta improved post-training procedures that substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. You can now derive the combined advantages of Llama 3 performance and MLOps controls with Amazon SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs. In addition, the model will be deployed in an AWS secure environment under your VPC controls, helping provide data security.

What is SageMaker JumpStart

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy Llama 3 models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping provide data security. Llama 3 models are available today for deployment and inferencing in Amazon SageMaker Studio in us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-2 (Oregon), eu-west-1 (Ireland) and ap-northeast-1 (Tokyo) AWS Regions.

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can easily discover various models by browsing through different hubs which are named after model providers. You can find Llama 3 models in Meta hub. If you do not see Llama 3 models, please update your SageMaker Studio version by shutting down and restarting. For more information, refer to Shut down and Update Studio Classic Apps.

You can find Llama 3 models by searching for “Meta-llama-3“ from the search box located at top left.

You can discover all Meta models available in SageMaker JumpStart by clicking on Meta hub.

Clicking on a model card opens the corresponding model detail page, from which you can easily Deploy the model.

Deploy a model

When you choose Deploy and acknowledge the EULA terms, deployment will start.

You can monitor progress of the deployment on the page that shows up after clicking the Deploy button.

Alternatively, you can choose Open notebook to deploy through the example notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using the notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id = "meta-textgeneration-llama-3-70b-instruct")
predictor = model.deploy(accept_eula=False)

By default accept_eula is set to False. You need to manually accept the EULA to deploy the endpoint successfully, By doing so, you accept the user license agreement and acceptable use policy. You can also find the license agreement Llama website. This deploys the model on SageMaker with default configurations including the default instance type and default VPC configurations. You can change these configuration by specifying non-default values in JumpStartModel. To learn more, please refer to the following documentation.

The following table lists all the Llama 3 models available in SageMaker JumpStart along with the model_ids, default instance types and maximum number of total tokens (sum of the number of input tokens and number of generated tokens) supported for each of these models.

Model Name Model ID Max Total Tokens Default instance type
Meta-Llama-3-8B meta-textgeneration-llama-3-8B 8192 ml.g5.12xlarge
Meta-Llama-3-8B-Instruct meta-textgeneration-llama-3-8B-instruct 8192 ml.g5.12xlarge
Meta-Llama-3-70B meta-textgeneration-llama-3-70b 8192 ml.p4d.24xlarge
Meta-Llama-3-70B-Instruct meta-textgeneration-llama-3-70b-instruct 8192 ml.p4d.24xlarge

Run inference

After you deploy the model, you can run inference against the deployed endpoint through SageMaker predictor. Fine-tuned instruct models (Llama 3: 8B Instruct and 70B Instruct) accept a history of chats between the user and the chat assistant, and generate the subsequent chat. The pre-trained models (Llama 3: 8B and 70B) require a string prompt and perform text completion on the provided prompt.

predictor.predict(payload)

Inference parameters control the text generation process at the endpoint. The Max new tokens control the size of the output generated by the model. This is not same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. The temperature parameter controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.

Example prompts for the 70B model

You can use Llama 3 models for text completion for any piece of text. Through text generation, you can perform a variety of tasks such as question answering, language translation, and sentiment analysis, and more. The input payload to the endpoint looks like the following code:

payload = {
    “inputs”: str,
   (optional)"parameters":{"max_new_tokens":int, "top_p":float, "temperature":float}
}

The following are some sample example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":64, "top_p":0.9, "temperature":0.6}.

Input: Simply put, the theory of relativity states that
Output: 1) the laws of physics are the same for all observers, and 2) the speed of light is constant for all observers. The first part of the theory is known as the special theory of relativity, and the second part is known as the general theory of relativity.nThe special theory of relativity states

Input: I believe the meaning of life is
Output:  to live a life of purpose, on purpose. I believe that we are all here to make a difference in the world, and that we are all here to make a difference in the lives of others. I believe that we are all here to make a difference in the world, and that we are all here to mak

Input: A brief message congratulating the team on the launch:nnHi everyone,nnI just 
Output: wanted to say congratulations on the launch of the new website! It looks great and I'm sure it will be a valuable resource for our customers. I know that a lot of hard work went into making this happen, so thank you all for your efforts.nnBest regards,nJohn Doen

In the next example, we show how to use Llama 3 models with few shot in-context learning where we provide training samples available to the model. We only run inference on the deployed model and during this process, and model weights do not change.

Input:
Translate English to French:nsea otter => loutre de mernpeppermint => menthe poivréenplush girafe => girafe peluchencheese =>
Output:  fromagenpanda => pandanpenguin => pingouinnpolar bear => ours polairenpuffin => macareuxnsloth => paresseuxnwalrus => morsenwhale => baleinenzebra => zu00e8brenTranslate English to German:n

Example prompts for the 70B-Instruct model

With Llama 3 instruct models which are optimized for dialogue use cases, the input to the instruct model endpoints is the previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas, which define the chat assistant’s behavior. While the input payload format is the same as the base pre-trained model, the input text should be formatted in the following manner:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

[SYSTEM_TEXT]<|eot_id|><|start_header_id|>user<|end_header_id|>

[USER_TEXT]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

In this instruction template, you can optionally start with a system role and include as many alternating roles as desired in the turn-based history. The final role should always be assistant and end with two new line feeds.

Next, consider a few example prompts and responses from the model. In the following example, the user is asking a simple question to the assistant.

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>nnwhat is the recipe of mayonnaise?<|eot_id|><|start_header_id|>assistant<|end_header_id|>nn
Output: The classic condiment! Mayonnaise is a thick, creamy emulsion sauce made from a combination of oil, egg yolks, acid (such as vinegar or lemon juice), and seasonings. Here's a simple recipe to make mayonnaise at home:nn**Ingredients:**nn* 2 egg yolksn* 1 tablespoon lemon juice or vinegar (such as apple cider vinegar or white wine vinegar)n* 1/2 teaspoon Dijon mustard (optional, but recommended for flavor)n* 1/2 cup (120 ml) neutral-tasting oil, such as canola, grapeseed, or sunflower oiln* Salt, to tastenn**Instructions:**nn1. **Start with room temperature ingredients**: This is crucial for emulsification to occur.n2. **In a medium-sized bowl**, whisk together the egg yolks, lemon juice or vinegar, and Dijon mustard (if using) until well combined.n3. **Slowly add the oil**: While continuously whisking the egg yolk mixture, slowly pour in the oil in a thin, steady stream. Start with a very slow drizzle and gradually increase the flow as the mixture thickens.n4. **Whisk constantly**: Keep whisking until the mixture has doubled in volume and has a thick, creamy consistency. This should take about 5-7 minutes.n5. **Check the consistency**: Stop whisking when the mayonnaise has reached the desired consistency. It should be smooth, thick, and hold its shape when the whisk is lifted.n6. **Season with salt**: Add salt to taste, and whisk until well combined.n7. **Taste and adjust**: If the mayonnaise is too thick, add a little water. If it's not thick enough, add a little more oil.nn**Tips and Variations:**nn* Use a neutral-tasting oil to avoid flavoring the mayonnaise.n* If you're using a stand mixer, use the whisk attachment and start with a slow speed, increasing as needed.n* Add flavorings like minced garlic, chopped fresh herbs (e.g., parsley, dill, or tarragon), or grated ginger to create different flavor profiles.n* For a lighter mayonnaise, use only 1 egg yolk or add a little water to thin it out.n* If the mayonnaise breaks or separates, don't worry! Simply start again with a new egg yolk and slowly whisk in the broken mayonnaise.nn**Homemade Mayonnaise vs. Store-Bought:**nnWhile store-bought mayonnaise is convenient, homemade mayonnaise offers several advantages:nn* Freshness: Homemade mayonnaise is made to order, ensuring optimal flavor and texture.n* Customization: You can tailor the flavor to your liking with various seasonings and ingredients.n* No preservatives: Homemade mayonnaise doesn't contain added preservatives, making it a healthier option.nnEnjoy your freshly made mayonnaise!

In the following example, the user has a conversation with the assistant about tourist sites in Paris. Then the user inquires  about the first option recommended by the chat assistant.

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>nnI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>nnParis, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:nn1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.nnThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.<|eot_id|><|start_header_id|>user<|end_header_id|>nnWhat is so great about #1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>nn
Output: The Eiffel Tower! It's an iconic symbol of Paris and one of the most recognizable landmarks in the world. Here are some reasons why it's so great:nn1. **Engineering marvel**: When it was built for the 1889 World's Fair, the Eiffel Tower was a groundbreaking feat of engineering. It was the tallest structure in the world at the time, standing at 324 meters (1,063 feet).n2. **Panoramic views**: The Eiffel Tower offers breathtaking 360-degree views of the city from its observation decks. You can see many of Paris' famous landmarks, such as the Seine River, the Louvre, and Montmartre.n3. **Romantic atmosphere**: The Eiffel Tower is often associated with romance and is a popular spot for proposals, weddings, and romantic dinners. The tower's iron latticework and twinkling lights at night add to its romantic ambiance.n4. **Historical significance**: The Eiffel Tower has played a significant role in French history, serving as a communication tower during World War I and a symbol of French culture and engineering.n5. **Accessibility**: The Eiffel Tower is easily accessible by public transportation or on foot, and there are various ways to experience it, including taking the stairs, elevator, or a guided tour.n6. **Nighttime spectacle**: The Eiffel Tower is even more magical at night, when it's illuminated with thousands of twinkling lights. The light show, known as the "Eiffel Tower Sparkle," takes place every hour on the hour from dusk to midnight.n7. **Iconic status**: The Eiffel Tower has been featured in countless films, literature, and art, making it an instantly recognizable symbol of Paris and France.nnOverall, the Eiffel Tower is a must-visit attraction in Paris, offering a unique combination of history, engineering, romance, and stunning views.

In the following examples, we set the system’s configuration.

Input: <|begin_of_text|><|start_header_id|>system<|end_header_id|>nnAlways answer with Haiku<|eot_id|><|start_header_id|>user<|end_header_id|>nnI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>nn
Output: Eiffel's iron lacenRiver Seine's gentle flow bynMontmartre's charm calls
Input: <|begin_of_text|><|start_header_id|>system<|end_header_id|>nnAlways answer with emojis<|eot_id|><|start_header_id|>user<|end_header_id|>nnHow to go from Beijing to NY?<|eot_id|><|start_header_id|>assistant<|end_header_id|>nn
Output: ✈🛬🇺🇸

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Llama 3 models in SageMaker Studio. You now have access to four Llama 3 foundation models that contain billions of parameters. Because foundation models are pretrained, they can also help lower training and infrastructure costs and enable customization for your use case. Check out SageMaker JumpStart in SageMaker Studio now to get started.


About Authors

Kyle Ulrich is an Applied Scientist II at AWS
Xin Huang is a Senior Applied Scientist at AWS
Qing Lan is a Senior Software Developer Engineer at AWS
Haotian An is a Software Developer Engineer II at AWS
Christopher Whitten is a Software Development Engineer II at AWS
Tyler Osterberg is a Software Development Engineer I at AWS
Manan Shah is a Software Development Manager at AWS
Jonathan Guinegagne is a Senior Software Developer Engineer at AWS
Adriana Simmons is a Senior Product Marketing Manager at AWS
June Won is a Senior Product Manager at AWS
Ashish Khetan is a Senior Applied Scientist at AWS
Rachna Chadha is a Principal Solution Architect at AWS
Deepak Rupakula is a Principal GTM Specialist at AWS

Read More

Slack delivers native and secure generative AI powered by Amazon SageMaker JumpStart

Slack delivers native and secure generative AI powered by Amazon SageMaker JumpStart

This post is co-authored by Jackie Rocca, VP of Product, AI at Slack

Slack is where work happens. It’s the AI-powered platform for work that connects people, conversations, apps, and systems together in one place. With the newly launched Slack AI—a trusted, native, generative artificial intelligence (AI) experience available directly in Slack—users can surface and prioritize information so they can find their focus and do their most productive work.

We are excited to announce that Slack, a Salesforce company, has collaborated with Amazon SageMaker JumpStart to power Slack AI’s initial search and summarization features and provide safeguards for Slack to use large language models (LLMs) more securely. Slack worked with SageMaker JumpStart to host industry-leading third-party LLMs so that data is not shared with the infrastructure owned by third party model providers.

This keeps customer data in Slack at all times and upholds the same security practices and compliance standards that customers expect from Slack itself. Slack is also using Amazon SageMaker inference capabilities for advanced routing strategies to scale the solution to customers with optimal performance, latency, and throughput.

“With Amazon SageMaker JumpStart, Slack can access state-of-the-art foundation models to power Slack AI, while prioritizing security and privacy. Slack customers can now search smarter, summarize conversations instantly, and be at their most productive.”

– Jackie Rocca, VP Product, AI at Slack

Foundation models in SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select foundation models (FMs) quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation. Pretrained models are fully customizable for your use case with your data, and you can effortlessly deploy them into production with the user interface or SDK. In addition, you can access prebuilt solutions to solve common use cases and share ML artifacts, including ML models and notebooks, within your organization to accelerate ML model building and deployment. None of your data is used to train the underlying models. All the data is encrypted and is never shared with third-party vendors so you can trust that your data remains private and confidential.

Check out the SageMaker JumpStart model page for available models.

Slack AI

Slack launched Slack AI to provide native generative AI capabilities so that customers can easily find and consume large volumes of information quickly, enabling them to get even more value out of their shared knowledge in Slack.  For example, users can ask a question in plain language and instantly get clear and concise answers with enhanced search. They can catch up on channels and threads in one click with conversation summaries. And they can access personalized, daily digests of what’s happening in select channels with the newly launched recaps.

Because trust is Slack’s most important value, Slack AI runs on an enterprise-grade infrastructure they built on AWS, upholding the same security practices and compliance standards that customers expect. Slack AI is built for security-conscious customers and is designed to be secure by design—customer data remains in-house, data is not used for LLM training purposes, and data remains siloed.

Solution overview

SageMaker JumpStart provides access to many LLMs, and Slack selects the right FMs that fit their use cases. Because these models are hosted on Slack’s owned AWS infrastructure, data sent to models during invocation doesn’t leave Slack’s AWS infrastructure. In addition, to provide a secure solution, data sent for invoking SageMaker models is encrypted in transit. The data sent to SageMaker JumpStart endpoints for invoking models is not used to train base models. SageMaker JumpStart allows Slack to support high standards for security and data privacy, while also using state-of-the-art models that help Slack AI perform optimally for Slack customers.

SageMaker JumpStart endpoints serving Slack business applications are powered by AWS instances. SageMaker supports a wide range of instance types for model deployment, which allows Slack to pick the instance that is best suited to support latency and scalability requirements of Slack AI use cases. Slack AI has access to multi-GPU based instances to host their SageMaker JumpStart models. Multiple GPU instances allow each instance backing Slack AI’s endpoint to host multiple copies of a model. This helps improve resource utilization and reduce model deployment cost. For more information, refer to Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

The following diagram illustrates the solution architecture.

To use the instances most effectively and support the concurrency and latency requirements, Slack used SageMaker-offered routing strategies with their SageMaker endpoints. By default, a SageMaker endpoint uniformly distributes incoming requests to ML instances using a round-robin algorithm routing strategy called RANDOM. However, with generative AI workloads, requests and responses can be extremely variable, and it’s desirable to load balance by considering the capacity and utilization of the instance rather than random load balancing. To effectively distribute requests across instances backing the endpoints, Slack uses the LEAST_OUTSTANDING_REQUESTS (LAR) routing strategy. This strategy routes requests to the specific instances that have more capacity to process requests instead of randomly picking any available instance. The LAR strategy provides more uniform load balancing and resource utilization. As a result, Slack AI noticed over a 39% latency decrease in their p95 latency numbers when enabling LEAST_OUTSTANDING_REQUESTS compared to RANDOM.

For more details on SageMaker routing strategies, see Minimize real-time inference latency by using Amazon SageMaker routing strategies.

Conclusion

Slack is delivering native generative AI capabilities that will help their customers be more productive and easily tap into the collective knowledge that’s embedded in their Slack conversations. With fast access to a large selection of FMs and advanced load balancing capabilities that are hosted in dedicated instances through SageMaker JumpStart, Slack AI is able to provide rich generative AI features in a more robust and quicker manner, while upholding Slack’s trust and security standards.

Learn more about SageMaker JumpStart, Slack AI and how the Slack team built Slack AI to be secure and private. Leave your thoughts and questions in the comments section.


About the Authors

Jackie Rocca is VP of Product at Slack, where she oversees the vision and execution of Slack AI, which brings generative AI natively and securely into Slack’s user experience. In her five years at Slack, Jackie has delivered on a number of initiatives to push Slack’s business forward. Now she’s on a mission to help customers accelerate their productivity and get even more value out of their conversations, data, and collective knowledge with generative AI. Prior to her time at Slack, Jackie was a Product Manager at Google for more than six years, where she helped launch and grow Youtube TV. Jackie is based in the San Francisco Bay Area.

Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Maninder (Mani) Kaur is the AI/ML Specialist lead for Strategic ISVs at AWS. With her customer-first approach, Mani helps strategic customers shape their AI/ML strategy, fuel innovation, and accelerate their AI/ML journey. Mani is a firm believer of ethical and responsible AI, and strives to ensure that her customers’ AI solutions align with these principles.

Gene Ting is a Principal Solutions Architect at AWS. He is focused on helping enterprise customers build and operate workloads securely on AWS. In his free time, Gene enjoys teaching kids technology and sports, as well as following the latest on cybersecurity.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Read More

Uncover hidden connections in unstructured financial data with Amazon Bedrock and Amazon Neptune

Uncover hidden connections in unstructured financial data with Amazon Bedrock and Amazon Neptune

In asset management, portfolio managers need to closely monitor companies in their investment universe to identify risks and opportunities, and guide investment decisions. Tracking direct events like earnings reports or credit downgrades is straightforward—you can set up alerts to notify managers of news containing company names. However, detecting second and third-order impacts arising from events at suppliers, customers, partners, or other entities in a company’s ecosystem is challenging.

For example, a supply chain disruption at a key vendor would likely negatively impact downstream manufacturers. Or the loss of a top customer for a major client poses a demand risk for the supplier. Very often, such events fail to make headlines featuring the impacted company directly, but are still important to pay attention to. In this post, we demonstrate an automated solution combining knowledge graphs and generative artificial intelligence (AI) to surface such risks by cross-referencing relationship maps with real-time news.

Broadly, this entails two steps: First, building the intricate relationships between companies (customers, suppliers, directors) into a knowledge graph. Second, using this graph database along with generative AI to detect second and third-order impacts from news events. For instance, this solution can highlight that delays at a parts supplier may disrupt production for downstream auto manufacturers in a portfolio though none are directly referenced.

With AWS, you can deploy this solution in a serverless, scalable, and fully event-driven architecture. This post demonstrates a proof of concept built on two key AWS services well suited for graph knowledge representation and natural language processing: Amazon Neptune and Amazon Bedrock. Neptune is a fast, reliable, fully managed graph database service that makes it straightforward to build and run applications that work with highly connected datasets. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Overall, this prototype demonstrates the art of possible with knowledge graphs and generative AI—deriving signals by connecting disparate dots. The takeaway for investment professionals is the ability to stay on top of developments closer to the signal while avoiding noise.

Build the knowledge graph

The first step in this solution is building a knowledge graph, and a valuable yet often overlooked data source for knowledge graphs is company annual reports. Because official corporate publications undergo scrutiny before release, the information they contain is likely to be accurate and reliable. However, annual reports are written in an unstructured format meant for human reading rather than machine consumption. To unlock their potential, you need a way to systematically extract and structure the wealth of facts and relationships they contain.

With generative AI services like Amazon Bedrock, you now have the capability to automate this process. You can take an annual report and trigger a processing pipeline to ingest the report, break it down into smaller chunks, and apply natural language understanding to pull out salient entities and relationships.

For example, a sentence stating that “[Company A] expanded its European electric delivery fleet with an order for 1,800 electric vans from [Company B]” would allow Amazon Bedrock to identify the following:

  • [Company A] as a customer
  • [Company B] as a supplier
  • A supplier relationship between [Company A] and [Company B]
  • Relationship details of “supplier of electric delivery vans”

Extracting such structured data from unstructured documents requires providing carefully crafted prompts to large language models (LLMs) so they can analyze text to pull out entities like companies and people, as well as relationships such as customers, suppliers, and more. The prompts contain clear instructions on what to look out for and the structure to return the data in. By repeating this process across the entire annual report, you can extract the relevant entities and relationships to construct a rich knowledge graph.

However, before committing the extracted information to the knowledge graph, you need to first disambiguate the entities. For instance, there may already be another ‘[Company A]’ entity in the knowledge graph, but it could represent a different organization with the same name. Amazon Bedrock can reason and compare the attributes such as business focus area, industry, and revenue-generating industries and relationships to other entities to determine if the two entities are actually distinct. This prevents inaccurately merging unrelated companies into a single entity.

After disambiguation is complete, you can reliably add new entities and relationships into your Neptune knowledge graph, enriching it with the facts extracted from annual reports. Over time, the ingestion of reliable data and integration of more reliable data sources will help build a comprehensive knowledge graph that can support revealing insights through graph queries and analytics.

This automation enabled by generative AI makes it feasible to process thousands of annual reports and unlocks an invaluable asset for knowledge graph curation that would otherwise go untapped due to the prohibitively high manual effort needed.

The following screenshot shows an example of the visual exploration that’s possible in a Neptune graph database using the Graph Explorer tool.

Process news articles

The next step of the solution is automatically enriching portfolio managers’ news feeds and highlighting articles relevant to their interests and investments. For the news feed, portfolio managers can subscribe to any third-party news provider through AWS Data Exchange or another news API of their choice.

When a news article enters the system, an ingestion pipeline is invoked to process the content. Using techniques similar to the processing of annual reports, Amazon Bedrock is used to extract entities, attributes, and relationships from the news article, which are then used to disambiguate against the knowledge graph to identify the corresponding entity in the knowledge graph.

The knowledge graph contains connections between companies and people, and by linking article entities to existing nodes, you can identify if any subjects are within two hops of the companies that the portfolio manager has invested in or is interested in. Finding such a connection indicates the article may be relevant to the portfolio manager, and because the underlying data is represented in a knowledge graph, it can be visualized to help the portfolio manager understand why and how this context is relevant. In addition to identifying connections to the portfolio, you can also use Amazon Bedrock to perform sentiment analysis on the entities referenced.

The final output is an enriched news feed surfacing articles likely to impact the portfolio manager’s areas of interest and investments.

Solution overview

The overall architecture of the solution looks like the following diagram.

The workflow consists of the following steps:

  1. A user uploads official reports (in PDF format) to an Amazon Simple Storage Service (Amazon S3) bucket. The reports should be officially published reports to minimize the inclusion of inaccurate data into your knowledge graph (as opposed to news and tabloids).
  2. The S3 event notification invokes an AWS Lambda function, which sends the S3 bucket and file name to an Amazon Simple Queue Service (Amazon SQS) queue. The First-In-First-Out (FIFO) queue makes sure that the report ingestion process is performed sequentially to reduce the likelihood of introducing duplicate data into your knowledge graph.
  3. An Amazon EventBridge time-based event runs every minute to start the run of an AWS Step Functions state machine asynchronously.
  4. The Step Functions state machine runs through a series of tasks to process the uploaded document by extracting key information and inserting it into your knowledge graph:
    1. Receive the queue message from Amazon SQS.
    2. Download the PDF report file from Amazon S3, split it into multiple smaller text chunks (approximately 1,000 words) for processing, and store the text chunks in Amazon DynamoDB.
    3. Use Anthropic’s Claude v3 Sonnet on Amazon Bedrock to process the first few text chunks to determine the main entity that the report is referring to, together with relevant attributes (such as industry).
    4. Retrieve the text chunks from DynamoDB and for each text chunk, invoke a Lambda function to extract out entities (such as company or person), and its relationship (customer, supplier, partner, competitor, or director) to the main entity using Amazon Bedrock.
    5. Consolidate all extracted information.
    6. Filter out noise and irrelevant entities (for example, generic terms such as “consumers”) using Amazon Bedrock.
    7. Use Amazon Bedrock to perform disambiguation by reasoning using the extracted information against the list of similar entities from the knowledge graph. If the entity does not exist, insert it. Otherwise, use the entity that already exists in the knowledge graph. Insert all relationships extracted.
    8. Clean up by deleting the SQS queue message and the S3 file.
  5. A user accesses a React-based web application to view the news articles that are supplemented with the entity, sentiment, and connection path information.
  6. Using the web application, the user specifies the number of hops (default N=2) on the connection path to monitor.
  7. Using the web application, the user specifies the list of entities to track.
  8. To generate fictional news, the user chooses Generate Sample News to generate 10 sample financial news articles with random content to be fed into the news ingestion process. Content is generated using Amazon Bedrock and is purely fictional.
  9. To download actual news, the user chooses Download Latest News to download the top news happening today (powered by NewsAPI.org).
  10. The news file (TXT format) is uploaded to an S3 bucket. Steps 8 and 9 upload news to the S3 bucket automatically, but you can also build integrations to your preferred news provider such as AWS Data Exchange or any third-party news provider to drop news articles as files into the S3 bucket. News data file content should be formatted as <date>{dd mmm yyyy}</date><title>{title}</title><text>{news content}</text>.
  11. The S3 event notification sends the S3 bucket or file name to Amazon SQS (standard), which invokes multiple Lambda functions to process the news data in parallel:
    1. Use Amazon Bedrock to extract entities mentioned in the news together with any related information, relationships, and sentiment of the mentioned entity.
    2. Check against the knowledge graph and use Amazon Bedrock to perform disambiguation by reasoning using the available information from the news and from within the knowledge graph to identify the corresponding entity.
    3. After the entity has been located, search for and return any connection paths connecting to entities marked with INTERESTED=YES in the knowledge graph that are within N=2 hops away.
  12. The web application auto refreshes every 1 second to pull out the latest set of processed news to display on the web application.

Deploy the prototype

You can deploy the prototype solution and start experimenting yourself. The prototype is available from GitHub and includes details on the following:

  • Deployment prerequisites
  • Deployment steps
  • Cleanup steps

Summary

This post demonstrated a proof of concept solution to help portfolio managers detect second- and third-order risks from news events, without direct references to companies they track. By combining a knowledge graph of intricate company relationships with real-time news analysis using generative AI, downstream impacts can be highlighted, such as production delays from supplier hiccups.

Although it’s only a prototype, this solution shows the promise of knowledge graphs and language models to connect dots and derive signals from noise. These technologies can aid investment professionals by revealing risks faster through relationship mappings and reasoning. Overall, this is a promising application of graph databases and AI that warrants exploration to augment investment analysis and decision-making.

If this example of generative AI in financial services is of interest to your business, or you have a similar idea, reach out to your AWS account manager, and we will be delighted to explore further with you.


About the Author

Xan Huang is a Senior Solutions Architect with AWS and is based in Singapore. He works with major financial institutions to design and build secure, scalable, and highly available solutions in the cloud. Outside of work, Xan spends most of his free time with his family and getting bossed around by his 3-year-old daughter. You can find Xan on LinkedIn.

Read More

Open source observability for AWS Inferentia nodes within Amazon EKS clusters

Open source observability for AWS Inferentia nodes within Amazon EKS clusters

Recent developments in machine learning (ML) have led to increasingly large models, some of which require hundreds of billions of parameters. Although they are more powerful, training and inference on those models require significant computational resources. Despite the availability of advanced distributed training libraries, it’s common for training and inference jobs to need hundreds of accelerators (GPUs or purpose-built ML chips such as AWS Trainium and AWS Inferentia), and therefore tens or hundreds of instances.

In such distributed environments, observability of both instances and ML chips becomes key to model performance fine-tuning and cost optimization. Metrics allow teams to understand workload behavior and optimize resource allocation and utilization, diagnose anomalies, and increase overall infrastructure efficiency. For data scientists, ML chips utilization and saturation are also relevant for capacity planning.

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips, used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

The pattern is part of the AWS CDK Observability Accelerator, a set of opinionated modules to help you set observability for Amazon EKS clusters. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open source observability set of patterns instruments observability with Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector to collect metrics, and Amazon Managed Service for Prometheus to store them.

Solution overview

The following diagram illustrates the solution architecture.

This solution deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is AL2_x86_64_GPU, which uses the Amazon EKS optimized accelerated Amazon Linux AMI. In addition to the standard Amazon EKS-optimized AMI configuration, the accelerated AMI includes the NeuronX runtime.

To access the ML chips from Kubernetes, the pattern deploys the AWS Neuron device plugin.

Metrics are exposed to Amazon Managed Service for Prometheus by the neuron-monitor DaemonSet, which deploys a minimal container, with the Neuron tools installed. Specifically, the neuron-monitor DaemonSet runs the neuron-monitor command piped into the neuron-monitor-prometheus.py companion script (both commands are part of the container):

neuron-monitor | neuron-monitor-prometheus.py --port <port>

The command uses the following components:

  • neuron-monitor collects metrics and stats from the Neuron applications running on the system and streams the collected data to stdout in JSON format
  • neuron-monitor-prometheus.py maps and exposes the telemetry data from JSON format into Prometheus-compatible format

Data is visualized in Amazon Managed Grafana by the corresponding dashboard.

The rest of the setup to collect and visualize metrics with Amazon Managed Service for Prometheus and Amazon Managed Grafana is similar to that used in other open source based patterns, which are included in the AWS Observability Accelerator for CDK GitHub repository.

Prerequisites

You need the following to complete the steps in this post:

Set up the environment

Complete the following steps to set up your environment:

  1. Open a terminal window and run the following commands:
export AWS_REGION=<YOUR AWS REGION>
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
  1. Retrieve the workspace IDs of any existing Amazon Managed Grafana workspace:
aws grafana list-workspaces

The following is our sample output:

{
  "workspaces": [
    {
      "authentication": {
        "providers": [
          "AWS_SSO"
        ]
      },
      "created": "2023-06-07T12:23:56.625000-04:00",
      "description": "accelerator-workspace",
      "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
      "grafanaVersion": "9.4",
      "id": "g-XYZ",
      "modified": "2023-06-07T12:30:09.892000-04:00",
      "name": "accelerator-workspace",
      "notificationDestinations": [
        "SNS"
      ],
      "status": "ACTIVE",
      "tags": {}
    }
  ]
}
  1. Assign the values of id and endpoint to the following environment variables:
export COA_AMG_WORKSPACE_ID="<<YOUR-WORKSPACE-ID, similar to the above g-XYZ, without quotation marks>>"
export COA_AMG_ENDPOINT_URL="<<https://YOUR-WORKSPACE-URL, including protocol (i.e. https://), without quotation marks, similar to the above https://g-XYZ.grafana-workspace.us-east-2.amazonaws.com>>"

COA_AMG_ENDPOINT_URL needs to include https://.

  1. Create a Grafana API key from the Amazon Managed Grafana workspace:
export AMG_API_KEY=$(aws grafana create-workspace-api-key 
--key-name "grafana-operator-key" 
--key-role "ADMIN" 
--seconds-to-live 432000 
--workspace-id $COA_AMG_WORKSPACE_ID 
--query key 
--output text)
  1. Set up a secret in AWS Systems Manager:
aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" 
--type "SecureString" 
--value $AMG_API_KEY 
--region $AWS_REGION

The secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the EKS cluster.

Bootstrap the AWS CDK environment

The first step to any AWS CDK deployment is bootstrapping the environment. You use the cdk bootstrap command in the AWS CDK CLI to prepare the environment (a combination of AWS account and AWS Region) with resources required by AWS CDK to perform deployments into that environment. AWS CDK bootstrapping is needed for each account and Region combination, so if you already bootstrapped AWS CDK in a Region, you don’t need to repeat the bootstrapping process.

cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains AWS CDK v2 code written in TypeScript.
git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git
cd cdk-aws-observability-accelerator

The actual settings for Grafana dashboard JSON files are expected to be specified in the AWS CDK context. You need to update context in the cdk.json file, located in the current directory. The location of the dashboard is specified by the fluxRepository.values.GRAFANA_NEURON_DASH_URL parameter, and neuronNodeGroup is used to set the instance type, number, and Amazon Elastic Block Store (Amazon EBS) size used for the nodes.

  1. Enter the following snippet into cdk.json, replacing context:
"context": {
    "fluxRepository": {
      "name": "grafana-dashboards",
      "namespace": "grafana-operator",
      "repository": {
        "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
        "name": "grafana-dashboards",
        "targetRevision": "main",
        "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
      },
      "values": {
        "GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
        "GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
        "GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
        "GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
        "GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
        "GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
        "GRAFANA_NEURON_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/neuron/neuron-monitor.json"
      },
      "kustomizations": [
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
        },
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
        }
      ]
    },
     "neuronNodeGroup": {
      "instanceClass": "inf1",
      "instanceSize": "2xlarge",
      "desiredSize": 1, 
      "minSize": 1, 
      "maxSize": 3,
      "ebsSize": 512
    }
  }

You can replace the Inf1 instance type with Inf2 and change the size as needed. To check availability in your selected Region, run the following command (amend Values as you see fit):

aws ec2 describe-instance-type-offerings 
--filters Name=instance-type,Values="inf1*" 
--query "InstanceTypeOfferings[].InstanceType" 
--region $AWS_REGION
  1. Install the project dependencies:
npm install
  1. Run the following commands to deploy the open source observability pattern:
make build
make pattern single-new-eks-inferentia-opensource-observability deploy

Validate the solution

Complete the following steps to validate the solution:

  1. Run the update-kubeconfig command. You should be able to get the command from the output message of the previous command:
aws eks update-kubeconfig --name single-new-eks-inferentia-opensource... --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-....
  1. Verify the resources you created:
kubectl get pods -A

The following screenshot shows our sample output.

  1. Make sure the neuron-device-plugin-daemonset DaemonSet is running:
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system

The following is our expected output:

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   1         1         1       1            1           <none>          2h
  1. Confirm that the neuron-monitor DaemonSet is running:
kubectl get ds neuron-monitor --namespace kube-system

The following is our expected output:

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
  1. To verify that the Neuron devices and cores are visible, run the neuron-ls and neuron-top commands from, for example, your neuron-monitor pod (you can get the pod’s name from the output of kubectl get pods -A):
kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-ls"

The following screenshot shows our expected output.

kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-top"

The following screenshot shows our expected output.

Visualize data using the Grafana Neuron dashboard

Log in to your Amazon Managed Grafana workspace and navigate to the Dashboards panel. You should see a dashboard named Neuron / Monitor.

To see some interesting metrics on the Grafana dashboard, we apply the following manifest:

curl https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/k8s-deployment-manifest-templates/neuron/pytorch-inference-resnet50.yml | kubectl apply -f -

This is a sample workload that compiles the torchvision ResNet50 model and runs repetitive inference in a loop to generate telemetry data.

To verify the pod was successfully deployed, run the following code:

kubectl get pods

You should see a pod named pytorch-inference-resnet50.

After a few minutes, looking into the Neuron / Monitor dashboard, you should see the gathered metrics similar to the following screenshots.

Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically.

Clean up

You can delete the whole AWS CDK stack with the following command:

make pattern single-new-eks-inferentia-opensource-observability destroy

Conclusion

In this post, we showed you how to introduce observability, with open source tooling, into an EKS cluster featuring a data plane running EC2 Inf1 instances. We started by selecting the Amazon EKS-optimized accelerated AMI for the data plane nodes, which includes the Neuron container runtime, providing access to AWS Inferentia and Trainium Neuron devices. Then, to expose the Neuron cores and devices to Kubernetes, we deployed the Neuron device plugin. The actual collection and mapping of telemetry data into Prometheus-compatible format was achieved via neuron-monitor and neuron-monitor-prometheus.py. Metrics were sourced from Amazon Managed Service for Prometheus and displayed on the Neuron dashboard of Amazon Managed Grafana.

We recommend that you explore additional observability patterns in the AWS Observability Accelerator for CDK GitHub repo. To learn more about Neuron, refer to the AWS Neuron Documentation.


About the Author

Riccardo Freschi is a Sr. Solutions Architect at AWS, focusing on application modernization. He works closely with partners and customers to help them transform their IT landscapes in their journey to the AWS Cloud by refactoring existing applications and building new ones.

Read More