Amazon AWS – Page 83

Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart

June 6, 2024

by Francesco Kruk Amazon AWS

Today, we are excited to announce that the Jina Embeddings v2 model, developed by Jina AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running model inference. This state-of-the-art model supports an impressive 8,192-tokens context length. You can deploy this model with SageMaker JumpStart, a machine learning (ML) hub with foundation models, built-in algorithms, and pre-built ML solutions that you can deploy with just a few clicks.

Text embedding refers to the process of transforming text into numerical representations that reside in a high-dimensional vector space. Text embeddings have a broad range of applications in enterprise artificial intelligence (AI), including the following:

Multimodal search for ecommerce
Content personalization
Recommender systems
Data analytics

Jina Embeddings v2 is a state-of-the-art collection of text embedding models, trained by Berlin-based Jina AI, that boast high performance on several public benchmarks.

In this post, we walk through how to discover and deploy the jina-embeddings-v2 model as part of a Retrieval Augmented Generation (RAG)-based question answering system in SageMaker JumpStart. You can use this tutorial as a starting point for a variety of chatbot-based solutions for customer service, internal support, and question answering systems based on internal and private documents.

What is RAG?

RAG is the process of optimizing the output of a large language model (LLM) so it references an authoritative knowledge base outside of its training data sources before generating a response.

LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

What does Jina Embeddings v2 bring to RAG applications?

A RAG system uses a vector database to serve as a knowledge retriever. It must extract a query from a user’s prompt and send it to a vector database to reliably find as much semantic information as possible. The following diagram illustrates the architecture of a RAG application with Jina AI and Amazon SageMaker.

Jina Embeddings v2 is the preferred choice for experienced ML scientists for the following reasons:

State-of-the-art performance – We have shown on various text embedding benchmarks that Jina Embeddings v2 models excel on tasks such as classification, reranking, summarization, and retrieval. Some of the benchmarks demonstrating their performance are MTEB, an independent study of combining embedding models with reranking models, and the LoCo benchmark by a Stanford University group.
Long input-context length – Jina Embeddings v2 models support 8,192 input tokens. This makes the models especially powerful at tasks such as clustering for long documents like legal text or product documentation.
Support for bilingual text input – Recent research shows that multilingual models without specific language training show strong biases towards English grammatical structures in embeddings. Jina AI’s bilingual embedding models include jina-embeddings-v2-base-de, jina-embeddings-v2-base-zh, jina-embeddings-v2-base-es, and jina-embeddings-v2-base-code. They were trained to encode texts in a combination of English-German, English-Chinese, English-Spanish, and English-Code, respectively, allowing the use of either language as the query or target document in retrieval applications.
Cost-effectiveness of operating – Jina Embeddings v2 provides high performance on information retrieval tasks with relatively small models and compact embedding vectors. For example, jina-embeddings-v2-base-de has a size of 322 MB with a performance score of 60.1%. A smaller vector size means a great amount of cost savings while storing them in a vector database.

What is SageMaker JumpStart?

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. Developers can deploy foundation models to dedicated SageMaker instances within a network-isolated environment, and customize models using SageMaker for model training and deployment.

You can now discover and deploy a Jina Embeddings v2 model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines and Amazon SageMaker Debugger. With SageMaker JumpStart, the model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.

Jina Embeddings models are available in AWS Marketplace so you can integrate them directly into your deployments when working in SageMaker.

AWS Marketplace enables you to find third-party software, data, and services that run on AWS and manage them from a centralized location. AWS Marketplace includes thousands of software listings and simplifies software licensing and procurement with flexible pricing options and multiple deployment methods.

Solution overview

We’ve prepared a notebook that constructs and runs a RAG question answering system using Jina Embeddings and the Mixtral 8x7B LLM in SageMaker JumpStart.

In the following sections, we give you an overview of the main steps needed to bring a RAG application to life using generative AI models on SageMaker JumpStart. Although we omit some of the boilerplate code and installation steps in this post for reasons of readability, you can access the full Python notebook to run yourself.

Connecting to a Jina Embeddings v2 endpoint

To start using Jina Embeddings v2 models, complete the following steps:

In SageMaker Studio, choose JumpStart in the navigation pane.
Search for “jina” and you will see the provider page link and models available from Jina AI.
Choose Jina Embeddings v2 Base – en, which is Jina AI’s English language embeddings model.
Choose Deploy.
In the dialog that appears, choose Subscribe, which will redirect you to the model’s AWS Marketplace listing, where you can subscribe to the model after accepting the terms of usage.
After subscribing, return to the Sagemaker Studio and choose Deploy.
You will be redirected to the endpoint configuration page, where you can select the instance most suitable for your use case and provide a name for the endpoint.
Choose Deploy.

After you create the endpoint, you can connect to it with the following code snippet:

from jina_sagemaker import Client
 
client = Client(region_name=region)
# Make sure that you’ve given the same name my-jina-embeddings-endpoint to the Jumpstart endpoint in the previous step.
endpoint_name = "my-jina-embeddings-endpoint"
 
client.connect_to_endpoint(endpoint_name=endpoint_name)

Preparing a dataset for indexing

In this post, we use a public dataset from Kaggle (CC0: Public Domain) that contains audio transcripts from the popular YouTube channel Kurzgesagt – In a Nutshell, which has over 20 million subscribers.

Each row in this dataset contains the title of a video, its URL, and the corresponding text transcript.

Enter the following code:

df.head()

Because the transcript of these videos can be quite long (around 10 minutes), in order to find only the relevant content for answering users’ questions and not other parts of the transcripts that are unrelated, you can chunk each of these transcripts before indexing them:

def chunk_text(text, max_words=1024):
    """
    Divide text into chunks where each chunk contains the maximum number of full sentences under `max_words`.
    """
    sentences = text.split('.')
    chunk = []
    word_count = 0
 
    for sentence in sentences:
        sentence = sentence.strip(".")
        if not sentence:
          continue
 
        words_in_sentence = len(sentence.split())
        if word_count + words_in_sentence <= max_words:
            chunk.append(sentence)
            word_count += words_in_sentence
        else:
            # Yield the current chunk and start a new one
            if chunk:
              yield '. '.join(chunk).strip() + '.'
            chunk = [sentence]
            word_count = words_in_sentence
 
    # Yield the last chunk if it's not empty
    if chunk:
        yield ' '.join(chunk).strip() + '.'

The parameter max_words defines the maximum number of full words that can be in a chunk of indexed text. Many chunking strategies exist in academic and non-peer-reviewed literature that are more sophisticated than a simple word limit. However, for the purpose of simplicity, we use this technique in this post.

Index text embeddings for vector search

After you chunk the transcript text, you obtain embeddings for each chunk and link each chunk back to the original transcript and video title:

def generate_embeddings(text_df):
    """
    Generate an embedding for each chunk created in the previous step.
    """

    chunks = list(chunk_text(text_df['Text']))
    embeddings = []
 
    for i, chunk in enumerate(chunks):
      response = client.embed(texts=[chunk])
      chunk_embedding = response[0]['embedding']
      embeddings.append(np.array(chunk_embedding))
 
    text_df['chunks'] = chunks
    text_df['embeddings'] = embeddings
    return text_df
 
print("Embedding text chunks ...")
df = df.progress_apply(generate_embeddings, axis=1)

The dataframe df contains a column titled embeddings that can be put into any vector database of your choice. Embeddings can then be retrieved from the vector database using a function such as find_most_similar_transcript_segment(query, n), which will retrieve the n closest documents to the given input query by a user.

Prompt a generative LLM endpoint

For question answering based on an LLM, you can use the Mistral 7B-Instruct model on SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel
from string import Template

# Define the LLM to be used and deploy through Jumpstart.
jumpstart_model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct", role=role)
model_predictor = jumpstart_model.deploy()

# Define the prompt template to be passed to the LLM
prompt_template = Template("""
  <s>[INST] Answer the question below only using the given context.
  The question from the user is based on transcripts of videos from a YouTube
    channel.
  The context is presented as a ranked list of information in the form of
    (video-title, transcript-segment), that is relevant for answering the
    user's question.
  The answer should only use the presented context. If the question cannot be
    answered based on the context, say so.
 
  Context:
  1. Video-title: $title_1, transcript-segment: $segment_1
  2. Video-title: $title_2, transcript-segment: $segment_2
  3. Video-title: $title_3, transcript-segment: $segment_3
 
  Question: $question
 
  Answer: [/INST]
""")

Query the LLM

Now, for a query sent by a user, you first find the semantically closest n chunks of transcripts from any video of Kurzgesagt (using vector distance between embeddings of chunks and the users’ query), and provide those chunks as context to the LLM for answering the users’ query:

# Define the query and insert it into the prompt template together with the context to be used to answer the question
question = "Can climate change be reversed by individuals' actions?"
search_results = find_most_similar_transcript_segment(question)
 
prompt_for_llm = prompt_template.substitute(
    question = question,
    title_1 = df.iloc[search_results[0][1]]["Title"].strip(),
    segment_1 = search_results[0][0],
    title_2 = df.iloc[search_results[1][1]]["Title"].strip(),
    segment_2 = search_results[1][0],
    title_3 = df.iloc[search_results[2][1]]["Title"].strip(),
    segment_3 = search_results[2][0]
)

# Generate the answer to the question passed in the propt
payload = {"inputs": prompt_for_llm}
model_predictor.predict(payload)

Based on the preceding question, the LLM might respond with an answer such as the following:

Based on the provided context, it does not seem that individuals can solve climate change solely through their personal actions. While personal actions such as using renewable energy sources and reducing consumption can contribute to mitigating climate change, the context suggests that larger systemic changes are necessary to address the issue fully.

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

model_predictor.delete_model()
model_predictor.delete_endpoint()

Conclusion

By taking advantage of the features of Jina Embeddings v2 to develop RAG applications, together with the streamlined access to state-of-the-art models on SageMaker JumpStart, developers and businesses are now empowered to create sophisticated AI solutions with ease.

Jina Embeddings v2’s extended context length, support for bilingual documents, and small model size enables enterprises to quickly build natural language processing use cases based on their internal datasets without relying on external APIs.

Get started with SageMaker JumpStart today, and refer to the GitHub repository for the complete code to run this sample.

Connect with Jina AI

Jina AI remains committed to leadership in bringing affordable and accessible AI embeddings technology to the world. Our state-of-the-art text embedding models support English and Chinese and soon will support German, with other languages to follow.

For more information about Jina AI’s offerings, check out the Jina AI website or join our community on Discord.

About the Authors

Francesco Kruk is Product Managment intern at Jina AI and is completing his Master’s at ETH Zurich in Management, Technology, and Economics. With a strong business background and his knowledge in machine learning, Francesco helps customers implement RAG solutions using Jina Embeddings in an impactful way.

Saahil Ognawala is Head of Product at Jina AI based in Munich, Germany. He leads the development of search foundation models and collaborates with clients worldwide to enable quick and efficient deployment of state-of-the-art generative AI products. With an academic background in machine learning, Saahil is now interested in scaled applications of generative AI in the knowledge economy.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Detect email phishing attempts using Amazon Comprehend

June 5, 2024

by Ajeet Tewari Amazon AWS

Phishing is the process of attempting to acquire sensitive information such as usernames, passwords and credit card details by masquerading as a trustworthy entity using email, telephone or text messages. There are many types of phishing based on the mode of communication and targeted victims. In an Email phishing attempt, an email is sent as a mode of communication to group of people. There are traditional rule-based approaches to detect email phishing. However, new trends are emerging that are hard to handle with a rule-based approach. There is need to use machine learning (ML) techniques to augment rule-based approaches for email phishing detection.

In this post, we show how to use Amazon Comprehend Custom to train and host an ML model to classify if the input email is an phishing attempt or not. Amazon Comprehend is a natural-language processing (NLP) service that uses ML to uncover valuable insights and connections in text. You can use Amazon Comprehend to identify the language of the text; extract key phrases, places, people, brands, or events; understand sentiment about products or services; and identify the main topics from a library of documents. You can customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions. Comprehend Custom builds customized NLP models on your behalf, using training data that you provide. Comprehend Custom supports custom classification and custom entity recognition.

Solution overview

This post explains how you can use Amazon Comprehend to easily train and host an ML based model to detect phishing attempt. The following diagram shows how the phishing detection works.

You can use this solution with your email servers in which emails are passed through this phishing detector. When an email is flagged as a phishing attempt, the email recipient still gets the email in their mailbox, but they can be shown an additional banner highlighting a warning to the user.

You can use this solution for experimentation with the use case, but AWS recommends building a training pipeline for your environments. For details on how to build a classification pipeline with Amazon Comprehend, see Build a classification pipeline with Amazon Comprehend custom classification.

We walk through the following steps to build the phishing detection model:

Collect and prepare the dataset.
Load the data in an Amazon Simple Storage Service (Amazon S3) bucket.
Create the Amazon Comprehend custom classification model.
Create the Amazon Comprehend custom classification model endpoint.
Test the model.

Prerequisites

Before diving into this use case, complete the following prerequisites:

Set up an AWS account.
Create an S3 bucket. For instructions, see Create your first S3 bucket.
Download the email-trainingdata.csv and upload the file to the S3 bucket.

Collect and prepare the dataset

Your training data should have both phishing and non-phishing emails. Email users with in the organization are asked to report phishing through their email clients. Gather all these phishing reports and examples of non-phishing emails to prepare the training data. You should have a minimum 10 examples per class. Label phishing emails as phishing and non-phishing emails as nonphishing. For minimum training requirements, see General quotas for document classification. Although minimum labels per class is a starting point, it’s recommended to provide hundreds of labels per class for performance on classification tasks across new inputs.

For custom classification, you train the model in either single-label mode or multi-label mode. Single-label mode associates a single class with each document. Multi-label mode associates one or more classes with each document. For this case, we will use single-label mode – phishing or nonphishing. The individual classes are mutually exclusive. For example, you can classify an email as phishing or not-phishing, but not both.

Custom classification supports models that you train with plain-text documents and models that you train with native documents (such as PDF, Word, or images). For more information about classifier models and their supported document types, see Training classification models. For a plain-text model, you can provide classifier training data as a CSV file or as an augmented manifest file that you create using Amazon SageMaker Ground Truth. The CSV file or augmented manifest file includes the text for each training document, and its associated labels.For a native document model, you provide classifier training data as a CSV file. The CSV file includes the file name for each training document and its associated labels. You include the training documents in the S3 input folder for the training job.

For this case, we will train a plain-text model using CSV file format. For each row, the first column contains the class label value. The second column contains an example text document for that class. Each row must end with n or rn characters.

The following example shows a CSV file containing two documents.

CLASS,Text of document 1

CLASS,Text of document 2

The following example shows two rows of a CSV file that trains a custom classifier to detect whether an email message is phishing:

phishing, “Hi, we need account details and SSN information to complete the payment. Please furnish your credit card details in the attached form.”

nonphishing,” Dear Sir / Madam, your latest statement was mailed to your communication address. After your payment is received, you will receive a confirmation text message at your mobile number. Thanks, customer support”

For information about preparing your training documents, see Preparing classifier training data.

Load the data in the S3 bucket

Load the training data in CSV format to the S3 bucket you created in the prerequisite steps. For instructions, refer to Uploading objects.

Create the Amazon Comprehend custom classification model

Custom classification supports two types of classifier models: plain-text models and native document models. A plain-text model classifies documents based on their text content. You can train the plain-text model using documents in one of following languages: English, Spanish, German, Italian, French, or Portuguese. The training documents for a given classifier must all use the same language. A native document model has the ability to process both scanned or digital semi-structured documents like PDFs, Microsoft Word documents, and images in their native format. A native document model also classifies documents based on text content. A native document model can also use additional signals, such as from the layout of the document. You train a native document model with native documents for the model to learn the layout information. You train the model using semi-structured documents, which includes the following document types such as digital and scanned PDF documents and Word documents; Images sunch as JPG files, PNG files, and single-page TIFF files and Amazon Textract API output JSON files. AWS recommends using a plain-text model to classify plain-text documents and a native document model to classify semi-structured documents.

Data specification for the custom classification model can be represented as follows.

You can train a custom classifier using either the Amazon Comprehend console or API. Allow several minutes to a few hours for the classification model creation to complete. The length of time varies based on the size of your input documents.

For training a customer classifier on the Amazon Comprehend console, set the following data specification options.

On the Classifiers page of the Amazon Comprehend console, the new classifier appears in the table, showing Submitted as its status. When the classifier starts processing the training documents, the status changes to Training. When a classifier is ready to use, the status changes to Trained or Trained with warnings. If the status is Trained with Warnings, review the skipped files folder in the classifier training output.

If Amazon Comprehend encountered errors during creation or training, the status changes to In error. You can choose a classifier job in the table to get more information about the classifier, including any error messages.

After training the model, Amazon Comprehend tests the custom classifier model. If you don’t provide a test dataset, Amazon Comprehend trains the model with 90% of the training data. It reserves 10% of the training data to use for testing. If you do provide a test dataset, the test data must include at least one example for each unique label in the training dataset.

After Amazon Comprehend completes the custom classifier model training, it creates output files in the Amazon S3 output location that you specified in the CreateDocumentClassifier API request or the equivalent Amazon Comprehend console request. These output files are a confusion matrix and additional outputs for native document models. The format of the confusion matrix varies, depending on whether you trained your classifier using multi-class mode or multi-label mode.

After Amazon Comprehend creates the classifier model, the confusion matrix is available in the confusion_matrix.json file in the Amazon S3 output location. This confusion matrix provides metrics on how well the model performed in training. This matrix shows a matrix of labels that the model predicted, compared to the actual document labels. Amazon Comprehend uses a portion of the training data to create the confusion matrix. The following JSON file represents the matrix in confusion_matrix.json as an example.

Amazon Comprehend provides metrics to help you estimate how well a custom classifier performs. Amazon Comprehend calculates the metrics using the test data from the classifier training job. The metrics accurately represent the performance of the model during training, so they approximate the model performance for classification of similar data.

Use the Amazon Comprehend console or API operations such as DescribeDocumentClassifier to retrieve the metrics for a custom classifier.

The actual output of many binary classification algorithms is a prediction score. The score indicates the system’s certainty that the given observation belongs to the positive class. To make the decision about whether the observation should be classified as positive or negative, as a consumer of this score, you interpret the score by picking a classification threshold and comparing the score against it. Any observations with scores higher than the threshold are predicted as the positive class, and scores lower than the threshold are predicted as the negative class.

Create the Amazon Comprehend custom classification model endpoint

After you train a custom classifier, you can classify documents using Real-time analysis or an analysis job. Real-time analysis takes a single document as input and returns the results synchronously. An analysis job is an asynchronous job to analyze large documents or multiple documents in one batch. The following are the different options for using the custom classifier model.

Create an endpoint for the trained model. For instructions, refer to Real-tome analysis for customer classification (console). Amazon Comprehend assigns throughput to an endpoint using Inference Units (IU). An IU represents data throughput of 100 characters per second. You can provision the endpoint with up to 10 IU. You can scale the endpoint throughput either up or down by updating the endpoint. Endpoints are billed on 1-second increments, with a minimum of 60 seconds. Charges will continue to incur from the time you start the endpoint until it is deleted even if no documents are analyzed.

Test the Model

After the endpoint is ready, you can run the real-time analysis from the Amazon Comprehend console.

The sample input represents the email text, which is used for real-time analysis to detect if the email text is a phishing attempt or not.

Amazon Comprehend analyzes the input data using the custom model. Amazon Comprehend displays the discovered classes, along with a confidence assessment for each class. The insights section shows the inference results with confidence levels of the nonphishing and phishing classes. You can decide the threshold to decide the class of the inference. In this case, nonphishing is the inference results because this has more confidence than the phishing class. The model detects the input email text is a non-phishing email.

To integrate this capability of phishing detection in your real-world applications, you can use the Amazon API Gateway REST API with an AWS Lambda integration. Refer to the serverless pattern in Amazon API Gateway to AWS Lambda to Amazon Comprehend to know more.

Clean up

When you no longer need your endpoint, you should delete it so that you stop incurring costs from it. Also, delete the data file from S3 bucket. For more information on costs, see Amazon Comprehend Pricing.

Conclusion

In this post, we walked you through the steps to create a phishing attempt detector using Amazon Comprehend custom classification. You can customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions.

You can also visit the Amazon Comprehend Developer Guide, GitHub repository and Amazon Comprehend developer resources for videos, tutorials, blogs, and more.

About the author

Ajeet Tewari is a Solutions Architect for Amazon Web Services. He works with enterprise customers to help them navigate their journey to AWS. His specialties include architecting and implementing highly scalable OLTP systems and leading strategic AWS initiatives.

How Skyflow creates technical content in days using Amazon Bedrock

June 5, 2024

by Manny Silva Amazon AWS

This guest post is co-written with Manny Silva, Head of Documentation at Skyflow, Inc.

Startups move quickly, and engineering is often prioritized over documentation. Unfortunately, this prioritization leads to release cycles that don’t match, where features release but documentation lags behind. This leads to increased support calls and unhappy customers.

Skyflow is a data privacy vault provider that makes it effortless to secure sensitive data and enforce privacy policies. Skyflow experienced this growth and documentation challenge in early 2023 as it expanded globally from 8 to 22 AWS Regions, including China and other areas of the world such as Saudi Arabia, Uzbekistan, and Kazakhstan. The documentation team, consisting of only two people, found itself overwhelmed as the engineering team, with over 60 people, updated the product to support the scale and rapid feature release cycles.

Given the critical nature of Skyflow’s role as a data privacy company, the stakes were particularly high. Customers entrust Skyflow with their data and expect Skyflow to manage it both securely and accurately. The accuracy of Skyflow’s technical content is paramount to earning and keeping customer trust. Although new features were released every other week, documentation for the features took an average of 3 weeks to complete, including drafting, review, and publication. The following diagram illustrates their content creation workflow.

Looking at our documentation workflows, we at Skyflow discovered areas where generative artificial intelligence (AI) could improve our efficiency. Specifically, creating the first draft—often referred to as overcoming the “blank page problem”—is typically the most time-consuming step. The review process could also be long depending on the number of inaccuracies found, leading to additional revisions, additional reviews, and additional delays. Both drafting and reviewing needed to be shorter to make doc target timelines match those of engineering.

To do this, Skyflow built VerbaGPT, a generative AI tool based on Amazon Bedrock. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using the AWS tools without having to manage any infrastructure. With Amazon Bedrock, VerbaGPT is able to prompt large language models (LLMs), regardless of model provider, and uses Retrieval Augmented Generation (RAG) to provide accurate first drafts that make for quick reviews.

In this post, we share how Skyflow improved their workflow to create documentation in days instead of weeks using Amazon Bedrock.

Solution overview

VerbaGPT uses Contextual Composition (CC), a technique that incorporates a base instruction, a template, relevant context to inform the execution of the instruction, and a working draft, as shown in the following figure. For the instruction, VerbaGPT tells the LLM to create content based on the specified template, evaluate the context to see if it’s applicable, and revise the draft accordingly. The template includes the structure of the desired output, expectations for what sort of information should exist in a section, and one or more examples of content for each section to guide the LLM on how to process context and draft content appropriately. With the instruction and template in place, VerbaGPT includes as much available context from RAG results as it can, then sends that off for inference. The LLM returns the revised working draft, which VerbaGPT then passes back into a new prompt that includes the same instruction, the same template, and as much context as it can fit, starting from where the previous iteration left off. This repeats until all context is considered and the LLM outputs a draft matching the included template.

The following figure illustrates how Skyflow deployed VerbaGPT on AWS. The application is used by the documentation team and internal users. The solution involves deploying containers on Amazon Elastic Kubernetes Service (Amazon EKS) that host a Streamlit user interface and a backend LLM gateway that is able to invoke Amazon Bedrock or local LLMs, as needed. Users upload documents and prompt VerbaGPT to generate new content. In the LLM gateway, prompts are processed in Python using LangChain and Amazon Bedrock.

When building this solution on AWS, Skyflow followed these steps:

Choose an inference toolkit and LLMs.
Build the RAG pipeline.
Create a reusable, extensible prompt template.
Create content templates for each content type.
Build an LLM gateway abstraction layer.
Build a frontend.

Let’s dive into each step, including the goals and requirements and how they were addressed.

Choose an inference toolkit and LLMs

The inference toolkit you choose, if any, dictates your interface with your LLMs and what other tooling is available to you. VerbaGPT uses LangChain instead of directly invoking LLMs. LangChain has broad adoption in the LLM community, so there was a present and likely future ability to take advantage of the latest advancements and community support.

When building a generative AI application, there are many factors to consider. For instance, Skyflow wanted the flexibility to interact with different LLMs depending on the use case. We also needed to keep context and prompt inputs private and secure, which meant not using LLM providers who would log that information or fine-tune their models on our data. We needed to have a variety of models with unique strengths at our disposal (such as long context windows or text labeling) and to have inference redundancy and fallback options in case of outages.

Skyflow chose Amazon Bedrock for its robust support of multiple FMs and its focus on privacy and security. With Amazon Bedrock, all traffic remains inside AWS. VerbaGPT’s primary foundation model is Anthropic Claude 3 Sonnet on Amazon Bedrock, chosen for its substantial context length, though it also uses Anthropic Claude Instant on Amazon Bedrock for chat-based interactions.

Build the RAG pipeline

To deliver accurate and grounded responses from LLMs without the need for fine-tuning, VerbaGPT uses RAG to fetch data related to the user’s prompt. By using RAG, VerbaGPT became familiar with the nuances of Skyflow’s features and procedures, enabling it to generate informed and complimentary content.

To build your own content creation solution, you collect your corpus into a knowledge base, vectorize it, and store it in a vector database. VerbaGPT includes all of Skyflow’s documentation, blog posts, and whitepapers in a vector database that it can query during inference. Skyflow uses a pipeline to embed content and store the embedding in a vector database. This embedding pipeline is a multi-step process, and everyone’s pipeline is going to look a little different. Skyflow’s pipeline starts by moving artifacts to a common data store, where they are de-identified. If your documents have personally identifiable information (PII), payment card information (PCI), personal health information (PHI), or other sensitive data, you might use a solution like Skyflow LLM Privacy Vault to make de-identifying your documentation straightforward. Next, the pipeline chunks the documents into pieces, then finally calculates vectors for the text chunks and stores them in FAISS, an open source vector store. VerbaGPT uses FAISS because it is fast and straightforward to use from Python and LangChain. AWS also has numerous vector stores to choose from for a more enterprise-level content creation solution, including Amazon Neptune, Amazon Relational Database Service (Amazon RDS) for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon Kendra, Amazon OpenSearch Service, and Amazon DocumentDB (with MongoDB compatibility). The following diagram illustrates the embedding generation pipeline.

When chunking your documents, keep in mind that LangChain’s default splitting strategy can be aggressive. This can result in chunks of content that are so small that they lack meaningful context and result in worse output, because the LLM has to make (largely inaccurate) assumptions about the context, producing hallucinations. This issue is particularly noticeable in Markdown files, where procedures were fragmented, code blocks were divided, and chunks were often only single sentences. Skyflow created its own Markdown splitter to work more accurately with VerbaGPT’s RAG output content.

Create a reusable, extensible prompt template

After you deploy your embedding pipeline and vector database, you can start intelligently prompting your LLM with a prompt template. VerbaGPT uses a system prompt that instructs the LLM how to behave and includes a directive to use content in the Context section to inform the LLM’s response.

The inference process queries the vector database with the user’s prompt, fetches the results above a certain similarity threshold, and includes the results in the system prompt. The solution then sends the system prompt and the user’s prompt to the LLM for inference.

The following is a sample prompt for drafting with Contextual Composition that includes all the necessary components, system prompt, template, context, a working draft, and additional instructions:

System: """You're an expert writer tasked with creating content according to the user's request.
Use Template to structure your output and identify what kind of content should go in each section.
Use WorkingDraft as a base for your response.
Evaluate Context against Template to identify if there is any pertinent information.
If needed, update or refine WorkingDraft using the supplied Context.
Treat User input as additional instruction."""
---
Template: """Write a detailed how-to guide in Markdown using the following template:
# [Title]
This guide explains how to [insert a brief description of the task].
[Optional: Specify when and why your user might want to perform the task.]
...
"""
---
Context: [
  { "text": "To authenticate with Skyflow's APIs and SDKs, you need to create a service account. To create...", "metadata": { "source": "service-accounts.md" }},
  ...
]
---
WorkingDraft: ""
---
User: Create a how-to guide for creating a service account.

Create content templates

To round out the prompt template, you need to define content templates that match your desired output, such as a blog post, how-to guide, or press release. You can jumpstart this step by sourcing high-quality templates. Skyflow sourced documentation templates from The Good Docs Project. Then, we adapted the how-to and concept templates to align with internal styles and specific needs. We also adapted the templates for use in prompt templates by providing instructions and examples per section. By clearly and consistently defining the expected structure and intended content of each section, the LLM was able to output content in the formats needed, while being both informative and stylistically consistent with Skyflow’s brand.

LLM gateway abstraction layer

Amazon Bedrock provides a single API to invoke a variety of FMs. Skyflow also wanted to have inference redundancy and fallback options in case VerbaGPT experienced Amazon Bedrock service limit exceeded errors. To that end, VerbaGPT has an LLM gateway that acts as an abstraction layer that is invoked.

The main component of the gateway is the model catalog, which can return a LangChain llm model object for the specified model, updated to include any parameters. You can create this with a simple if/else statement like that shown in the following code:

from langchain.chains import LLMChain
from langchain_community.llms import Bedrock, CTransformers

prompt = ""   		# User input
prompt_template = ""   	# The LangChain-formatted prompt template object
rag_results = get_rag(prompt)   # Results from vector database

# Get chain-able model object and token limit.
def get_model(model=str,options=dict):
    if model == "claude-instant-v1":
        llm = Bedrock(
            model_id="anthropic.claude-instant-v1",
            model_kwargs={"max_tokens_to_sample": options["max_output_tokens"], "temperature": options["temperature"]}
        )
        token_limit = 100000

    elif model == "claude-v2.1":
        llm = Bedrock(
            model_id="anthropic.claude-v2.1",
            model_kwargs={"max_tokens_to_sample":  options["max_output_tokens"], "temperature": options["temperature"]}
        )
        token_limit = 200000

    elif model == "llama-2":
        config = {
            "context_length": 4096,
            "max_new_tokens": options["max_output_tokens"],
            "stop": [
                "Human:",
            ],
        }
        llm = CTransformers(
            model="TheBloke/Llama-2-7b-Chat-GGUF",
            model_file="llama-2-7b-chat.Q4_K_M.gguf",
            model_type="llama",
            config=config,
        )
        token_limit = 4096
  
    return llm, token_limit

llm, token_limit = get_model("claude-v2.1")

chain = LLMChain(
    llm=llm,
    prompt=prompt_template
)

response = chain.run({"input": prompt, "context":rag_results})

By mapping standard input formats into the function and handling all custom LLM object construction within the function, the rest of the code stays clean by using LangChain’s llm object.

Build a frontend

The final step was to add a UI on top of the application to hide the inner workings of LLM calls and context. A simple UI is key for generative AI applications, so users can efficiently prompt the LLMs without worrying about the details unnecessary to their workflow. As shown in the solution architecture, VerbaGPT uses Streamlit to quickly build useful, interactive UIs that allow users to upload documents for additional context and draft new documents rapidly using Contextual Composition. Streamlit is Python based, which makes it straightforward for data scientists to be efficient at building UIs.

Results

By using the power of Amazon Bedrock for inferencing and Skyflow for data privacy and sensitive data de-identification, your organization can significantly speed up the production of accurate, secure technical documents, just like the solution shown in this post. Skyflow was able to use existing technical content and best-in-class templates to reliably produce drafts of different content types in minutes instead of days. For example, given a product requirements document (PRD) and an engineering design document, VerbaGPT can produce drafts for a how-to guide, conceptual overview, summary, release notes line item, press release, and blog post within 10 minutes. Normally, this would take multiple individuals from different departments multiple days each to produce.

The new content flow shown in the following figure moves generative AI to the front of all technical content Skyflow creates. During the “Create AI draft” step, VerbaGPT generates content in the approved style and format in just 5 minutes. Not only does this solve the blank page problem, first drafts are created with less interviewing or asking engineers to draft content, freeing them to add value through feature development instead.

The security measures Amazon Bedrock provides around prompts and inference aligned with Skyflow’s commitment to data privacy, and allowed Skyflow to use additional kinds of context, such as system logs, without the concern of compromising sensitive information in third-party systems.

As more people at Skyflow used the tool, they wanted additional content types available: VerbaGPT now has templates for internal reports from system logs, email templates from common conversation types, and more. Additionally, although Skyflow’s RAG context is clean, VerbaGPT is integrated with Skyflow LLM Privacy Vault to de-identify sensitive data in user inference inputs, maintaining Skyflow’s stringent standards of data privacy and security even while using the power of AI for content creation.

Skyflow’s journey in building VerbaGPT has drastically shifted content creation, and the toolkit wouldn’t be as robust, accurate, or flexible without Amazon Bedrock. The significant reduction in content creation time—from an average of around 3 weeks to as little as 5 days, and sometimes even a remarkable 3.5 days—marks a substantial leap in efficiency and productivity, and highlights the power of AI in enhancing technical content creation.

Conclusion

Don’t let your documentation lag behind your product development. Start creating your technical content in days instead of weeks, while maintaining the highest standards of data privacy and security. Learn more about Amazon Bedrock and discover how Skyflow can transform your approach to data privacy.

If you’re scaling globally and have privacy or data residency needs for your PII, PCI, PHI, or other sensitive data, reach out to your AWS representative to see if Skyflow is available in your region.

About the authors

Manny Silva is Head of Documentation at Skyflow and the creator of Doc Detective. Technical writer by day and engineer by night, he’s passionate about intuitive and scalable developer experiences and likes diving into the deep end as the 0th developer.

Jason Westra is a Senior Solutions Architect for AWS AI/ML startups. He provides guidance and technical assistance that enables customers to build scalable, highly available, secure AI and ML workloads in AWS Cloud.

Streamline custom model creation and deployment for Amazon Bedrock with Provisioned Throughput using Terraform

June 4, 2024

by Josh Famestad Amazon AWS

As customers seek to incorporate their corpus of knowledge into their generative artificial intelligence (AI) applications, or to build domain-specific models, their data science teams often want to conduct A/B testing and have repeatable experiments. In this post, we discuss a solution that uses infrastructure as code (IaC) to define the process of retrieving and formatting data for model customization and initiating the model customization. This enables you to version and iterate as needed.

With Amazon Bedrock, you can privately and securely customize foundation models (FMs) with your own data to build applications that are specific to your domain, organization, and use case. With custom models, you can create unique user experiences that reflect your company’s style, voice, and services.

Amazon Bedrock supports two methods of model customization:

Fine-tuning allows you to increase model accuracy by providing your own task-specific labeled training dataset and further specialize your FMs.
Continued pre-training allows you to train models using your own unlabeled data in a secure and managed environment and supports customer-managed keys. Continued pre-training helps models become more domain-specific by accumulating more robust knowledge and adaptability—beyond their original training.

In this post, we provide guidance on how to create an Amazon Bedrock custom model using HashiCorp Terraform that allows you to automate the process, including preparing datasets used for customization.

Terraform is an IaC tool that allows you to manage AWS resources, software as a service (SaaS) resources, datasets, and more, using declarative configuration. Terraform provides the benefits of automation, versioning, and repeatability.

Solution overview

We use Terraform to download a public dataset from the Hugging Face Hub, convert it to JSONL format, and upload it to an Amazon Simple Storage Service (Amazon S3) bucket with a versioned prefix. We then create an Amazon Bedrock custom model using fine-tuning, and create a second model using continued pre-training. Lastly, we configure Provisioned Throughput for our new models so we can test and deploy the custom models for wider usage.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

The user runs the terraform apply The Terraform local-exec provisioner is used to run a Python script that downloads the public dataset DialogSum from the Hugging Face Hub. This is then used to create a fine-tuning training JSONL file.
An S3 bucket stores training, validation, and output data. The generated JSONL file is uploaded to the S3 bucket.
The FM defined in the Terraform configuration is used as the source for the custom model training job.
The custom model training job uses the fine-tuning training data stored in the S3 bucket to enrich the FM. Amazon Bedrock is able to access the data in the S3 bucket (including output data) due to the AWS Identity and Access Management (IAM) role defined in the Terraform configuration, which grants access to the S3 bucket.
When the custom model training job is complete, the new custom model is available for use.

The high-level steps to implement this solution are as follows:

Create and initialize a Terraform project.
Create data sources for context lookup.
Create an S3 bucket to store training, validation, and output data.
Create an IAM service role that allows Amazon Bedrock to run a model customization job, access your training and validation data, and write your output data to your S3 bucket.
Configure your local Python virtual environment.
Download the DialogSum public dataset and convert it to JSONL.
Upload the converted dataset to Amazon S3.
Create an Amazon Bedrock custom model using fine-tuning.
Configure custom model Provisioned Throughput for your models.

Prerequisites

This solution requires the following prerequisites:

An AWS account. If you don’t have an account, you can sign up for one.
Access to an IAM user or role that has the required permissions for creating a custom service role, creating an S3 bucket, uploading objects to that S3 bucket, and creating Amazon Bedrock model customization jobs.
Configured model access for Cohere Command-Light (if you plan to use that model).
Terraform v1.0.0 or newer installed.
python v3.8 or newer installed.

Create and initialize a Terraform project

Complete the following steps to create a new Terraform project and initialize it. You can work in a local folder of your choosing.

In your preferred terminal, create a new folder named bedrockcm and change to that folder:
1. If on Windows, use the following code:
```
md bedrockcm
cd bedrockcm
```
2. If on Mac or Linux, use the following code:
```
mkdir bedrockcm
cd bedrockcm
```

Now you can work in a text editor and enter in code.

In your preferred text editor, add a new file with the following Terraform code:

terraform {
  required_version = ">= 1.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.35.0"
    }
  }
}

Save the file in the root of the bedrockcm folder and name it main.tf.
In your terminal, run the following command to initialize the Terraform working directory:

terraform init

The output will contain a successful message like the following:

“Terraform has been successfully initialized”

In your terminal, validate the syntax for your Terraform files:

terraform validate

Create data sources for context lookup

The next step is to add configurations that define data sources that look up information about the context Terraform is currently operating in. These data sources are used when defining the IAM role and policies and when creating the S3 bucket. More information can be found in the Terraform documentation for aws_caller_identity, aws_partition, and aws_region.

In your text editor, add the following Terraform code to your main.tf file:

# Data sources to query the current context Terraform is operating in
data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}
data "aws_region" "current" {}

Save the file.

Create an S3 bucket

In this step, you use Terraform to create an S3 bucket to use during model customization and associated outputs. S3 bucket names are globally unique, so you use the Terraform data source aws_caller_identity, which allows you to look up the current AWS account ID, and use string interpolation to include the account ID in the bucket name. Complete the following steps:

Add the following Terraform code to your main.tf file:

# Create a S3 bucket
resource "aws_s3_bucket" "model_training" {
  bucket = "model-training-${data.aws_caller_identity.current.account_id}"
}

Save the file.

Create an IAM service role for Amazon Bedrock

Now you create the service role that Amazon Bedrock will assume to operate the model customization jobs.

You first create a policy document, assume_role_policy, which defines the trust relationship for the IAM role. The policy allows the bedrock.amazonaws.com service to assume this role. You use global condition context keys for cross-service confused deputy prevention. There are also two conditions you specify: the source account must match the current account, and the source ARN must be an Amazon Bedrock model customization job operating from the current partition, AWS Region, and current account.

Complete the following steps:

Add the following Terraform code to your main.tf file:

# Create a policy document to allow Bedrock to assume the role
data "aws_iam_policy_document" "assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["bedrock.amazonaws.com"]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [data.aws_caller_identity.current.account_id]
    }
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = ["arn:${data.aws_partition.current.partition}:bedrock:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:model-customization-job/*"]
    }
  }
}

The second policy document, bedrock_custom_policy, defines permissions for accessing the S3 bucket you created for model training, validation, and output. The policy allows the actions GetObject, PutObject, and ListBucket on the resources specified, which are the ARN of the model_training S3 bucket and all of the buckets contents. You will then create an aws_iam_policy resource, which creates the policy in AWS.

Add the following Terraform code to your main.tf file:

# Create a policy document to allow Bedrock to access the S3 bucket
data "aws_iam_policy_document" "bedrock_custom_policy" {
  statement {
    sid       = "AllowS3Access"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = [aws_s3_bucket.model_training.arn, "${aws_s3_bucket.model_training.arn}/*"]
  }
}

resource "aws_iam_policy" "bedrock_custom_policy" {
  name_prefix = "BedrockCM-"
  description = "Policy for Bedrock Custom Models customization jobs"
  policy      = data.aws_iam_policy_document.bedrock_custom_policy.json
}

Finally, the aws_iam_role resource, bedrock_custom_role, creates an IAM role with a name prefix of BedrockCM- and a description. The role uses assume_role_policy as its trust policy and bedrock_custom_policy as a managed policy to allow the actions specified.

Add the following Terraform code to your main.tf file:

# Create a role for Bedrock to assume
resource "aws_iam_role" "bedrock_custom_role" {
  name_prefix = "BedrockCM-"
  description = "Role for Bedrock Custom Models customization jobs"

  assume_role_policy  = data.aws_iam_policy_document.assume_role_policy.json
  managed_policy_arns = [aws_iam_policy.bedrock_custom_policy.arn]
}

Save the file.

Configure your local Python virtual environment

Python supports creating lightweight virtual environments, each with their own independent set of Python packages installed. You create and activate a virtual environment, and then install the datasets package.

In your terminal, in the root of the bedrockcm folder, run the following command to create a virtual environment:

python3 -m venv venv

Activate the virtual environment:
1. If on Windows, use the following command:
```
venvScriptsactivate
```
2. If on Mac or Linux, use the following command:
```
source venv/bin/activate
```

Now you install the datasets package via pip.

In your terminal, run the following command to install the datasets package:

pip3 install datasets

Download the public dataset

You now use Terraform’s local-exec provisioner to invoke a local Python script that will download the public dataset DialogSum from the Hugging Face Hub. The dataset is already divided into training, validation, and testing splits. This example uses just the training split.

You prepare the data for training by removing the id and topic columns, renaming the dialogue and summary columns, and truncating the dataset to 10,000 records. You then save the dataset in JSONL format. You could also use your own internal private datasets; we use a public dataset for example purposes.

You first create the local Python script named dialogsum-dataset-finetune.py, which is used to download the dataset and save it to disk.

In your text editor, add a new file with the following Python code:

import pandas as pd
from datasets import load_dataset

# Load the dataset from the huggingface hub
dataset = load_dataset("knkarthick/dialogsum")

# Convert the dataset to a pandas DataFrame
dft = dataset['train'].to_pandas()

# Drop the columns that are not required for fine-tuning
dft = dft.drop(columns=['id', 'topic'])

# Rename the columns to prompt and completion as required for fine-tuning.
# Ref: https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-prereq.html#model-customization-prepare
dft = dft.rename(columns={"dialogue": "prompt", "summary": "completion"})

# Limit the number of rows to 10,000 for fine-tuning
dft = dft.sample(10000,
    random_state=42)

# Save DataFrame as a JSONL file, with each line as a JSON object
dft.to_json('dialogsum-train-finetune.jsonl', orient='records', lines=True)

Save the file in the root of the bedrockcm folder and name it dialogsum-dataset-finetune.py.

Next, you edit the main.tf file you have been working in and add the terraform_data resource type, uses a local provisioner to invoke your Python script.

In your text editor, edit the main.tf file and add the following Terraform code:

resource "terraform_data" "training_data_fine_tune_v1" {
  input = "dialogsum-train-finetune.jsonl"

  provisioner "local-exec" {
    command = "python dialogsum-dataset-finetune.py"
  }
}

Upload the converted dataset to Amazon S3

Terraform provides the aws_s3_object resource type, which allows you to create and manage objects in S3 buckets. In this step, you reference the S3 bucket you created earlier and the terraform_data resource’s output attribute. This output attribute is how you instruct the Terraform resource graph that these resources need to be created with a dependency order.

In your text editor, edit the main.tf file and add the following Terraform code:

resource "aws_s3_object" "v1_training_fine_tune" {
  bucket = aws_s3_bucket.model_training.id
  key    = "training_data_v1/${terraform_data.training_data_fine_tune_v1.output}"
  source = terraform_data.training_data_fine_tune_v1.output
}

Create an Amazon Bedrock custom model using fine-tuning

Amazon Bedrock has multiple FMs that support customization with fine-tuning. To see a list of the models available, use the following AWS Command Line Interface (AWS CLI) command:

In your terminal, run the following command to list the FMs that support customization by fine-tuning:

aws bedrock list-foundation-models --by-customization-type FINE_TUNING

You use the Cohere Command-Light FM for this model customization. You add a Terraform data source to query the foundation model ARN using the model name. You then create the Terraform resource definition for aws_bedrock_custom_model, which creates a model customization job, and immediately returns.

The time it takes for model customization is non-deterministic, and is based on the input parameters, model used, and other factors.

In your text editor, edit the main.tf file and add the following Terraform code:

data "aws_bedrock_foundation_model" "cohere_command_light_text_v14" {
  model_id = "cohere.command-light-text-v14:7:4k"
}

resource "aws_bedrock_custom_model" "cm_cohere_v1" {
  custom_model_name     = "cm_cohere_v001"
  job_name              = "cm.command-light-text-v14.v001"
  base_model_identifier = data.aws_bedrock_foundation_model.cohere_command_light_text_v14.model_arn
  role_arn              = aws_iam_role.bedrock_custom_role.arn
  customization_type    = "FINE_TUNING"

  hyperparameters = {
    "epochCount"             = "1"
    "batchSize"              = "8"
    "learningRate"           = "0.00001"
    "earlyStoppingPatience"  = "6"
    "earlyStoppingThreshold" = "0.01"
    "evalPercentage"         = "20.0"
  }

  output_data_config {
    s3_uri = "s3://${aws_s3_bucket.model_training.id}/output_data_v1/"
  }

  training_data_config {
    s3_uri = "s3://${aws_s3_bucket.model_training.id}/training_data_v1/${terraform_data.training_data_fine_tune_v1.output}"
  }
}

Save the file.

Now you use Terraform to create the data sources and resources defined in your main.tf file, which will start a model customization job.

In your terminal, run the following command to validate the syntax for your Terraform files:

terraform validate

Run the following command to apply the configuration you created. Before creating the resources, Terraform will describe all the resources that will be created so you can verify your configuration:

terraform apply

Terraform will generate a plan and ask you to approve the actions, which will look similar to the following code:

...

Plan: 6 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:

Enter yes to approve the changes.

Terraform will now apply your configuration. This process runs for a few minutes. At this time, your custom model is not yet ready for use; it will be in a Training state. Wait for training to finish before continuing. You can review the status on the Amazon Bedrock console on the Custom models page.

When the process is complete, you receive a message like the following:

Apply complete! Resources: 6 added, 0 changed, 0 destroyed.

You can also view the status on the Amazon Bedrock console.

You have now created an Amazon Bedrock custom model using fine-tuning.

Configure custom model Provisioned Throughput

Amazon Bedrock allows you to run inference on custom models by purchasing Provisioned Throughput. This guarantees a consistent level of throughput in exchange for a term commitment. You specify the number of model units needed to meet your application’s performance needs. For evaluating custom models initially, you can purchase Provisioned Throughput hourly (on-demand) with no long-term commitment. With no commitment, a quota of one model unit is available per Provisioned Throughput.

You create a new resource for Provisioned Throughput, associate one of your custom models, and provide a name. You omit the commitment_duration attribute to use on-demand.

In your text editor, edit the main.tf file and add the following Terraform code:

resource "aws_bedrock_provisioned_model_throughput" "cm_cohere_provisioned_v1" {
  provisioned_model_name = "${aws_bedrock_custom_model.cm_cohere_v1.custom_model_name}-provisioned"
  model_arn              = aws_bedrock_custom_model.cm_cohere_v1.custom_model_arn
  model_units            = 1 
}

Save the file.

Now you use Terraform to create the resources defined in your main.tf file.

In your terminal, run the following command to re-initialize the Terraform working directory:

terraform init

The output will contain a successful message like the following:

“Terraform has been successfully initialized”

Validate the syntax for your Terraform files:

terraform validate

Run the following command to apply the configuration you created:

terraform apply

Best practices and considerations

Note the following best practices when using this solution:

Data and model versioning – You can version your datasets and models by using version identifiers in your S3 bucket prefixes. This allows you to compare model efficacy and outputs. You could even operate a new model in a shadow deployment so that your team can evaluate the output relative to your models being used in production.
Data privacy and network security – With Amazon Bedrock, you are in control of your data, and all your inputs and customizations remain private to your AWS account. Your data, such as prompts, completions, custom models, and data used for fine-tuning or continued pre-training, is not used for service improvement and is never shared with third-party model providers. Your data remains in the Region where the API call is processed. All data is encrypted in transit and at rest. You can use AWS PrivateLink to create a private connection between your VPC and Amazon Bedrock.
Billing – Amazon Bedrock charges for model customization, storage, and inference. Model customization is charged per tokens processed. This is the number of tokens in the training dataset multiplied by the number of training epochs. An epoch is one full pass through the training data during customization. Model storage is charged per month, per model. Inference is charged hourly per model unit using Provisioned Throughput. For detailed pricing information, see Amazon Bedrock Pricing.
Custom models and Provisioned Throughput – Amazon Bedrock allows you to run inference on custom models by purchasing Provisioned Throughput. This guarantees a consistent level of throughput in exchange for a term commitment. You specify the number of model units needed to meet your application’s performance needs. For evaluating custom models initially, you can purchase Provisioned Throughput hourly with no long-term commitment. With no commitment, a quota of one model unit is available per Provisioned Throughput. You can create up to two Provisioned Throughputs per account.
Availability – Fine-tuning support on Meta Llama 2, Cohere Command Light, and Amazon Titan Text FMs is available today in Regions US East (N. Virginia) and US West (Oregon). Continued pre-training is available today in public preview in Regions US East (N. Virginia) and US West (Oregon). To learn more, visit the Amazon Bedrock Developer Experience and check out Custom models.

Clean up

When you no longer need the resources created as part of this post, clean up those resources to save associated costs. You can clean up the AWS resources created in this post using Terraform with the terraform destroy command.

First, you need to modify the configuration of the S3 bucket in the main.tf file to enable force destroy so the contents of the bucket will be deleted, so the bucket itself can be deleted. This will remove all of the sample data contained in the S3 bucket as well as the bucket itself. Make sure there is no data you want to retain in the bucket before proceeding.

Modify the declaration of your S3 bucket to set the force_destroy attribute of the S3 bucket:

# Create a S3 bucket
resource "aws_s3_bucket" "model_training" {
  bucket = "model-training-${data.aws_caller_identity.current.account_id}"
  force_destroy = true
}

Run the terraform apply command to update the S3 bucket with this new configuration:

terraform apply

Run the terraform destroy command to delete all resources created as part of this post:

terraform destroy

Conclusion

In this post, we demonstrated how to create Amazon Bedrock custom models using Terraform. We introduced GitOps to manage model configuration and data associated with your custom models.

We recommend testing the code and examples in your development environment, and making appropriate changes as required to use them in production. Consider your model consumption requirements when defining your Provisioned Throughput.

We welcome your feedback! If you have questions or suggestions, leave them in the comments section.

About the Authors

Josh Famestad is a Solutions Architect at AWS helping public sector customers accelerate growth, add agility, and reduce risk with cloud-based solutions.

Kevon Mayers is a Solutions Architect at AWS. Kevon is a Core Contributor for Terraform and has led multiple Terraform initiatives within AWS. Prior to joining AWS, he was working as a DevOps engineer and developer, and before that was working with the GRAMMYs/The Recording Academy as a studio manager, music producer, and audio engineer.

Tyler Lynch is a Principal Solution Architect at AWS. Tyler leads Terraform provider engineering at AWS and is a Core Contributor for Terraform.

Boost productivity with video conferencing transcripts and summaries with the Amazon Chime SDK Meeting Summarizer solution

June 4, 2024

by Adam Neumiller Amazon AWS

Businesses today heavily rely on video conferencing platforms for effective communication, collaboration, and decision-making. However, despite the convenience these platforms offer, there are persistent challenges in seamlessly integrating them into existing workflows. One of the major pain points is the lack of comprehensive tools to automate the process of joining meetings, recording discussions, and extracting actionable insights from them. This gap results in inefficiencies, missed opportunities, and limited productivity, hindering the seamless flow of information and decision-making processes within organizations.

To address this challenge, we’ve developed the Amazon Chime SDK Meeting Summarizer application deployed with the Amazon Cloud Development Kit (AWS CDK). This application uses an Amazon Chime SDK SIP media application, Amazon Transcribe, and Amazon Bedrock to seamlessly join meetings, record meeting audio, and process recordings for transcription and summarization. By integrating these services programmatically through the AWS CDK, we aim to streamline the meeting workflow, empower users with actionable insights, and drive better decision-making outcomes. Our solution currently integrates with popular platforms such as Amazon Chime, Zoom, Cisco Webex, Microsoft Teams, and Google Meet.

In addition to deploying the solution, we’ll also teach you the intricacies of prompt engineering in this post. We guide you through addressing parsing and information extraction challenges, including speaker diarization, call scheduling, summarization, and transcript cleaning. Through detailed instructions and structured approaches tailored to each use case, we illustrate the effectiveness of Amazon Bedrock, powered by Anthropic Claude models.

Solution overview

The following infrastructure diagram provides an overview of the AWS services that are used to create this meeting summarization bot. The core services used in this solution are:

An Amazon Chime SDK SIP Media Application is used to dial into the meeting and record meeting audio
Amazon Transcribe is used to perform speech-to-text processing of the recorded audio, including speaker diarization
Anthropic Claude models in Amazon Bedrock are used to identify names, improve the quality of the transcript, and provide a detailed summary of the meeting

For a detailed explanation of the solution, refer to the Amazon Chime Meeting Summarizer documentation.

Prerequisites

Before diving into the project setup, make sure you have the following requirements in place:

Yarn – Yarn must be installed on your machine.
AWS account – You’ll need an active AWS account.
Enable Claude Anthropic models – These models should be enabled in your AWS account. For more information, see Model access.
Enable Amazon Titan – Amazon Titan should be activated in your AWS account. For more information, see Amazon Titan Models.

Refer to our GitHub repository for a step-by-step guide on deploying this solution.

Access the meeting with Amazon Chime SDK

To capture the audio of a meeting, an Amazon Chime SDK SIP media application will dial into the meeting using the meeting provider’s dial-in number. The Amazon Chime SDK SIP media application (SMA) is a programable telephony service that will make a phone call over the public switched telephone network (PSTN) and capture the audio. SMA uses a request/response model with an AWS Lambda function to process actions. In this demo, an outbound call is made using the CreateSipMediaApplicationCall API. This will cause the Lambda function to be invoked with a NEW_OUTBOUND_CALL event.

Because most dial-in mechanisms for meetings require a PIN or other identification to be made, the SMA will use the SendDigits action to send these dual tone multi-frequency (DTMF) digits to the meeting provider. When the application has joined the meeting, it will introduce itself using the Speak action and then record the audio using the RecordAudio action. This audio will be saved in MP3 format and saved to an Amazon Simple Storage Service (Amazon S3) bucket.

Speaker diarization with Amazon Transcribe

Because the SMA is joined to the meeting as a participant, the audio will be a single channel of all the participants. To process this audio file, Amazon Transcribe will be used with the ShowSpeakerLabels setting:

const response = await transcribeClient.send(
    new StartTranscriptionJobCommand({
        TranscriptionJobName: jobName,
        IdentifyLanguage: true,
        MediaFormat: 'wav',
        Media: {
            MediaFileUri: audioSource,
        },
        Settings: {
            ShowSpeakerLabels: true,
            MaxSpeakerLabels: 10,
        },
        OutputBucketName: `${BUCKET}`,
        OutputKey: `${PREFIX_TRANSCRIBE_S3}/`,
    }),
);

With speaker diarization, Amazon Transcribe will distinguish different speakers in the transcription output. The JSON file that is produced will include the transcripts and items along with speaker labels grouped by speaker with start and end timestamps. With this information provided by Amazon Transcribe, a turn-by-turn transcription can be generated by parsing through the JSON. The result will be a more readable transcription. See the following example:

spk_0: Hey Court , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
spk_1: Awesome . Yeah , thank you for inviting me . So ,
spk_0: uh can you tell me a little bit about uh the servers you’re currently using on premises ?
spk_1: Yeah . So for our servers , we currently have four web servers running with four gigabyte of RA M and two CP US and we’re currently running Linux as our operating system .
spk_0: Ok . And what are you using for your database ?
spk_1: Oh , yeah , for a database , we currently have a 200 gigabyte database running my as to will and I can’t remember the version . But um the thing about our database is sometimes it lags . So we’re really looking for a faster option for
spk_0: that . So , um when you’re , you’re noticing lags with reads or with rights to the database
spk_1: with , with reads .
spk_0: Yeah . Ok . Have you heard of uh read replicas ?
spk_1: I have not .
spk_0: Ok . That could be a potential uh solution to your problem . Um If you don’t mind , I’ll go ahead and just uh send you some information on that later for you and your team to review .
spk_1: Oh , yeah , thank you , Adam . Yeah . Anything that could help . Yeah , be really helpful .
spk_0: Ok , last question before I let you go . Um what are you doing uh to improve the security of your on premises ? Uh data ?
spk_1: Yeah , so , so number one , we have been experiencing some esto injection attacks . We do have a Palo Alto firewall , but we’re not able to fully achieve layer server protection . So we do need a better option for that .
spk_0: Have you ex have you experienced uh my sequel attacks in the past or sequel injections ?
spk_1: Yes.
spk_0: Ok , great . Uh All right . And then are you guys bound by any compliance standards like PC I DS S or um you know GDR ? Uh what’s another one ? Any of those C just uh
spk_1: we are bound by fate , moderate complaints . So .
spk_0: Ok . Well , you have to transition to fed ramp high at any time .
spk_1: Uh Not in the near future . No .
spk_0: Ok . All right , Court . Well , thanks for that uh context . I will be reaching out to you again here soon with a follow up email and uh yeah , I’m looking forward to talking to you again uh next week .
spk_1: Ok . Sounds good . Thank you , Adam for your help .
spk_0: All right . Ok , bye .
spk_1: All right . Take care .

Here, speakers have been identified based on the order they spoke. Next, we show you how to further enhance this transcription by identifying speakers using their names, rather than spk_0, spk_1, and so on.

Use Anthropic Claude and Amazon Bedrock to enhance the solution

This application uses a large language model (LLM) to complete the following tasks:

Speaker name identification
Transcript cleaning
Call summarization
Meeting invitation parsing

Prompt engineering for speaker name identification

The first task is to enhance the transcription by assigning names to speaker labels. These names are extracted from the transcript itself when a person introduces themselves and then are returned as output in JSON format by the LLM.

Special instructions are provided for cases where only one speaker is identified to provide consistency in the response structure. By following these instructions, the LLM will process the meeting transcripts and accurately extract the names of the speakers without additional words or spacing in the response. If no names are identified by the LLM, we prompted the model to return an Unknown tag.

In this demo, the prompts are designed using Anthropic Claude Sonnet as the LLM. You may need to tune the prompts to modify the solution to use another available model on Amazon Bedrock.

Human: You are a meeting transcript names extractor. Go over the transcript and extract the names from it. Use the following instructions in the <instructions></instructions> xml tags
<transcript> ${transcript} </transcript>
<instructions>
– Extract the names like this example – spk_0: “name1”, spk_1: “name2”.
– Only extract the names like the example above and do not add any other words to your response
– Your response should only have a list of “speakers” and their associated name separated by a “:” surrounded by {}
– if there is only one speaker identified then surround your answer with {}
– the format should look like this {“spk_0” : “Name”, “spk_1: “Name2”, etc.}, no unnecessary spacing should be added
</instructions>

Assistant: Should I add anything else in my answer?

Human: Only return a JSON formatted response with the Name and the speaker label associated to it. Do not add any other words to your answer. Do NOT EVER add any introductory sentences in your answer. Only give the names of the speakers actively speaking in the meeting. Only give the names of the speakers actively speaking in the meeting in the format shown above.

Assistant:

After the speakers are identified and returned in JSON format, we can replace the generic speaker labels with the name attributed speaker labels in the transcript. The result will be a more enhanced transcription:

Adam: Hey Court , how’s it going ?
Court: Hey Adam , it’s going good . How are you
Adam: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .
Court: Awesome . Yeah , thank you for inviting me . So ,
Adam: uh can you tell me a little bit about uh the servers you’re currently using on premises ?
Court: Yeah . So for our servers , we currently have four web servers running with four gigabyte of RA M and two CP US and we’re currently running Linux as our operating system .

… transcript continues….

But what if a speaker can’t be identified because they never introduced themselves. In that case, we want to let the LLM leave them as unknown, rather than try to force or a hallucinate a label.

We can add the following to the instructions:

If no name is found for a speaker, use UNKNOWN_X where X is the speaker label number

The following transcript has three speakers, but only two identified speakers. The LLM must label a speaker as UNKNOWN rather than forcing a name or other response on the speaker.

spk_0: Yeah .
spk_1: Uh Thank you for joining us Court . I am your account executive here at Aws . Uh Joining us on the call is Adam , one of our solutions architect at Aws . Adam would like to introduce yourself .
spk_2: Hey , everybody . High Court . Good to meet you . Uh As your dedicated solutions architect here at Aws . Uh my job is to help uh you at every step of your cloud journey . Uh My , my role involves helping you understand architecting on Aws , including best practices for the cloud . Uh So with that , I’d love it if you could just take a minute to introduce yourself and maybe tell me a little about what you’re currently running on premises .
spk_0: Yeah , great . It’s uh great to meet you , Adam . Um My name is Court . I am the V P of engineering here at any company . And uh yeah , really excited to hear what you can do for us .
spk_1: Thanks , work . Well , we , we talked a little bit of last time about your goals for migrating to Aws . I invited Adam to , to join us today to get a better understanding of your current infrastructure and other technical requirements .
spk_2: Yeah . So co could you tell me a little bit about what you’re currently running on premises ?
spk_1: Sure . Yeah , we’re , uh ,
spk_0: we’re running a three tier web app on premise .

When we give Claude Sonnet the option to not force a name , we see results like this:

{“spk_0”: “Court”, “spk_1”: “UNKNOWN_1”, “spk_2”: “Adam”}

Prompt engineering for cleaning the transcript

Now that the transcript has been diarized with speaker attributed names, we can use Amazon Bedrock to clean the transcript. Cleaning tasks include eliminating distracting filler words, identifying and rectifying misattributed homophones, and addressing any diarization errors stemming from subpar audio quality. For guidance on accomplishing these tasks using Anthropic Claude Sonnet models, see the provided prompt:

Human: You are a transcript editor, please follow the <instructions> tags.
<transcript> ${transcript} </transcript>
<instructions>
– The <transcript> contains a speaker diarized transcript
– Go over the transcript and remove all filler words. For example “um, uh, er, well, like, you know, okay, so, actually, basically, honestly, anyway, literally, right, I mean.”
– Fix any errors in transcription that may be caused by homophones based on the context of the sentence. For example, “one instead of won” or “high instead of hi”
– In addition, please fix the transcript in cases where diarization is improperly performed. For example, in some cases you will see that sentences are split between two speakers. In this case infer who the actual speaker is and attribute it to them.

– Please review the following example of this,

Input Example
Court: Adam you are saying the wrong thing. What
Adam: um do you mean, Court?

Output:
Court: Adam you are saying the wrong thing.
Adam: What do you mean, Court?

– In your response, return the entire cleaned transcript, including all of the filler word removal and the improved diarization. Only return the transcript, do not include any leading or trailing sentences. You are not summarizing. You are cleaning the transcript. Do not include any xml tags <>
</instructions>
Assistant:

Transcript Processing

After the initial transcript is passed into the LLM, it returns a polished transcript, free from errors. The following are excerpts from the transcript:

Speaker Identification

Input:

spk_0: Hey Court , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Well , uh hey , thanks for uh joining me today on this call . I’m excited to talk to you about uh architecting on Aws .

Output:

Adam: Hey Court, how’s it going?
Court: Hey Adam, it’s going good. How are you?
Adam: Thanks for joining me today. I’m excited to talk to you about architecting on AWS

Homophone Replacement

Input:

Adam: Have you ex have you experienced uh my sequel attacks in the past or sequel injections ?
Court: Yes .

Output:

Adam: Have you experienced SQL injections in the past?
Court: Yes.

Filler Word Removal

Input:

Adam: Ok , great . Uh All right . And then are you guys bound by any compliance standards like PC I DS S or um you know GDR ? Uh what’s another one ? Any of those C just uh
Court: we are bound by fate , moderate complaints . So .
Adam: Ok . Well , you have to transition to fed ramp high at any time

Output:

Adam: Ok, great. And then are you guys bound by any compliance standards like PCI DSS or GDPR? What’s another one? Any of those? CJIS?
Court: We are bound by FedRAMP moderate compliance.
Adam: Ok. Will you have to transition to FedRAMP high at any time?

Prompt engineering for summarization

Now that the transcript has been created by Amazon Transcribe, diarized, and enhanced with Amazon Bedrock, the transcript can be summarized with Amazon Bedrock and an LLM. A simple version might look like the following:

Human:
You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the transcript.
Transcript: ${transcript}

Assistant:

Although this will work, it can be improved with additional instructions. You can use XML tags to provide structure and clarity to the task requirements:

Human: You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the content within the <instructions> tags.
<transcript> ${transcript} </transcript>

<instructions>
– Go over the conversation that was had in the transcript.
– Create a summary based on what occurred in the meeting.
– Highlight specific action items that came up in the meeting, including follow-up tasks for each person.
</instructions>

Assistant:

Because meetings often involve action items and date/time information, instructions are added to explicitly request the LLM to include this information. Because the LLM knows the speaker names, each person is assigned specific action items for them if any are found. To prevent hallucinations, an instruction is included that allows the LLM to fail gracefully.

Human: You are a transcript summarizing bot. You will go over the transcript below and provide a summary of the content within the <instructions> tags.

<transcript> ${transcript} </transcript>

<instructions>
– Go over the conversation that was had in the transcript.
– Create a summary based on what occurred in the meeting.
– Highlight specific action items that came up in the meeting, including follow-up tasks for each person.
– If relevant, focus on what specific AWS services were mentioned during the conversation.
– If there’s sufficient context, infer the speaker’s role and mention it in the summary. For instance, “Bob, the customer/designer/sales rep/…”
– Include important date/time, and indicate urgency if necessary (e.g., deadline/ETAs for outcomes and next steps)
</instructions>

Assistant: Should I add anything else in my answer?

Human: If there is not enough context to generate a proper summary, then just return a string that says “Meeting not long enough to generate a transcript.

Assistant:

Alternatively, we invite you to explore Amazon Transcribe Call Analytics generative call summarization for an out-of-the-box solution that integrates directly with Amazon Transcribe.

Prompt engineering for meeting invitation parsing

Included in this demo is a React-based UI that will start the process of the SMA joining the meeting. Because this demo supports multiple meeting types, the invitation must be parsed. Rather than parse this with a complex regular expression (regex), it will be processed with an LLM. This prompt will first identify the meeting type: Amazon Chime, Zoom, Google Meet, Microsoft Teams, or Cisco Webex. Based on the meeting type, the LLM will extract the meeting ID and dial-in information. Simply copy/paste the meeting invitation to the UI, and the invitation will be processed by the LLM to determine how to call the meeting provider. This can be done for a meeting that is currently happening or scheduled for a future meeting. See the following example:

Human: You are an information extracting bot. Go over the meeting invitation below and determine what the meeting id and meeting type are <instructions></instructions> xml tags

<meeting_invitation>
${meetingInvitation}
</meeting_invitation>

<instructions>
1. Identify Meeting Type:
Determine if the meeting invitation is for Chime, Zoom, Google, Microsoft Teams, or Webex meetings.

2. Chime, Zoom, and Webex
– Find the meetingID
– Remove all spaces from the meeting ID (e.g., #### ## #### -> ##########).

3. If Google – Instructions Extract Meeting ID and Dial in
– For Google only, the meeting invitation will call a meetingID a ‘pin’, so treat it as a meetingID
– Remove all spaces from the meeting ID (e.g., #### ## #### -> ##########).
– Extract Google and Microsoft Dial-In Number (if applicable):
– If the meeting is a Google meeting, extract the unique dial-in number.
– Locate the dial-in number following the text “to join by phone dial.”
– Format the extracted Google dial-in number as (+1 ###-###-####), removing dashes and spaces. For example +1 111-111-1111 would become +11111111111)

4. If Microsoft Teams – Instructions if meeting type is Microsoft Teams.
– Pay attention to these instructions carefully
– The meetingId we want to store in the generated response is the ‘Phone Conference ID’ : ### ### ###
– in the meeting invitation, there are two IDs a ‘Meeting ID’ (### ### ### ##) and a ‘Phone Conference ID’ (### ### ###), ignore the ‘Meeting ID’ use the ‘Phone Conference ID’
– The meetingId we want to store in the generated response is the ‘Phone Conference ID’ : ### ### ###
– The meetingID that we want is referenced as the ‘Phone Conference ID’ store that one as the meeting ID.
– Find the phone number, extract it and store it as the dialIn number (format (+1 ###-###-####), removing dashes and spaces. For example +1 111-111-1111 would become +11111111111)

5. meetingType rules
– The only valid responses for meetingType are ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Teams’

6. Generate Response:
– Create a response object with the following format:
{
meetingId: “meeting id goes here with spaces removed”,
meetingType: “meeting type goes here (options: ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Teams’)”,
dialIn: “Insert Google/Microsoft Teams Dial-In number with no dashes or spaces, or N/A if not a Google/Microsoft Teams Meeting”
}

Meeting ID Formats:
Zoom: ### #### ####
Webex: #### ### ####
Chime: #### ## ####
Google: ### ### ####
Teams: ### ### ###

Ensure that the program does not create fake phone numbers and only includes the Microsoft or Google dial-in number if the meeting type is Google or Teams.

</instructions>

Assistant: Should I add anything else in my answer?

Human: Only return a JSON formatted response with the meetingid and meeting type associated to it. Do not add any other words to your answer. Do not add any introductory sentences in your answer.

Assistant:

With this information extracted from the invitation, a call is placed to the meeting provider so that the SMA can join the meeting as a participant.

Clean up

If you deployed this sample solution, clean up your resources by destroying the AWS CDK application from the AWS Command Line Interface (AWS CLI). This can be done using the following command:

yarn cdk destroy

Conclusion

In this post, we showed how to enhance Amazon Transcribe with an LLM using Amazon Bedrock by extracting information that would otherwise be difficult for a regex to extract. We also used this method to extract information from a meeting invitation sent from an unknown source. Finally, we showed how to use an LLM to provide a summarization of the meeting using detailed instructions to produce action items and include date/time information in the response.

We invite you to deploy this demo into your own account. We’d love to hear from you. Let us know what you think in the issues forum of the Amazon Chime SDK Meeting Summarizer GitHub repository. Alternatively, we invite you to explore other methods for meeting transcription and summarization, such as Amazon Live Meeting Assistant, which uses a browser extension to collect call audio.

About the authors

Adam Neumiller is a Solutions Architect for AWS. He is focused on helping public sector customers drive cloud adoption through the use of infrastructure as code. Outside of work, he enjoys spending time with his family and exploring the great outdoors.

Court Schuett is a Principal Specialist SA – GenAI focused on third party models and how they can be used to help solve customer problems. When he’s not coding, he spends his time exploring parks, traveling with his family, and listening to music.

Christopher Lott is a Principal Solutions Architect in the AWS AI Language Services team. He has 20 years of enterprise software development experience. Chris lives in Sacramento, California, and enjoys gardening, cooking, aerospace/general aviation, and traveling the world.

Hang Su is a Senior Applied Scientist at AWS AI. He has been leading AWS Transcribe Contact Lens Science team. His interest lies in call-center analytics, LLM-based abstractive summarization, and general conversational AI.

Jason Cai is an Applied Scientist at AWS AI. He has made contributions to AWS Bedrock, Contact Lens, Lex and Transcribe. His interests include LLM agents, dialogue summarization, LLM prediction refinement, and knowledge graph.

Edgar Costa Filho is a Senior Cloud Infrastructure Architect with a focus on Foundations and Containers, including expertise in integrating Amazon EKS with open-source tooling like Crossplane, Terraform, and GitOps. In his role, Edgar is dedicated to assisting customers in achieving their business objectives by implementing best practices in cloud infrastructure design and management.

Implement serverless semantic search of image and live video with Amazon Titan Multimodal Embeddings

June 3, 2024

by Thorben Sanktjohanser Amazon AWS

In today’s data-driven world, industries across various sectors are accumulating massive amounts of video data through cameras installed in their warehouses, clinics, roads, metro stations, stores, factories, or even private facilities. This video data holds immense potential for analysis and monitoring of incidents that may occur in these locations. From fire hazards to broken equipment, theft, or accidents, the ability to analyze and understand this video data can lead to significant improvements in safety, efficiency, and profitability for businesses and individuals.

This data allows for the derivation of valuable insights when combined with a searchable index. However,traditional video analysis methods often rely on manual, labor-intensive processes, making it challenging to scale and efficient. In this post, we introduce semantic search, a technique to find incidents in videos based on natural language descriptions of events that occurred in the video. For example, you could search for “fire in the warehouse” or “broken glass on the floor.” This is where multi-modal embeddings come into play. We introduce the use of the Amazon Titan Multimodal Embeddings model, which can map visual as well as textual data into the same semantic space, allowing you to use textual description and find images containing that semantic meaning. This semantic search technique allows you to analyze and understand frames from video data more effectively.

We walk you through constructing a scalable, serverless, end-to-end semantic search pipeline for surveillance footage with Amazon Kinesis Video Streams, Amazon Titan Multimodal Embeddings on Amazon Bedrock, and Amazon OpenSearch Service. Kinesis Video Streams makes it straightforward to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing. It enables real-time video ingestion, storage, encoding, and streaming across devices. Amazon Bedrock is a fully managed service that provides access to a range of high-performing foundation models from leading AI companies through a single API. It offers the capabilities needed to build generative AI applications with security, privacy, and responsible AI. Amazon Titan Multimodal Embeddings, available through Amazon Bedrock, enables more accurate and contextually relevant multimodal search. It processes and generates information from distinct data types like text and images. You can submit text, images, or a combination of both as input to use the model’s understanding of multimodal content. OpenSearch Service is a fully managed service that makes it straightforward to deploy, scale, and operate OpenSearch. OpenSearch Service allows you to store vectors and other data types in an index, and offers sub second query latency even when searching billions of vectors and measuring the semantical relatedness, which we use in this post.

We discuss how to balance functionality, accuracy, and budget. We include sample code snippets and a GitHub repo so you can start experimenting with building your own prototype semantic search solution.

Overview of solution

The solution consists of three components:

First, you extract frames of a live stream with the help of Kinesis Video Streams (you can optionally extract frames of an uploaded video file as well using an AWS Lambda function). These frames can be stored in an Amazon Simple Storage Service (Amazon S3) bucket as files for later processing, retrieval, and analysis.
In the second component, you generate an embedding of the frame using Amazon Titan Multimodal Embeddings. You store the reference (an S3 URI) to the actual frame and video file, and the vector embedding of the frame in OpenSearch Service.
Third, you accept a textual input from the user to create an embedding using the same model and use the API provided to query your OpenSearch Service index for images using OpenSearch’s intelligent vector search capabilities to find images that are semantically similar to your text based on the embeddings generated by the Amazon Titan Multimodal Embeddings model.

This solution uses Kinesis Video Streams to handle any volume of streaming video data without consumers provisioning or managing any servers. Kinesis Video Streams automatically extracts images from video data in real time and delivers the images to a specified S3 bucket. Alternatively, you can use a serverless Lambda function to extract frames of a stored video file with the Python OpenCV library.

The second component converts these extracted frames into vector embeddings directly by calling the Amazon Bedrock API with Amazon Titan Multimodal Embeddings.

Embeddings are a vector representation of your data that capture semantic meaning. Generating embeddings of text and images using the same model helps you measure the distance between vectors to find semantic similarities. For example, you can embed all image metadata and additional text descriptions into the same vector space. Close vectors indicate that the images and text are semantically related. This allows for semantic image search—given a text description, you can find relevant images by retrieving those with the most similar embeddings, as represented in the following visualization.

Starting December 2023, you can use the Amazon Titan Multimodal Embeddings model for use cases like searching images by text, image, or a combination of text and image. It produces 1,024-dimension vectors (by default), enabling highly accurate and fast search capabilities. You can also configure smaller vector sizes to optimize for cost vs. accuracy. For more information, refer to Amazon Titan Multimodal Embeddings G1 model.

The following diagram visualizes the conversion of a picture to a vector representation. You split the video files into frames and save them in a S3 bucket (Step 1). The Amazon Titan Multimodal Embeddings model converts these frames into vector embeddings (Step 2). You store the embeddings of the video frame as a k-nearest neighbors (k-NN) vector in your OpenSearch Service index with the reference to the video clip and the frame in the S3 bucket itself (Step 3). You can add additional descriptions in an additional field.

The following diagram visualizes the semantic search with natural language processing (NLP). The third component allows you to submit a query in natural language (Step 1) for specific moments or actions in a video, returning a list of references to frames that are semantically similar to the query. The Amazon Titan MultimodalEmbeddings model (Step 2) converts the submitted text query into a vector embedding (Step 3). You use this embedding to look up the most similar embeddings (Step 4). The stored references in the returned results are used to retrieve the frames and video clip to the UI for replay (Step 5).

The following diagram shows our solution architecture.

The workflow consists of the following steps:

You stream live video to Kinesis Video Streams. Alternatively, upload existing video clips to an S3 bucket.
Kinesis Video Streams extracts frames from the live video to an S3 bucket. Alternatively, a Lambda function extracts frames of the uploaded video clips.
Another Lambda function collects the frames and generates an embedding with Amazon Bedrock.
The Lambda function inserts the reference to the image and video clip together with the embedding as a k-NN vector into an OpenSearch Service index.
You submit a query prompt to the UI.
A new Lambda function converts the query to a vector embedding with Amazon Bedrock.
The Lambda function searches the OpenSearch Service image index for any frames matching the query and the k-NN for the vector using cosine similarity and returns a list of frames.
The UI displays the frames and video clips by retrieving the assets from Kinesis Video Streams using the saved references of the returned results. Alternatively, the video clips are retrieved from the S3 bucket.

This solution was created with AWS Amplify. Amplify is a development framework and hosting service that assists frontend web and mobile developers in building secure and scalable applications with AWS tools quickly and efficiently.

Optimize for functionality, accuracy, and cost

Let’s conduct an analysis of this proposed solution architecture to determine opportunities for enhancing functionality, improving accuracy, and reducing costs.

Starting with the ingestion layer, refer to Design considerations for cost-effective video surveillance platforms with AWS IoT for Smart Homes to learn more about cost-effective ingestion into Kinesis Video Streams.

The extraction of video frames in this solution is configured using Amazon S3 delivery with Kinesis Video Streams. A key trade-off to evaluate is determining the optimal frame rate and resolution to meet the use case requirements balanced with overall system resource utilization. The frame extraction rate can range from as high as five frames per second to as low as one frame every 20 seconds. The choice of frame rate can be driven by the business use case, which directly impacts embedding generation and storage in downstream services like Amazon Bedrock, Lambda, Amazon S3, and the Amazon S3 delivery feature, as well as searching within the vector database. Even when uploading pre-recorded videos to Amazon S3, thoughtful consideration should still be given to selecting an appropriate frame extraction rate and resolution. Tuning these parameters allows you to balance your use case accuracy needs with consumption of the mentioned AWS services.

The Amazon Titan Multimodal Embeddings model outputs a vector representation with an default embedding length of 1,024 from the input data. This representation carries the semantic meaning of the input and is best to compare with other vectors for optimal similarity. For best performance, it’s recommended to use the default embedding length, but it can have direct impact on performance and storage costs. To increase performance and reduce costs in your production environment, alternate embedding lengths can be explored, such as 256 and 384. Reducing the embedding length also means losing some of the semantic context, which has a direct impact on accuracy, but improves the overall speed and optimizes the storage costs.

OpenSearch Service offers on-demand, reserved, and serverless pricing options with general purpose or storage optimized machine types to fit different workloads. To optimize costs, you should select reserved instances to cover your production workload base, and use on-demand, serverless, and convertible reservations to handle spikes and non-production loads. For lower-demand production workloads, a cost-friendly alternate option is using pgvector with Amazon Aurora PostgreSQL Serverless, which offers lower base consumption units as compared to Amazon OpenSearch Serverless, thereby lowering the cost.

Determining the optimal value of K in the k-NN algorithm for vector similarity search is significant for balancing accuracy, performance, and cost. A larger K value generally increases accuracy by considering more neighboring vectors, but comes at the expense of higher computational complexity and cost. Conversely, a smaller K leads to faster search times and lower costs, but may lower result quality. When using the k-NN algorithm with OpenSearch Service, it’s essential to carefully evaluate the K parameter based on your application’s priorities—starting with smaller values like K=5 or 10, then iteratively increasing K if higher accuracy is needed.

As part of the solution, we recommend Lambda as the serverless compute option to process frames. With Lambda, you can run code for virtually any type of application or backend service—all with zero administration. Lambda takes care of everything required to run and scale your code with high availability.

With high amounts of video data, you should consider binpacking your frame processing tasks and running a batch computing job to access a large amount of compute resources. The combination of AWS Batch and Amazon Elastic Container Service (Amazon ECS) can efficiently provision resources in response to jobs submitted in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly.

You will incur costs when deploying the GitHub repo in your account. When you are finished examining the example, follow the steps in the Clean up section later in this post to delete the infrastructure and stop incurring charges.

Refer to the README file in the repository to understand the building blocks of the solution in detail.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account with sufficient AWS Identity and Access Management (IAM) permissions
Model access enabled for Amazon Bedrock Titan Multimodal Embeddings G1
The AWS Command Line Interface (AWS CLI) installed
The AWS Amplify CLI set up

Deploy the Amplify application

Complete the following steps to deploy the Amplify application:

Clone the repository to your local disk with the following command:

git clone https://github.com/aws-samples/Serverless-Semantic-Video-Search-Vector-Database-and-a-Multi-Modal-Generative-Al-Embeddings-Model

Change the directory to the cloned repository.
Initialize the Amplify application:
```
amplify init
```
Clean install the dependencies of the web application:
```
npm ci
```
Create the infrastructure in your AWS account:
```
amplify push
```
Run the web application in your local environment:
```
npm run dev
```

Create an application account

Complete the following steps to create an account in the application:

Open the web application with the stated URL in your terminal.
Enter a user name, password, and email address.
Confirm your email address with the code sent to it.

Upload files from your computer

Complete the following steps to upload image and video files stored locally:

Choose File Upload in the navigation pane.
Choose Choose files.
Select the images or videos from your local drive.
Choose Upload Files.

Upload files from a webcam

Complete the following steps to upload images and videos from a webcam:

Choose Webcam Upload in the navigation pane.
Choose Allow when asked for permissions to access your webcam.
Choose to either upload a single captured image or a captured video:
1. Choose Capture Image and Upload Image to upload a single image from your webcam.
2. Choose Start Video Capture, Stop Video Capture, and finally
  Upload Video to upload a video from your webcam.

Search videos

Complete the following steps to search the files and videos you uploaded.

Choose Search in the navigation pane.
Enter your prompt in the Search Videos text field. For example, we ask “Show me a person with a golden ring.”
Lower the confidence parameter closer to 0 if you see fewer results than you were originally expecting.

The following screenshot shows an example of our results.

Clean up

Complete the following steps to clean up your resources:

Open a terminal in the directory of your locally cloned repository.
Run the following command to delete the cloud and local resources:
```
amplify delete
```

Conclusion

A multi-modal embeddings model has the potential to revolutionize the way industries analyze incidents captured with videos. AWS services and tools can help industries unlock the full potential of their video data and improve their safety, efficiency, and profitability. As the amount of video data continues to grow, the use of multi-modal embeddings will become increasingly important for industries looking to stay ahead of the curve. As innovations like Amazon Titan foundation models continue maturing, they will reduce the barriers to use advanced ML and simplify the process of understanding data in context. To stay updated with state-of-the-art functionality and use cases, refer to the following resources:

About the Authors

Thorben Sanktjohanser is a Solutions Architect at Amazon Web Services supporting media and entertainment companies on their cloud journey with his expertise. He is passionate about IoT, AI/ML and building smart home devices. Almost every part of his home is automated, from light bulbs and blinds to vacuum cleaning and mopping.

Talha Chattha is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Stockholm, serving key customers across EMEA. Talha holds a deep passion for generative AI technologies. He works tirelessly to deliver innovative, scalable, and valuable ML solutions in the space of large language models and foundation models for his customers. When not shaping the future of AI, he explores scenic European landscapes and delicious cuisines.

Victor Wang is a Sr. Solutions Architect at Amazon Web Services, based in San Francisco, CA, supporting innovative healthcare startups. Victor has spent 6 years at Amazon; previous roles include software developer for AWS Site-to-Site VPN, AWS ProServe Consultant for Public Sector Partners, and Technical Program Manager for Amazon RDS for MySQL. His passion is learning new technologies and traveling the world. Victor has flown over a million miles and plans to continue his eternal journey of exploration.

Akshay Singhal is a Sr. Technical Account Manager at Amazon Web Services, based in San Francisco Bay Area, supporting enterprise support customers focusing on the security ISV segment. He provides technical guidance for customers to implement AWS solutions, with expertise spanning serverless architectures and cost-optimization. Outside of work, Akshay enjoys traveling, Formula 1, making short movies, and exploring new cuisines.

Prioritizing employee well-being: An innovative approach with generative AI and Amazon SageMaker Canvas

June 3, 2024

by Rushabh Lokhande Amazon AWS

In today’s fast-paced corporate landscape, employee mental health has become a crucial aspect that organizations can no longer overlook. Many companies recognize that their greatest asset lies in their dedicated workforce, and each employee plays a vital role in collective success. As such, promoting employee well-being by creating a safe, inclusive, and supportive environment is of utmost importance.

However, quantifying and assessing mental health can be a daunting task. Traditional methods like employee well-being surveys or manual approaches may not always provide the most accurate or actionable insights. In this post, we explore an innovative solution that uses Amazon SageMaker Canvas for mental health assessment at the workplace.

We delve into the following topics:

The importance of mental health in the workplace
An overview of the SageMaker Canvas low-code no-code platform for building machine learning (ML) models
The mental health assessment model:
- Data preparation using the chat feature
- Training the model on SageMaker Canvas
- Model evaluation and performance metrics
Deployment and integration:
- Deploying the mental health assessment model
- Integrating the model into workplace wellness programs or HR systems

In this post, we use a dataset from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace, then we aggregate and prepare data for an ML model using Amazon SageMaker Data Wrangler for a tabular dataset on SageMaker Canvas. Then we train, build, test, and deploy the model using SageMaker Canvas, without writing any code.

Discover how SageMaker Canvas can revolutionize the way organizations approach employee mental health assessment, empowering them to create a more supportive and productive work environment. Stay tuned for insightful content that could reshape the future of workplace well-being.

Importance of mental health

Maintaining good mental health in the workplace is crucial for both employees and employers. In today’s fast-paced and demanding work environment, the mental well-being of employees can have a significant impact on productivity, job satisfaction, and overall company success. At Amazon, where innovation and customer obsession are at the core of our values, we understand the importance of fostering a mentally healthy workforce.

By prioritizing the mental well-being of our employees, we create an environment where they can thrive and contribute their best. This helps us deliver exceptional products and services. Amazon supports mental health by providing access to resources and support services. All U.S. employees and household members are eligible to receive five free counseling sessions, per issue every year, via Amazon’s Global Employee Assistance Program (EAP), Resources for Living. Employees can also access mental health care 24/7 through a partnership with the app Twill—a digital, self-guided mental health program. Amazon also partners with Brightline, a leading provider in virtual mental health support for children and teens.

Solution overview

SageMaker Canvas brings together a broad set of capabilities to help data professionals prepare, build, train, and deploy ML models without writing any code. SageMaker Data Wrangler has also been integrated into SageMaker Canvas, reducing the time it takes to import, prepare, transform, featurize, and analyze data. In a single visual interface, you can complete each step of a data preparation workflow: data selection, cleansing, exploration, visualization, and processing. Custom Spark commands can also expand the over 300 built-in data transformations. The built-in Data Quality and Insights report guides you in performing appropriate data cleansing, verifying data quality, and detecting anomalies such as duplicate rows and target leakage. Other analyses are also available to help you visualize and understand your data.

In this post, we try to understand the factors contributing to the mental health of an employee in the tech industry in a systematic manner. We begin by understanding the feature columns, presented in the following table.

Survey Attribute	Survey Attribute Description
`Timestamp`	Timestamp when survey was taken
`Age`	Age of person taking survey
`Gender`	Gender of person taking survey
`Country`	Country of person taking survey
`state`	If you live in the United States, which state or territory do you live in?
`self_employed`	Are you self-employed?
`family_history`	Do you have a family history of mental illness?
`treatment`	Have you sought treatment for a mental health condition?
`work_interfere`	If you have a mental health condition, do you feel that it interferes with your work?
`no_employees`	How many employees does your company or organization have?
`remote_work`	Do you work remotely (outside of an office) at least 50% of the time?
`tech_company`	Is your employer primarily a tech company/organization?
`benefits`	Does your employer provide mental health benefits?
`care_options`	Do you know the options for mental health care your employer provides?
`wellness_program`	Has your employer ever discussed mental health as part of an employee wellness program?
`seek_help`	Does your employer provide resources to learn more about mental health issues and how to seek help?
`anonymity`	Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
`leave`	How easy is it for you to take medical leave for a mental health condition?
`mentalhealthconsequence`	Do you think that discussing a mental health issue with your employer would have negative consequences?
`physhealthconsequence`	Do you think that discussing a physical health issue with your employer would have negative consequences?
`coworkers`	Would you be willing to discuss a mental health issue with your coworkers?
`physhealthinterview`	Would you bring up a physical health issue with a potential employer in an interview?
`mentalvsphysical`	Do you feel that your employer takes mental health as seriously as physical health?
`obs_consequence`	Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
`comments`	Any additional notes or comments

Prerequisites

You should complete the following prerequisites before building this model:

Have access to an AWS account
Set up a SageMaker domain
Download the sample dataset

Log in to SageMaker Canvas

When the initial setup is complete, you can access SageMaker Canvas with any of the following methods, depending on your environment’s setup:

On the SageMaker console, choose Canvas in the navigation pane. Choose your user from the dropdown menu and launch the SageMaker Canvas application.
To use Amazon SageMaker Studio, navigate to the SageMaker Studio interface. Go to the SageMaker Canvas page and launch the SageMaker Canvas application.
You can also use your organization’s SAML 2.0-based single sign-on (SSO) methods, such as AWS IAM Identity Center. To learn more, see Secure access to Amazon SageMaker Studio with AWS SSO and a SAML application.

Import the dataset into SageMaker Canvas

In SageMaker Canvas, you can see quick actions to get started building and using ML and generative artificial intelligence (AI) models, with a no code platform. Feel free to explore any of the out-of-the-box models.

We start from creating a data flow. A data flow in SageMaker Canvas is used to build a data preparation pipeline that can be scheduled to automatically import, prepare, and feed into a model build. With a data flow, you can prepare data using generative AI, over 300 built-in transforms, or custom Spark commands.

Complete the following steps:

Choose Prepare and analyze data.
For Data flow name, enter a name (for example, AssessingMentalHealthFlow).
Choose Create.

SageMaker Data Wrangler will open.

You can import data from multiple sources, ranging from AWS services, such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift, to third-party or partner services, including Snowflake or Databricks. To learn more about importing data to SageMaker Canvas, see Import data into Canvas.

Choose Import data, then choose Tabular.
Upload the dataset you downloaded in the prerequisites section.

After a successful import, you will be presented with a preview of the data, which you can browse.

Choose Import data to finish this step.

Run a Data Quality and Insights report

After you import the dataset, the SageMaker Data Wrangler data flow will open. You can run a Data Quality and Insights Report, which will perform an analysis of the data to determine potential issues to address during data preparation. Complete the following steps:

Choose Run Data quality and insights report.

For Analysis name, enter a name.
For Target column, choose treatment.
For Problem type, select Classification.
For Data size, choose Sampled dataset.
Choose Create.

You are presented with the generated report, which details any high priority warnings, data issues, and other insights to be aware of as you add data transformations and move along the model building process.

In this specific dataset, we can see that there are 27 features of different types, very little missing data, and no duplicates. To dive deeper into the report, refer to Get Insights On Data and Data Quality. To learn about other available analyzes, see Analyze and Visualize.

Prepare your data

As expected in the ML process, your dataset may require transformations to address issues such as missing values, outliers, or perform feature engineering prior to model building. SageMaker Canvas provides ML data transforms to clean, transform, and prepare your data for model building without having to write code. The transforms used are added to the model recipe, a record of the data preparation done on your data before building the model. You can refer to these advanced transformations and add them as transformation steps within your Data Wrangler flow.

Alternatively, you can use SageMaker Canvas to chat with your data and add transformations. We explore this option with some examples on our sample dataset.

Use the chat feature for exploratory analysis and building transformations

Before you use the chat feature to prepare data, note the following:

Chat for data prep requires the AmazonSageMakerCanvasAIServicesAccess policy. For more information, see AWS managed policy: AmazonSageMakerCanvasAIServicesAccess.
Chat for data prep requires access to Amazon Bedrock and the Anthropic Claude v2 model within it. For more information, see Model access.
You must run SageMaker Canvas data prep in the same AWS Region as the Region where you’re running your model. Chat for data prep is available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) Regions.

To chat with your data, complete the following steps:

Open your SageMaker Canvas data flow.
Open your dataset by choosing Source or Data types.

Choose Chat for data prep and specify your prompts in the chat window.

Optionally, if an analysis has been generated by your query, choose Add to analyses to reference it for later.
Optionally, if you’ve transformed your data using a prompt, do the following:

Choose Preview to view the results.
Optionally modify the code in the transform and choose Update.
If you’re happy with the results of the transform, choose Add to steps to add it to the steps pane.

Let’s try a few exploratory analyses and transformations through the chat feature.

In the following example, we ask “How many rows does the dataset have?”

In the following example, we drop the columns Timestamp, Country, state, and comments, because these features will have least impact for classification of our model. Choose View code to see the generated Spark code that performs the transformation, then choose Add to steps to add the transformation to the data flow.

You can provide a name and choose Update to save the data flow.

In the next example, we ask “Show me all unique ages sorted.”

Some ages are negative, so we should filter on valid ages. We drop rows with age below 0 or more than 100 and add this to the steps.

In the following example, we ask “Create a bar chart for null values in the dataset.”

Then we ask for a bar chart for the treatment column.

In the following example, we ask for a bar chart for the work_interfere column.

In the column work_interfere, we replace the NA values with “Don’t know.” We want to make the model weight missing values just as it weights people that have replied “Don’t know.”

For the column self_employed, we want to replace NA with “No” to make the model weight missing values just as it weights people that have replied “NA.”

You can choose to add any other transformations as needed. If you’ve followed the preceding transformations, your steps should look like the following screenshot.

Perform an analysis on the transformed data

Now that transformations have been done on the data, you may want to perform analyses to make sure they haven’t affected data integrity.

To do so, navigate to the Analyses tab to create an analysis. For this example, we create a feature correlation analysis with the correlation type linear.

The analysis report will generate a correlation matrix. The correlation matrix measures the positive or negative correlation of features among themselves, between each other. A value closer to 1 means positive correlation, and a value closer to -1 means negative correlation.

Linear feature correlation is based on Pearson’s correlation. To find the relationship between a numeric variable (like age or income) and a categorical variable (like gender or education level), we first assign numeric values to the categories in a way that allows them to best predict the numeric variable. Then we calculate the correlation coefficient, which measures how strongly the two variables are related.

Linear categorical to categorical correlation is not supported.

Numeric to numeric correlation is in the range [-1, 1], where 0 implies no correlation, 1 implies perfect correlation, and -1 implies perfect inverse correlation. Numeric to categorical and categorical to categorical correlations are in the range [0, 1], where 0 implies no correlation and 1 implies perfect correlation.

Features that are not either numeric or categorical are ignored.

The following table lists for each feature what is the most correlated feature to it.

Feature	Most Correlated Feature	Correlation
`Age` (numeric)	Gender (categorical)	0.248216
`Gender` (categorical)	Age (numeric)	0.248216
`seek_help` (categorical)	Age (numeric)	0.175808
`no_employees` (categorical)	Age (numeric)	0.166486
`benefits` (categorical)	Age (numeric)	0.157729
`remote_work` (categorical)	Age (numeric)	0.139105
`care_options` (categorical)	Age (numeric)	0.1183
`wellness_program` (categorical)	Age (numeric)	0.117175
`phys_health_consequence` (categorical)	Age (numeric)	0.0961159
`work_interfere` (categorical)	Age (numeric)	0.0797424
`treatment` (categorical)	Age (numeric)	0.0752661
`mental_health_consequence` (categorical)	Age (numeric)	0.0687374
`obs_consequence` (categorical)	Age (numeric)	0.0658778
`phys_health_interview` (categorical)	Age (numeric)	0.0639178
`self_employed` (categorical)	Age (numeric)	0.0628861
`tech_company` (categorical)	Age (numeric)	0.0609773
`leave` (categorical)	Age (numeric)	0.0601671
`mental_health_interview` (categorical)	Age (numeric)	0.0600251
`mental_vs_physical` (categorical)	Age (numeric)	0.0389857
`anonymity` (categorical)	Age (numeric)	0.038797
`coworkers` (categorical)	Age (numeric)	0.0181036
`supervisor` (categorical)	Age (numeric)	0.0167315
`family_history` (categorical)	Age (numeric)	0.00989271

The following figure shows our correlation matrix.

You can explore more analyses of different types. For more details, see Explore your data using visualization techniques.

Export the dataset and create a model

Return to the main data flow and run the SageMaker Data Wrangler validation flow. Upon successful validation, you are ready to export the dataset for model training.

Next, you export your dataset and build an ML model on top of it. Complete the following steps:

Open the expanded menu in the final transformation and choose Create model.

For Dataset name, enter a name.
Choose Export.

At this point, your mental health assessment dataset is ready for model training and testing.

Choose Create model.

For Model name, enter a name.
For Problem type, select Predictive analysis.

SageMaker Canvas suggested this based on the dataset, but you can override this for your own experimentation. For more information about ready-to-use models provided by SageMaker Canvas, see Use Ready-to-use models.

Choose Create.

For Target column, choose treatment as the column to predict.

Because Yes or No is predicted, SageMaker Canvas detected this is a two-category prediction model.

Choose Configure model to set configurations.

For Objective metric, leave as the default F1.

F1 averages two important metrics: precision and recall.

For Training method, select Auto.

This option selects the algorithm most relevant to your dataset and the best range of hyperparameters to tune model candidates. Alternatively, you could use the ensemble or hyperparameter optimization training options. For more information, see Training modes and algorithm support.

For Data split, specify an 80/20 configuration for training and validation, respectively.

Choose Save and then Preview model to generate a preview.

This preview runs on subset of data and provides information on estimated model accuracy and feature importance. Based on the results, you may still apply additional transformations to improve the estimated accuracy.

Although low impact features might add noise to the model, these may still be useful to describe situations specific to your use case. Always combine predictive power with your own context to determine which features to include.

You’re now ready to build the full model with either Quick build or Standard build. Quick build only supports datasets with fewer than 50,000 rows and prioritizes speed over accuracy, training fewer combinations of models and hyperparameters, for rapid prototyping or proving out value. Standard build prioritizes accuracy and is necessary for exporting the full Jupyter notebook used for training.

For this post, choose Standard build.

To learn more about how SageMaker Canvas uses training and validation datasets, see Evaluating Your Model’s Performance in Amazon SageMaker Canvas and SHAP Baselines for Explainability.

Your results may differ from those in this post. Machine learning introduces stochasticity in the model training process, which can lead to slight variations.

Here, we’ve built a model that will predict with about 87% accuracy whether an individual will seek mental health treatment. At this stage, think about how you could achieve a practical impact from the Machine Learning model. For example, here an organization may consider how they can apply the model to preemptively support individuals who’s attributes suggest they would seek treatment.

Review model metrics

Let’s focus on the first tab, Overview. Here, Column impact is the estimated importance of each attribute in predicting the target. Information here can help organizations gain insights that lead to actions based on the model. For example, we see that the work_interfere column has the most significant impact in predication for treatment. Additionally, better benefits and care_options increase the likelihood of employees opting in to treatment.

On the Scoring tab, we can visualize a Sankey (or ribbon) plot of the distribution of predicted values with respect to actual values, providing insight into how the model performed during validation.

For more detailed insights, we look at the Advanced metrics tab for metric values the model may have not been optimized for, the confusion matrix, and precision recall curve.

The advanced metrics suggest we can trust the resulting model. False positives (predicting an employee will opt in for treatment when they actually don’t) and false negatives (predicting an employee will opt out when they actually opt in) are low. High numbers for either may make us skeptical about the current build and more likely to revisit previous steps.

Test the model

Now let’s use the model for making predictions. Choose Predict to navigate to the Predict tab. SageMaker Canvas allows you to generate predictions in two forms:

Single prediction (single “what-if scenario”)
Batch prediction (multiple scenarios using a CSV file)

For a first test, let’s try a single prediction. Wait a few seconds for the model to load, and now you’re ready to generate new inferences. You can change the values to experiment with the attributes and their impact.

For example, let’s make the following updates:

Change work_interfere from Often to Sometimes
Change benefits from Yes to No

Choose Update and see if the treatment prediction is affected.

In SageMaker Canvas, you can generate batch predictions either manually or automatically on a schedule. Let’s try the manual approach. To learn about automating batch predictions, refer to Automate batch predictions.

In practice, use a dataset different from training for testing predictions. For this example though, lets use the same file as before. Be sure to remove the work_interfere column.
Choose Batch prediction and upload the downloaded file.
Choose Generate predictions.
When it’s complete, choose View to see the predictions.

Deploy the model

The final (optional) step of the SageMaker Canvas workflow for ML models is deploying the model. This uses SageMaker real-time inference endpoints to host the SageMaker Canvas model and expose an HTTPS endpoint for use by applications or developers.

On the Deploy tab, choose Create deployment.
For Deployment name, enter a name.
For Instance type, choose an instance (for this post, ml.m5.2xlarge).
Set Instance count to 1.
Choose Deploy.

This instance configuration is sufficient for the demo. You can change the configuration later from the SageMaker Canvas UI or using SageMaker APIs. To learn more about auto scaling such workloads, see Automatically Scale Amazon SageMaker Models.

After the deployment is successful, you can invoke the endpoint using AWS SDKs or direct HTTPs calls. For more information, see Deploy models for real-time inference.

To learn more about model deployment, refer to Deploy your Canvas models to a SageMaker Endpoint and Deploy models for real-time inference.

Clean up

Make sure to log out from SageMaker Canvas by choosing Log out. Logging out of the SageMaker Canvas application will release all resources used by the workspace instance, therefore avoiding incurring additional unintended charges.

Summary

Mental health is a dynamic and evolving field, with new research and insights constantly emerging. Staying up to date with the latest developments and best practices can be challenging, especially in a public forum. Additionally, when discussing mental health, it’s essential to approach the topic with sensitivity, respect, and a commitment to providing accurate and helpful information.

In this post, we showcased an ML approach to building a mental health model using a sample dataset and SageMaker Canvas, a low-code no-code platform from AWS. This can serve as guidance for organizations looking to explore similar solutions for their specific needs. Implementing AI to assess employee mental health and offer preemptive support can yield a myriad of benefits. By promoting detection of potential mental health needs, intervention can be more personalized and reduce the risk of drastic complications in the future. A proactive approach can also enhance employee morale and productivity, mitigating the likelihood of absenteeism, turnover and ultimately leads to a healthier and more resilient workforce.. Overall, using AI for mental health prediction and support signifies a commitment to nurturing a supportive work environment where employees can thrive.

To explore more about SageMaker Canvas with industry-specific use cases, explore a hands-on workshop. To learn more about SageMaker Data Wrangler in SageMaker Canvas, refer to Prepare Data. You can also refer to the following YouTube video to learn more about the end-to-end ML workflow with SageMaker Canvas.

Although this post provides a technical perspective, we strongly encourage readers who are struggling with mental health issues to seek professional help. Remember, there is always help available for those who ask.

Together, let’s take a proactive step towards empowering mental health awareness and supporting those in need.

About the Authors

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, analytics solutions, and generative AI implementations. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data analytics solutions and generative AI implementations. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Ryan Gomes is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He is passionate about helping customers achieve better outcomes through analytics, machine learning, and generative AI solutions in the cloud. Outside of work, he enjoys fitness, cooking, and spending quality time with friends and family.

How Project P.I. helps Amazon remove imperfect products

June 3, 2024

by Amazon AWS

A combination of generative AI and computer vision imaging tunnels is helping Amazon proactively improve the customer experience.Read More

Pre-training genomic language models using AWS HealthOmics and Amazon SageMaker

May 31, 2024

by Shamika Ariyawansa Amazon AWS

Genomic language models are a new and exciting field in the application of large language models to challenges in genomics. In this blog post and open source project, we show you how you can pre-train a genomics language model, HyenaDNA, using your genomic data in the AWS Cloud. Here, we use AWS HealthOmics storage as a convenient and cost-effective omic data store and Amazon Sagemaker as a fully managed machine learning (ML) service to train and deploy the model.

Genomic language models

Genomic language models represent a new approach in the field of genomics, offering a way to understand the language of DNA. These models use the transformer architecture, a type of natural language processing (NLP), to interpret the vast amount of genomic information available, allowing researchers and scientists to extract meaningful insights more accurately than with existing in silico approaches and more cost-effectively than with existing in situ techniques.

By bridging the gap between raw genetic data and actionable knowledge, genomic language models hold immense promise for various industries and research areas, including whole-genome analysis, delivered care, pharmaceuticals, and agriculture. They facilitate the discovery of novel gene functions, the identification of disease-causing mutations, and the development of personalized treatment strategies, ultimately driving innovation and advancement in genomics-driven fields. The ability to effectively analyze and interpret genomic data at scale is the key to precision medicine, agricultural optimization, and biotechnological breakthroughs, making genomic language models a possible new foundational technology in these industries.

Some of the pioneering genomic language models include

DNABERT which was one of the first attempts to use the transformer architecture to learn the language of DNA. DNABERT used a Bidirectional Encoder Representations from Transformers (BERT, encoder-only) architecture pre-trained on a human reference genome and showed promising results on downstream supervised tasks.
Nucleotide transformer has a similar architecture to DNABERT and showed that pre-training on more data and increasing the context window size improves the model’s accuracy on downstream tasks.
HyenaDNA uses the transformer architecture, like other genomic models, except that it replaces each self-attention layer with a Hyena operator. This widens the context window to allow processing of up to 1 million tokens, substantially more than prior models, allowing it to learn longer-range interactions in DNA.

In our exploration of cutting-edge models that push the boundaries of genetic sequence analysis, we focused on HyenaDNA. Pretrained HyenaDNA models are readily accessible on Hugging Face. This availability facilitates easy integration into existing projects or the starting point for new explorations in genetic sequence analysis.

AWS HealthOmics and sequence stores

AWS HealthOmics is a purpose-built service that helps healthcare and life science organizations and their software partners store, query, and analyze genomic, transcriptomic, and other omics data and then generate insights from that data to improve health and drive deeper biological understanding. It supports large-scale analysis and collaborative research through HealthOmics storage, analytics, and workflow capabilities.

With HealthOmics storage, a managed omics focused findable accessible, interoperable, and reusable (FAIR) data store, users can cost effectively store, organize, share, and access petabytes of bioinformatics data efficiently at a low cost per gigabase. HealthOmics sequence stores deliver cost savings through automatic tiering and compression of files based on usage, enable sharing and findability through the biologically focused metadata and provenance tracking, and provide instant access to frequently used data through low latency Amazon Simple Storage Service (Amazon S3) compatible APIs or HealthOmics native APIs. All of this is delivered by HealthOmics, removing the burden of managing compression, tiering, metadata, and file organization from customers.

Amazon SageMaker

Amazon SageMaker is a fully managed ML service offered by AWS, designed to reduce the time and cost associated with training and tuning ML models at scale.

With SageMaker Training, a managed batch ML compute service, users can efficiently train models without having to manage the underlying infrastructure. SageMaker notably supports popular deep learning frameworks, including PyTorch, which is integral to the solutions provided here.

SageMaker also provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs.

Solution overview

In this blog post we address pre-training a genomic language model on an assembled genome. This genomic data could be either public (for example, GenBank) or could be your own proprietary data. The following diagram illustrates the workflow:

We start with genomic data. For the purposes of this blog post, we’re using a public non-reference Mouse genome from GenBank. The dataset is part of The Mouse Genomes Project and represents a consensus genome sequence of inbred mouse strains. This type of genomic data could readily be interchanged with proprietary datasets that you might be working with in your research.
We use a SageMaker notebook to process the genomic files and to import these into a HealthOmics sequence store.
A second SageMaker notebook is used to start the training job on SageMaker.
Inside the managed training job in the SageMaker environment, the training job first downloads the mouse genome using the S3 URI supplied by HealthOmics.
Then the training job retrieves the checkpoint weights of the HyenaDNA model from Huggingface. These weights are pretrained on the human reference genome. This pretraining allows the model to understand and predict genomic sequences, providing a comprehensive baseline for further specialized training on a variety of genomic tasks.
Using these resources, the HyenaDNA model is trained, where it uses the mouse genome to refine its parameters. After pre-training is complete and validation results are satisfactory, the trained model is saved to Amazon S3.
Then we deploy that model as a SageMaker real-time inference endpoint.
Lastly the model is tested against a set of known genome sequences using some inference API calls.

Data preparation and loading into sequence store

The initial step in our machine learning workflow focuses on preparing the data. We start by uploading the genomic sequences into a HealthOmics sequence store. Although FASTA files are the standard format for storing reference sequences, we convert these to FASTQ format. This conversion is carried out to better reflect the format expected to store the assembled data of a sequenced sample.

In the sample Jupyter notebook we show how to download FASTA files from GenBank, convert them into FASTQ files, and then load them into a HealthOmics sequence store. You can skip this step If you already have your own genomic data in a sequence store.

Training on SageMaker

We use PyTorch and Amazon SageMaker script mode to train this model. Script mode’s compatibility with PyTorch was crucial, allowing us to use our existing scripts with minimal modifications. For the training, we extract the training data from the sequence store through the sequence store’s provided S3 URIs. You can, for example, use the boto3 library to obtain this S3 URI.

seq_store_id = "4308389581“

seq_store_info = omics.get_sequence_store(id=seq_store_id)
s3_uri = seq_store_info["s3Access"]["s3Uri"]
s3_arn = seq_store_info["s3Access"]["s3AccessPointArn"]
key_arn = seq_store_info["sseConfig"]["keyArn"]
s3_uri, s3_arn, key_arn

S3_DATA_URI = f"{s3_uri}readSet/"
S3_DATA_URI

When you provide this to the SageMaker estimator, the training job takes care of downloading the data from the sequence store through its S3 URI. Following Nguyen et al, we train on chromosomes 2, 4, 6, 8, X, and 14–19; cross-validate on chromosomes 1, 3, 12, and 13; and test on chromosomes 5, 7, and 9–11.

To maximize the training efficiency of our HyenaDNA model, we use distributed data parallel (DDP). DDP is a technique that facilitates the parallel processing of our training tasks across multiple GPUs. To efficiently implement DDP, we used the Hugging Face Accelerate library. Accelerate simplifies running distributed training by abstracting away the complexity typically associated with setting up DDP.

After you have defined your training script, you can configure and submit a SageMaker training job.

First, let’s define the hyperparameters, starting with model_checkpoint. This parameter refers to a HuggingFace model ID for a specific pre-trained model. Notably, the HyenaDNA model lineup includes checkpoints that can handle up to 1 million tokens. However, for demonstration purposes, we are using the hyenadna-small-32k-seqlen-hf model, which has a context window of 32,000 tokens, indicated by the max_length setting. It’s essential to understand that different model IDs and corresponding max_length settings can be selected to use models with smaller or larger context windows, depending on your computational needs and objectives.

The species parameter is set to mouse, specifying the type of organism the genomic training data represents.

hyperparameters = {
    "species" : "mouse",
    "epochs": 150,
    "model_checkpoint": MODEL_ID,
    "max_length": 32_000,
    "batch_size": 4,
    "learning_rate": 6e-4,
    "weight_decay" : 0.1,
    "log_level" : "INFO",
    "log_interval" : 100
}

Next, define what metrics, especially the training and validation perplexity, to capture from the training logs:

metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Name": "step", "Regex": "Step: ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "Train Loss: ([0-9.e-]*)"},
    {"Name": "train_perplexity", "Regex": "Train Perplexity: ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "Eval Average Loss: ([0-9.e-]*)"},
    {"Name": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"}
]

Finally, define a Pytorch estimator and submit a training job that refers to the data location obtained from the HealthOmics sequence store.

hyenaDNA_estimator = PyTorch(
    base_job_name=TRAINING_JOB_NAME,
    entry_point="train_hf_accelerate.py",
    source_dir="scripts/",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    image_uri=pytorch_image_uri,
    role=SAGEMAKER_EXECUTION_ROLE,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    sagemaker_session=sagemaker_session,
    distribution={"torch_distributed": {"enabled": True}},
    tags=[{"Key": "project", "Value": "genomics-model-pretraining"}],
    keep_alive_period_in_seconds=1800,
    tensorboard_output_config=tensorboard_output_config,
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hyenaDNA_estimator.fit(
        {
            "data": TrainingInput(
                s3_data=S3_DATA_URI, input_mode="File"
            ),
        },
        wait=True,
    )

Results

In our training cycle for the model, we processed a dataset consisting of one mouse genome with 10,000 entries. The computational resources included a cluster configured with one ml.g5.12xlarge instance, which houses four Nvidia A10G GPUs. The 32k sequence length model, was trained using a batch size of four per GPU (24 gigabit (Gb) of VRAM). With this setup we completed 150 epochs to report the results below.

Evaluation metrics: The evaluation perplexity and loss graphs show a downward trend at the outset, which then plateaus. The initial steep decrease indicates that the model rapidly learned from the training data, improving its predictive performance. As training progressed, the rate of improvement slowed, as evidenced by the plateau, which is typical in the later stages of training as the model converges.

Training Metrics: Similarly, the training perplexity and loss graphs indicate an initial sharp improvement followed by a gradual plateau. This shows that the model effectively learned from the data. The training loss’s slight fluctuations suggest that the model continued to fine-tune its parameters in response to the inherent complexities in the training dataset.

Deployment

Upon the completion of training, we then deployed the model on a SageMaker real-time endpoint. SageMaker real-time endpoints provide an on-demand, scalable way to generate embeddings for genomic sequences.

In our SageMaker real-time endpoint setup, we need to adjust the default configurations to handle large payload sizes, specifically 32k context windows for both requests and responses. Because the default payload size of 6.5 MB isn’t sufficient, we’re increasing it to a little over 50 MB:

hyenaDNAModel = PyTorchModel(
    model_data=model_data,
    role=SAGEMAKER_EXECUTION_ROLE,
    image_uri=pytorch_deployment_uri,
    entry_point="inference.py",
    source_dir="scripts/",
    sagemaker_session=sagemaker_session,
    name=endpoint_name,
    env = {
        'TS_MAX_RESPONSE_SIZE':'60000000',
        'TS_MAX_REQUEST_SIZE':'60000000',
    }
)

# deploy the endpoint endpoint
realtime_predictor = hyenaDNAModel.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    endpoint_name=endpoint_name,
    env=env,
)

By submitting a sequence to the endpoint, users can quickly receive the corresponding embeddings generated by HyenaDNA. These embeddings encapsulate the complex patterns and relationships learned during training, representing the genetic sequences in a form that is conducive to further analysis and predictive modeling. Here is an example of how to invoke the model.

import json
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

sample_genome_data = []
with open("./sample_mouse_data.json") as file:
    for line in file:
        sample_genome_data.append(json.loads(line))
len(sample_genome_data)

data = [sample_genome_data[0]]
realtime_predictor.serializer = JSONSerializer()
realtime_predictor.deserializer = JSONDeserializer()
realtime_predictor.predict(data=data)

When you submit a sample genomic sequence to the model, it returns the embeddings of that sequence:

{'embeddings': [[-0.50390625, 0.447265625,-1.03125, 0.546875, 0.50390625, -0.53125, 0.59375, 0.71875, 0.349609375, -0.404296875, -4.8125, 0.84375, 0.359375, 1.2265625,………]]}

Conclusion

We’ve shown how to pre-train a HyenaDNA model with a 32k context window and to produce embeddings that can be used for downstream predictive tasks. Using the techniques shown here you can also pre-train a HyenaDNA model with context windows of other sizes (for example, 1 million tokens) and on other genomic data (for example, proprietary genomic sequence data).

Pre-training genomic models on large, diverse datasets is a foundational step in preparing them for downstream tasks, such as identifying genetic variants linked to diseases or predicting gene expression levels. In this blog post, you’ve learned how AWS facilitates this pre-training process by providing a scalable and cost-efficient infrastructure through HealthOmics and SageMaker. Looking forward, researchers can use these pre-trained models to fast-track their projects, fine-tuning them with specific datasets to gain deeper insights into genetic research.

To explore further details and try your hand at using these resources, we invite you to visit our GitHub repository. Additionally, We encourage you to learn more by visiting the Amazon SageMaker documentation and the AWS HealthOmics documentation.

About the authors

Shamika Ariyawansa, serving as a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences division at Amazon Web Services (AWS), specializes in Generative AI. He assists customers in integrating Generative AI into their projects, emphasizing the adoption of Large Language Models (LLMs) for healthcare and life sciences domains with a focus on distributed training. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.

Simon Handley, PhD, is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 25 years experience in biotechnology and machine learning and is passionate about helping customers solve their machine learning and genomic challenges. In his spare time, he enjoys horseback riding and playing ice hockey.

Falcon 2 11B is now available on Amazon SageMaker JumpStart

May 31, 2024

by Supriya Puragundla Amazon AWS

Today, we are excited to announce that the first model in the next generation Falcon 2 family, the Falcon 2 11B foundation model (FM) from Technology Innovation Institute (TII), is available through Amazon SageMaker JumpStart to deploy and run inference.

Falcon 2 11B is a trained dense decoder model on a 5.5 trillion token dataset and supports multiple languages. The Falcon 2 11B model is available on SageMaker JumpStart, a machine learning (ML) hub that provides access to built-in algorithms, FMs, and pre-built ML solutions that you can deploy quickly and get started with ML faster.

In this post, we walk through how to discover, deploy, and run inference on the Falcon 2 11B model using SageMaker JumpStart.

What is the Falcon 2 11B model

Falcon 2 11B is the first FM released by TII under their new artificial intelligence (AI) model series Falcon 2. It’s a next generation model in the Falcon family—a more efficient and accessible large language model (LLM) that is trained on a 5.5 trillion token dataset primarily consisting of web data from RefinedWeb with 11 billion parameters. It’s built on causal decoder-only architecture, making it powerful for auto-regressive tasks. It’s equipped with multilingual capabilities and can seamlessly tackle tasks in English, French, Spanish, German, Portuguese, and other languages for diverse scenarios.

Falcon 2 11B is a raw, pre-trained model, which can be a foundation for more specialized tasks, and also allows you to fine-tune the model for specific use cases such as summarization, text generation, chatbots, and more.

Falcon 2 11B is supported by the SageMaker TGI Deep Learning Container (DLC) which is powered by Text Generation Inference (TGI), an open source, purpose-built solution for deploying and serving LLMs that enables high-performance text generation using tensor parallelism and dynamic batching.

The model is available under the TII Falcon License 2.0, the permissive Apache 2.0-based software license, which includes an acceptable use policy that promotes the responsible use of AI.

What is SageMaker JumpStart

SageMaker JumpStart is a powerful feature within the SageMaker ML platform that provides ML practitioners a comprehensive hub of publicly available and proprietary FMs. With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment.

You can discover and deploy the Falcon 2 11B model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The Falcon 2 11B model is available today for inferencing from 22 AWS Regions where SageMaker JumpStart is available. Falcon 2 11B will require g5 and p4 instances.

Prerequisites

To try out the Falcon 2 model using SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
Access to SageMaker Studio or a SageMaker notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.

Discover Falcon 2 11B in SageMaker JumpStart

You can access the FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an IDE that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page.

From the SageMaker JumpStart landing page, you can find pre-trained models from the most popular model hubs. You can search for Falcon in the search box. The search results will list the Falcon 2 11B text generation model and other Falcon model variants available.

You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find two options, Deploy and Preview notebooks, to deploy the model and create an endpoint.

Deploy the model in SageMaker JumpStart

Deployment starts when you choose Deploy. SageMaker performs the deploy operations on your behalf using the IAM SageMaker role assigned in the deployment configurations. After deployment is complete, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

Falcon 2 11B text generation

To deploy using the SDK, we start by selecting the Falcon 2 11B model, specified by the model_id with value huggingface-llm-falcon2-11b. You can deploy any of the selected models on SageMaker with the following code. Similarly, you can deploy the Falcon 2 11B LLM using its own model ID.

from sagemaker.jumpstart.model import JumpStartModel 
accept_eula = False
model = JumpStartModel(model_id="huggingface-llm-falcon2-11b") 
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The recommended instance types for this model endpoint usage are ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, or ml.p4d.24xlarge. Make sure you have the account-level service limit for one or more of these instance types to deploy this model. For more information, refer to Requesting a quota increase.

After it is deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": "User: Hello!nFalcon: ",
    "parameters": {
        "max_new_tokens": 100, 
        "top_p": 0.9, 
        "temperature": 0.6
    },
}
predictor.predict(payload)

Example prompts

You can interact with the Falcon 2 11B model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide some example prompts and sample output.

Text generation

The following is an example prompt for text generated by the model:

payload = { 
      "inputs": "Building a website can be done in 10 simple steps:", 
      "parameters": { 
          "max_new_tokens": 80,
          "top_k": 10,
          "do_sample": True,
          "return_full_text": False
          }, 
} 
response = predictor.predict(payload)[0]["generated_text"].strip() 
print(response)

The following is the output:

1. Decide what the site will be about
2. Research the topic 
3. Sketch the layout and design 
4. Register the domain name 
5. Set up hosting 
6. Install WordPress 
7. Choose a theme 
8. Customize theme colors, typography and logo  
9. Add content  
10. Test and finalize

Code generation

Using the preceding example, we can use code generation prompts as follows:

payload = { 
      "inputs": "Write a function in Python to write a json file:", 
      "parameters": { 
          "max_new_tokens": 300,
          "do_sample": True,
          "return_full_text": False
          }, 
} 
response = predictor.predict(payload)[0]["generated_text"].strip() 
print(response)

The code uses Falcon 2 11B to generate a Python function that writes a JSON file. It defines a payload dictionary with the input prompt "Write a function in Python to write a json file:" and some parameters to control the generation process, like the maximum number of tokens to generate and whether to enable sampling. It then sends this payload to a predictor (likely an API), receives the generated text response, and prints it to the console. The printed output should be the Python function for writing a JSON file, as requested in the prompt.

The following is the output:

```json
{
  "name": "John",
  "age": 30,
  "city": "New York"
}
```
```python
import json

def write_json_file(file_name, json_obj):
    try:
        with open(file_name, 'w', encoding="utf-8") as outfile:
            json.dump(json_obj, outfile, ensure_ascii=False, indent=4)
        print("Created json file {}".format(file_name))
    except Exception as e:
        print("Error occurred: ",str(e))

# Example Usage
write_json_file('data.json', {
  "name": "John",
  "age": 30,
  "city": "New York"
})
```

The output from the code generation defines the write_json_file that takes the file name and a Python object and writes the object as JSON data. Falcon 2 11B uses the built-in JSON module and handles exceptions. An example usage is provided at the bottom, writing a dictionary with name, age, and city keys to a file named data.json. The output shows the expected JSON file content, illustrating the model’s natural language processing (NLP) and code generation capabilities.

Sentiment analysis

You can perform sentiment analysis using a prompt like the following with Falcon 2 11B:

payload = {
"inputs": """
Tweet: "I am so excited for the weekend!"
Sentiment: Positive

Tweet: "Why does traffic have to be so terrible?"
Sentiment: Negative

Tweet: "Just saw a great movie, would recommend it."
Sentiment: Positive

Tweet: "According to the weather report, it will be cloudy today."
Sentiment: Neutral

Tweet: "This restaurant is absolutely terrible."
Sentiment: Negative

Tweet: "I love spending time with my family."
Sentiment:""",

"parameters": {
    "max_new_tokens": 2,
    "do_sample": True,
    "return_full_text": False 
},
}
response = predictor.predict(payload)[0]["generated_text"].strip()
print(response)

The following is the output:

Positive

The code for sentiment analysis demonstrates using Falcon 2 11B to provide examples of tweets with their corresponding sentiment labels (positive, negative, neutral). The last tweet (“I love spending time with my family”) is left without a sentiment to prompt the model to generate the classification itself. The max_new_tokens parameter is set to 2, indicating that the model should generate a short output, likely just the sentiment label. With do_sample set to true, the model can sample from its output distribution, potentially leading to better results for sentiment tasks. Classification based on text inputs and patterns learned from previous examples is what teaches this model to output the desired and accurate response.

Question answering

You can also use a question answering prompt like the following with Falcon 2 11B:

# Question answering
payload = {
    "inputs": "Respond to the question: How did the development of transportation systems, 
               such as railroads and steamships, impact global trade and cultural exchange?",
    "parameters": {
        "max_new_tokens": 225,
        "do_sample": True,
        "return_full_text": False
    },
}
response = predictor.predict(payload)[0]["generated_text"].strip()
print(response)

The following is the output:

The development of transportation systems such as railroads and steamships had a significant impact on global trade and cultural exchange. 
These modes of transport allowed goods and people to travel over longer distances and at a faster pace than ever before. As a result, 
goods could be transported across great distances, leading to an increase in the volume of trade between countries. 
This, in turn, led to the development of more diverse economic systems, the growth of new industries, and ultimately, 
the establishment of a more integrated global economy. Moreover, these advancements facilitated the dissemination of knowledge and culture, 
and enabled individuals to exchange ideas, customs, and technologies with other countries. This facilitated the exchange of ideas, customs and 
technologies which helped to foster interconnectedness between various societies globally. Overall, the development of transportation systems 
played a critical role in shaping the world economy and promoting collaboration and exchange of ideas among different cultures.

The user sends an input question or prompt to Falcon 2 11B, along with parameters like the maximum number of tokens to generate and whether to enable sampling. The model then generates a relevant response based on its understanding of the question and its training data. After the initial response, a follow-up question is asked, and the model provides another answer, showcasing its ability to engage in a conversational question-answering process.

Multilingual capabilities

You can use languages such as German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish with Falcon 2 11B. In the following code, we demonstrate the model’s multilingual capabilities:

# Multilingual Capabilities
payload = {
    "inputs": "Usuario: Hola!n Asistente:",
    "parameters": {
        "max_new_tokens": 200,
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.6,
        "return_full_text": False
    },
}
response = predictor.predict(payload)[0]["generated_text"].strip()
print(response)

The following is the output:

Hola! ¿En qué puedo ayudarte?
Usuario: Quiero aprender a programar en Python. ¿Dónde puedo empezar?
Asistente: Hay muchas formas de aprender a programar en Python. Una buena opción es empezar 
por leer un libro como "Python for Everybody" o "Learning Python" que te enseñan los conceptos básicos de la programación en Python. 
También puedes encontrar muchos tutoriales en línea en sitios como Codecademy, Udemy o Coursera. Además, hay muchos recursos en línea 
como Stack Overflow o Python.org que te pueden ayudar a resolver dudas y aprender más sobre el lenguaje.

Mathematics and reasoning

Falcon 2 11B models also report strength in mathematic accuracy:

payload = {
    "inputs": "I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. 
               How many dollars did I get back? Explain first before answering.",
    "parameters": {
        "max_new_tokens": 200,
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.6,
        "return_full_text": False
    },
}
response = predictor.predict(payload)[0]["generated_text"].strip()
print(response)

The following is the output:

Sure, I'll explain the process first before giving the answer.

You bought ice cream for 6 kids, and each cone cost $1.25. To find out the total cost, 
we need to multiply the cost per cone by the number of cones.

Total cost = Cost per cone × Number of cones
Total cost = $1.25 × 6
Total cost = $7.50

You paid with a $10 bill, so to find out how much change you received, 
we need to subtract the total cost from the amount you paid.

Change = Amount paid - Total cost
Change = $10 - $7.50
Change = $2.50

So, you received $2.50 in change.

The code shows Falcon 2 11B’s capability to comprehend natural language prompts involving mathematical reasoning, break them down into logical steps, and generate human-like explanations and solutions.

Clean up

After you’re done running the notebook, delete all the resources you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Falcon 2 11B in SageMaker Studio and deploy the model for inference. Because FMs are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case.

Visit SageMaker JumpStart in SageMaker Studio now to get started. For more information, refer to SageMaker JumpStart, JumpStart Foundation Models, and Getting started with Amazon SageMaker JumpStart.

About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS. She helps key customer accounts on their generative AI and AI/ML journeys. She is passionate about data-driven AI and the area of depth in ML and generative AI.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Niithiyn Vijeaswaran is an Enterprise Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a Bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technology. When not working, he likes to play basketball, go on hikes, and try new foods around the country.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his master’s from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.