Amazon AWS – Page 72

Automate the machine learning model approval process with Amazon SageMaker Model Registry and Amazon SageMaker Pipelines

August 7, 2024

by Jason Sizer McIntosh Amazon AWS

Innovations in artificial intelligence (AI) and machine learning (ML) are causing organizations to take a fresh look at the possibilities these technologies can offer. As you aim to bring your proofs of concept to production at an enterprise scale, you may experience challenges aligning with the strict security compliance requirements of their organization. In the face of these challenges, MLOps offers an important path to shorten your time to production while increasing confidence in the quality of deployed workloads by automating governance processes.

ML models in production are not static artifacts. They reflect the environment where they are deployed and, therefore, require comprehensive monitoring mechanisms for model quality, bias, and feature importance. Organizations often want to introduce additional compliance checks that validate that the model aligns with their organizational standards before it is deployed. These frequent manual checks can create long lead times to deliver value to customers. Automating these checks allows them to be repeated regularly and consistently rather than organizations having to rely on infrequent manual point- in-time checks.

This post illustrates how to use common architecture principles to transition from a manual monitoring process to one that is automated. You can use these principles and existing AWS services such as Amazon SageMaker Model Registry and Amazon SageMaker Pipelines to deliver innovative solutions to your customers while maintaining compliance for your ML workloads.

Challenge

As AI becomes ubiquitous, it’s increasingly used to process information and interact with customers in a sensitive context. Suppose a tax agency is interacting with its users through a chatbot. It’s important that this new system aligns with organizational guidelines by allowing developers to have a high degree of confidence that it responds accurately and without bias. At maturity, an organization may have tens or even hundreds of models in production. How can you make sure every model is properly vetted before it’s deployed and on each deployment?

Traditionally, organizations have created manual review processes to keep updated code from becoming available to the public through mechanisms such as an Enterprise Review Committee (ERC), Enterprise Review Board (ERB), or a Change Advisory Board (CAB).

Just as mechanisms have evolved with the rise of continuous integration and continuous delivery (CI/CD), MLOps can reduce the need for manual processes while increasing the frequency and thoroughness of quality checks. Through automation, you can scale in-demand skillsets, such as model and data analysis, introducing and enforcing in-depth analysis of your models at scale across diverse product teams.

In this post, we use SageMaker Pipelines to define the required compliance checks as code. This allows you to introduce analysis of arbitrary complexity while not being limited by the busy schedules of highly technical individuals. Because the automation takes care of repetitive analytics tasks, technical resources can focus on relentlessly improving the quality and thoroughness of the MLOps pipeline to improve compliance posture, and make sure checks are performing as expected.

Deployment of an ML model to production generally requires at least two artifacts to be approved: the model and the endpoint. In our example, the organization is willing to approve a model for deployment if it passes their checks for model quality, bias, and feature importance prior to deployment. Secondly, the endpoint can be approved for production if it performs as expected when deployed into a production-like environment. In a subsequent post, we walk you through how to deploy a model and implement sample compliance checks. In this post, we discuss how you can extend this process to large language models (LLMs), which produce a varied set of outputs and introduce complexities regarding automated quality assurance checks.

Aligning with AWS multi-account best practices

The solution outlined in this post spans across several accounts in a given AWS organization. For a deeper look at the various components required for an AWS organization multi-account enterprise ML environment, see MLOps foundation roadmap for enterprises with Amazon SageMaker. In this post, we refer to the advanced analytics governance account as the AI/ML governance account. We focus on the development of the enforcement mechanism for the centralized automated model approval within this account.

This account houses centralized components such as a model registry on SageMaker Model Registry, ML project templates on SageMaker Projects, model cards on Amazon SageMaker Model Cards, and container images on Amazon Elastic Container Registry (Amazon ECR).

We use an isolated environment (in this case, a separate AWS environment) to deploy and promote across various environments. You can modify the strategies discussed in this post along the spectrum of centralized vs. decentralized depending on the posture of your organization. For this example, we provide a centralized model. You can also extend this model to align with strict compliance requirements. For example, the AI/ML governance team trusts the development teams are sending the correct bias and explainability reports for a given model. Additional checks could be included to “trust by verify” to further bolster the posture of this organization. Additional complexities such as this are not addressed in this post. To dive further into the topic of MLOps secure implementations, refer to Amazon SageMaker MLOps: from idea to production in six steps.

Solution overview

The following diagram illustrates the solution architecture using SageMaker Pipelines to automate model approval.

The workflow comprises a comprehensive process for model building, training, evaluation, and approval within an organization containing different AWS accounts, integrating various AWS services. The detailed steps are as follows:

Data scientists from the product team use Amazon SageMaker Studio to create Jupyter notebooks used to facilitate data preprocessing and model pre-building. The code is committed to AWS CodeCommit, a managed source control service. Optionally, you can commit to third-party version control systems such as GitHub, GitLab, or Enterprise Git.
The commit to CodeCommit invokes the SageMaker pipeline, which runs several steps, including model building and training, and running processing jobs using Amazon SageMaker Clarify to generate bias and explainability reports.
- SageMaker Clarify processes and stores its outputs, including model artifacts and reports in JSON format, in an Amazon Simple Storage Service (Amazon S3) bucket.
- A model is registered in the SageMaker model registry with a model version.
The Amazon S3 PUT action invokes an AWS Lambda
This Lambda function copies all the artifacts from the S3 bucket in the development account to another S3 bucket in the AI/ML governance account, providing restricted access and data integrity. This post assumes your accounts and S3 buckets are in the same AWS Region. For cross-Region copying, see Copy data from an S3 bucket to another account and Region by using the AWS CLI.
Registering the model invokes a default Amazon CloudWatch event associated with SageMaker model registry actions.
The CloudWatch event is consumed by Amazon EventBridge, which invokes another Lambda
This Lambda function is tasked with starting the SageMaker approval pipeline.
The SageMaker approval pipeline evaluates the artifacts against predefined benchmarks to determine if they meet the approval criteria.
Based on the evaluation, the pipeline updates the model status to approved or rejected accordingly.

This workflow provides a robust, automated process for model approval using AWS’s secure, scalable infrastructure and services. Each step is designed to make sure that only models meeting the set criteria are approved, maintaining high standards for model performance and fairness.

Prerequisites

To implement this solution, you need to first create and register an ML model in the SageMaker model registry with the necessary SageMaker Clarify artifacts. You can create and run the pipeline by following the example provided in the following GitHub repository.

The following sections assume that a model package version has been registered with status Pending Manual Approval. This status allows you to build an approval workflow. You can either have a manual approver or set up an automated approval workflow based on metrics checks in the aforementioned reports.

Build your pipeline

SageMaker Pipelines allows you to define a series of interconnected steps defined as code using the Pipelines SDK. You can extend the pipeline to help meet your organizational needs with both automated and manual approval steps. In this example, we build the pipeline to include two major steps. The first step evaluates artifacts uploaded to the AI/ML governance account by the model build pipeline against threshold values set by model registry administrators for model quality, bias, and feature importance. The second step receives the evaluation and updates the model’s status and metadata based on the values received. The pipeline is represented in SageMaker Pipelines by the following DAG.

Next, we dive into the code required for the pipeline and its steps. First, we define a pipeline session to help manage AWS service integration as we define our pipeline. This can be done as follows:

pipeline_session = PipelineSession()

Each step runs as a SageMaker Processor for which we specify a small instance type due to the minimal compute requirements of our pipeline. The processor can be defined as follows:

from sagemaker.processing import Processor
step_processor=Processor(
    image_uri=image_uri,
    role=role, 
    instance_type="ml.t3.medium", 
    base_job_name=base_job_name,
    instance_count=1,  
    sagemaker_session=pipeline_session,
)

We then define the pipeline steps using step_processor.run(…) as the input parameter to run our custom script inside the defined environment.

Validate model package artifacts

The first step takes two arguments: default_bucket and model_package_group_name. It outputs the results of the checks in JSON format stored in Amazon S3. The step is defined as follows:

process_step = ProcessingStep(
    name="RegisteredModelValidationStep",
    step_args= step_processor.run(
        code="automated-model-approval/model-approval-checks.py",
        inputs=[],
        outputs=[
            ProcessingOutput(
                output_name="checks",
                destination=f"s3://{default_bucket}/governance-pipeline/processor/",
                source="/opt/ml/processing/output"
        )],
        arguments=[
            "--default_bucket", default_bucket_s3, 
            "--model_package_group_name", model_package_group_name
        ]
    )
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail.

Values passed to arguments can be parsed using standard methods like argparse and will be used throughout the script. We use these values to retrieve the model package. We then parse the model package’s metadata to find the location of the model quality, bias, and explainability reports. See the following code:

model_package_arn = client.list_model_packages(ModelPackageGroupName=model_package_group_name)[
        "ModelPackageSummaryList"
    ][0]["ModelPackageArn"]
    model_package_metrics = 
client.describe_model_package(ModelPackageName=model_package_arn)["ModelMetrics"]
model_quality_s3_key = model_package_metrics["ModelQuality"]["Statistics"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality_bias = model_package_metrics["Bias"]
model_quality_pretrain_bias_key = model_quality_bias["PreTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality__post_train_bias_key = model_quality_bias["PostTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_explainability_s3_key = model_package_metrics["Explainability"]["Report"]["S3Uri"].split(f"{default_bucket}/")[1]

The reports retrieved are simple JSON files we can then parse. In the following example, we retrieve the treatment equity and compare to our threshold in order to return a True or False result. Treatment equity is defined as the difference in the ratio of false negatives to false positives for the advantaged vs. disadvantaged group. We arbitrarily set the optimal threshold to be 0.8.

s3_obj = s3_client.get_object(Bucket=default_bucket, Key=model_quality__post_train_bias_key)
s3_obj_data = s3_obj['Body'].read().decode('utf-8')
model_quality__post_train_bias_json = json.loads(s3_obj_data)
treatment_equity = model_quality__post_train_bias_json["post_training_bias_metrics"][
        "facets"]["column_8"][0]["metrics"][-1]["value"]
treatment_equity_check_threshold = 0.8
treatment_equity_check = True if treatment_equity < treatment_equity_check_threshold else False

After running through the measures of interest, we return the true/false checks to a JSON file that will be copied to Amazon S3 as per the output variable of the ProcessingStep.

Update the model package status in the model registry

When the initial step is complete, we use the JSON file created in Amazon S3 as input to update the model package’s status and metadata. See the following code:

update_model_status_step = ProcessingStep(
    name="UpdateModelStatusStep",
    step_args=step_processor.run(
        code="automated-model-approval/validate-model.py",
        inputs=[
            ProcessingInput(
                source=process_step.properties.ProcessingOutputConfig.Outputs[
                    "checks"
                ].S3Output.S3Uri,
                destination="/opt/ml/processing/input",
            ),
        ],
        outputs=[],
        arguments=[
            "--model_package_group_name", model_package_group_name
        ]
    ),
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail. First, parse the values in checks.json to evaluate if the model passed all checks or review the reasons for failure:

is_approved = True
reasons = []
with open('/opt/ml/processing/input/checks.json') as checks:
        checks = json.load(checks)
        print(f"checks: {checks}")
        for key, value in checks.items():            
            if not value:
                is_approved = False
                reasons.append(key)

After we know if the model should be approved or rejected, we update the model status and metadata as follows:

if is_approved:
        approval_description = "Model package meets organisational guidelines"
else:
        approval_description = "Model values for the following checks does not meet threshold: "

for reason in reasons:
approval_description+= f"{reason} "
        
model_package_update_input_dict = {
        "ModelPackageArn" : model_package_arn,
        "ApprovalDescription": approval_description,
        "ModelApprovalStatus" : "Approved" if is_approved else "Rejected"
    }
    
model_package_update_response = client.update_model_package(**model_package_update_input_dict)

This step produces a model with a status of Approved or Rejected based on the set of checks specified in the first step.

Orchestrate the steps as a SageMaker pipeline

We orchestrate the previous steps as a SageMaker pipeline with two parameter inputs passed as arguments to the various steps:

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterString

model_package_group_name = ParameterString(
name="ModelPackageGroupName", default_value="ModelPackageGroupName is required variable."
)

default_bucket_s3 = ParameterString(
name="Bucket", default_value="Bucket is required variable")

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[model_package_group_name, default_bucket_s3],
    steps=[process_step, update_model_status_step],
)

It’s straightforward to extend this pipeline by adding elements into the list passed to the steps parameter. In the next section, we explore how to run this pipeline as new model packages are registered to our model registry.

Run the event-driven pipeline

In this section, we outline how to invoke the pipeline using an EventBridge rule and Lambda function.

Create a Lambda function and select the Python 3.9 runtime. The following function retrieves the model package ARN, the model package group name, and the S3 bucket where the artifacts are stored based on the event. It then starts running the pipeline using these values:

import json
import boto3
sagemaker_client = boto3.client('sagemaker')

def lambda_handler(event, context):
    model_arn = event.get('detail', {}).get('ModelPackageArn', 'Unknown')
    model_package_group_name = event.get('detail', {}).get('ModelPackageGroupName', 'Unknown') 
    model_package_name = event.get('detail', {}).get('ModelPackageName', 'Unknown') 
    model_data_url = event.get('InferenceSpecification', {}).get('ModelDataUrl', 'Unknown')        
    
    # Specify the name of your SageMaker pipeline
    pipeline_name = 'model-governance-pipeline'
    
    # Define multiple parameters
    pipeline_parameters = [
    {'Name': "ModelPackageGroupName", 'Value': model_package_group_name}, {'Name': "Bucket", 'Value': model_data_url},
   ]
    # Start the pipeline execution
    response = sagemaker_client.start_pipeline_execution(
    	PipelineName=pipeline_name,
    	PipelineExecutionDisplayName=pipeline_name,
    	PipelineParameters=pipeline_parameters
    )
    
    # Return the response
    return response

After defining the Lambda function, we create the EventBridge rule to automatically invoke the function when a new model package is registered with PendingManualApproval into the model registry. You can use AWS CloudFormation and the following template to create the rule:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "CloudFormation template for EventBridge rule 'invoke-model-approval-checks'",
  "Resources": {
    "EventRule0": {
      "Type": "AWS::Events::Rule",
      "Properties": {
        "EventBusName": "default",
        "EventPattern": {
          "source": ["aws.sagemaker"],
          "detail-type": ["SageMaker Model Package State Change"],
          "detail": {
            "ModelApprovalStatus": ["PendingManualApproval"]
          }
        },
        "Name": "invoke-model-approval-checks",
        "State": "ENABLED",
        "Targets": [{
          "Id": "Id403a084c-2837-4408-940f-b808389653d1",
          "Arn": "<Your Lambda function ARN>"
        }]
      }
    }
  }
}

We now have a SageMaker pipeline consisting of two steps being invoked when a new model is registered to evaluate model quality, bias, and feature importance metrics and update the model status accordingly.

Applying this approach to generative AI models

In this section, we explore how the complexities introduced by LLMs change the automated monitoring workflow.

Traditional ML models typically produce concise outputs with obvious ground truths in their training dataset. In contrast, LLMs can generate long, nuanced sequences that may have little to no ground truth due to the autoregressive nature of training this segment of model. This strongly influences various components of the governance pipeline we’ve described.

For instance, in traditional ML models, bias is detected by looking at the distributions of labels over different population subsets (for example, male vs. female). The labels (often a single number or a few numbers) are a clear and simple signal used to measure bias. In contrast, generative models produce lengthy and complex answers, which don’t provide an obvious signal to be used for monitoring. HELM (a holistic framework for evaluating foundation models) allows you to simplify monitoring by untangling the evaluation process into metrics of concern. This includes accuracy, calibration and uncertainty, robustness, fairness, bias and stereotypes, toxicity, and efficiency. We then apply downstream processes to measure for these metrics independently. This is generally done using standardized datasets composed of examples and a variety of accepted responses.

We concretely evaluate four metrics of interest to any governance pipelines for LLMs: memorization and copyright, disinformation, bias, and toxicity, as described in HELM. This is done by collecting inference results from the model pushed to the model registry. The benchmarks include:

Memorization and copyright with books from bookscorpus, which uses popular books from a bestseller list and source code of the Linux kernel. This can be quickly extended to include a number of copyrighted works.
Disinformation with headlines from the MisinfoReactionFrames dataset, which has false headlines across a number of topics.
Bias with Bias Benchmark for Question Answering (BBQ). This QA dataset works to highlight biases affecting various social groups.
Toxicity with Bias in Open-ended Language Generation Dataset (BOLD), which benchmarks across profession, gender, race, religion, and political ideology.

Each of these datasets is publicly available. They each allow complex aspects of a generative model’s behavior to be isolated and distilled down to a single number. This flow is described in the following architecture.

For a detailed view of this topic along with important mechanisms to scale in production, refer to Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services.

Conclusion

In this post, we discussed a sample solution to begin automating your compliance checks for models going into production. As AI/ML becomes increasingly common, organizations require new tools to codify the expertise of their highly skilled employees in the AI/ML space. By embedding your expertise as code and running these automated checks against models using event-driven architectures, you can increase both the speed and quality of models by empowering yourself to run these checks as needed rather than relying on the availability of individuals for manual compliance or quality assurance reviews By using well-known CI/CD techniques in the application development lifecycle and applying them to the ML modeling lifecycle, organizations can scale in the era of generative AI.

If you have any thoughts or questions, please leave them in the comments section.

About the Authors

Jayson Sizer McIntosh is a Senior Solutions Architect at Amazon Web Services (AWS) in the World Wide Public Sector (WWPS) based in Ottawa (Canada) where he primarily works with public sector customers as an IT generalist with a focus on Dev(Sec)Ops/CICD. Bringing his experience implementing cloud solutions in high compliance environments, he is passionate about helping customers successfully deliver modern cloud-based services to their users.

Nicolas Bernier is an AI/ML Solutions Architect, part of the Canadian Public Sector team at AWS. He is currently conducting research in Federated Learning and holds five AWS certifications, including the ML Specialty Certification. Nicolas is passionate about helping customers deepen their knowledge of AWS by working with them to translate their business challenges into technical solutions.

Pooja Ayre is a seasoned IT professional with over 9 years of experience in product development, having worn multiple hats throughout her career. For the past two years, she has been with AWS as a Solutions Architect, specializing in AI/ML. Pooja is passionate about technology and dedicated to finding innovative solutions that help customers overcome their roadblocks and achieve their business goals through the strategic use of technology. Her deep expertise and commitment to excellence make her a trusted advisor in the IT industry.

Build custom generative AI applications powered by Amazon Bedrock

August 6, 2024

by Vasi Philomin Amazon AWS

With last month’s blog, I started a series of posts that highlight the key factors that are driving customers to choose Amazon Bedrock. I explored how Bedrock enables customers to build a secure, compliant foundation for generative AI applications. Now I’d like to turn to a slightly more technical, but equally important differentiator for Bedrock—the multiple techniques that you can use to customize models and meet your specific business needs.

As we’ve all heard, large language models (LLMs) are transforming the way we leverage artificial intelligence (AI) and enabling businesses to rethink core processes. Trained on massive datasets, these models can rapidly comprehend data and generate relevant responses across diverse domains, from summarizing content to answering questions. The wide applicability of LLMs explains why customers across healthcare, financial services, and media and entertainment are moving quickly to adopt them. However, our customers tell us that while pre-trained LLMs excel at analyzing vast amounts of data, they often lack the specialized knowledge necessary to tackle specific business challenges.

Customization unlocks the transformative potential of large language models. Amazon Bedrock equips you with a powerful and comprehensive toolset to transform your generative AI from a one-size-fits-all solution into one that is finely tailored to your unique needs. Customization includes varied techniques such as Prompt Engineering, Retrieval Augmented Generation (RAG), and fine-tuning and continued pre-training. Prompt Engineering involves carefully crafting prompts to get a desired response from LLMs. RAG combines knowledge retrieved from external sources with language generation to provide more contextual and accurate responses. Model Customization techniques—including fine-tuning and continued pre-training involve further training a pre-trained language model on specific tasks or domains for improved performance. These techniques can be used in combination with each other to train base models in Amazon Bedrock with your data to deliver contextual and accurate outputs. Read the below examples to understand how customers are using customization in Amazon Bedrock to deliver on their use cases.

Thomson Reuters, a global content and technology company, has seen positive results with Claude 3 Haiku, but anticipates even better results with customization. The company—which serves professionals in legal, tax, accounting, compliance, government, and media—expects that it will see even faster and more relevant AI results by fine-tuning Claude with their industry expertise.

“We’re excited to fine-tune Anthropic’s Claude 3 Haiku model in Amazon Bedrock to further enhance our Claude-powered solutions. Thomson Reuters aims to provide accurate, fast, and consistent user experiences. By optimizing Claude around our industry expertise and specific requirements, we anticipate measurable improvements that deliver high-quality results at even faster speeds. We’ve already seen positive results with Claude 3 Haiku, and fine-tuning will enable us to tailor our AI assistance more precisely.”

– Joel Hron, Chief Technology Officer at Thomson Reuters.

At Amazon, we see Buy with Prime using Amazon Bedrock’s cutting-edge RAG-based customization capabilities to drive greater efficiency. Their order on merchants’ sites are covered by Buy with Prime Assist, 24/7 live chat customer service. They recently launched a chatbot solution in beta capable of handling product support queries. The solution is powered by Amazon Bedrock and customized with data to go beyond traditional email-based systems. My colleague Amit Nandy, Product Manager at Buy with Prime, says,

“By indexing merchant websites, including subdomains and PDF manuals, we constructed tailored knowledge bases that provided relevant and comprehensive support for each merchant’s unique offerings. Combined with Claude’s state-of-the-art foundation models and Guardrails for Amazon Bedrock, our chatbot solution delivers a highly capable, secure, and trustworthy customer experience. Shoppers can now receive accurate, timely, and personalized assistance for their queries, fostering increased satisfaction and strengthening the reputation of Buy with Prime and its participating merchants.”

Stories like these are the reason why we continue to double down on our customization capabilities for generative AI applications powered by Amazon Bedrock.

In this blog, we’ll explore the three major techniques for customizing LLMs in Amazon Bedrock. And, we’ll cover related announcements from the recent AWS New York Summit.

Prompt Engineering: Guiding your application toward desired answers

Prompts are the primary inputs that drive LLMs to generate answers. Prompt engineering is the practice of carefully crafting these prompts to guide LLMs effectively. Learn more here. Well-designed prompts can significantly boost a model’s performance by providing clear instructions, context, and examples tailored to the task at hand. Amazon Bedrock supports multiple prompt engineering techniques. For example, few-shot prompting provides examples with desired outputs to help models better understand tasks, such as sentiment analysis samples labeled “positive” or “negative.” Zero-shot prompting provides task descriptions without examples. And chain-of-thought prompting enhances multi-step reasoning by asking models to break down complex problems, which is useful for arithmetic, logic, and deductive tasks.

Our Prompt Engineering Guidelines outline various prompting strategies and best practices for optimizing LLM performance across applications. Leveraging these techniques can help practitioners achieve their desired outcomes more effectively. However, developing optimal prompts that elicit the best responses from foundational models is a challenging and iterative process, often requiring weeks of refinement by developers.

Zero-shot prompting	Few-shot prompting

Chain-of-thought prompting with Prompt Flows Visual Builder

Retrieval-Augmented Generation: Augmenting results with retrieved data

LLMs generally lack specialized knowledge, jargon, context, or up-to-date information needed for specific tasks. For instance, legal professionals seeking reliable, current, and accurate information within their domain may find interactions with generalist LLMs inadequate. Retrieval-Augmented Generation (RAG) is the process of allowing a language model to consult an authoritative knowledge base outside of its training data sources—before generating a response.

The RAG process involves three main steps:

Retrieval: Given an input prompt, a retrieval system identifies and fetches relevant passages or documents from a knowledge base or corpus.
Augmentation: The retrieved information is combined with the original prompt to create an augmented input.
Generation: The LLM generates a response based on the augmented input, leveraging the retrieved information to produce more accurate and informed outputs.

Amazon Bedrock’s Knowledge Bases is a fully managed RAG feature that allows you to connect LLMs to internal company data sources—delivering relevant, accurate, and customized responses. To offer greater flexibility and accuracy in building RAG-based applications, we announced multiple new capabilities at the AWS New York Summit. For example, now you can securely access data from new sources like the web (in preview), allowing you to index public web pages, or access enterprise data from Confluence, SharePoint, and Salesforce (all in preview). Advanced chunking options are another exciting new feature, enabling you to create custom chunking algorithms tailored to your specific needs, as well as leverage built-in semantic and hierarchical chunking options. You now have the capability to extract information with precision from complex data formats (e.g., complex tables within PDFs), thanks to advanced parsing techniques. Plus, the query reformulation feature allows you to deconstruct complex queries into simpler sub-queries, enhancing retrieval accuracy. All these new features help you reduce the time and cost associated with data access and construct highly accurate and relevant knowledge resources—all tailored to your specific enterprise use cases.

Model Customization: Enhancing performance for specific tasks or domains

Model customization in Amazon Bedrock is a process to customize pre-trained language models for specific tasks or domains. It involves taking a large, pre-trained model and further training it on a smaller, specialized dataset related to your use case. This approach leverages the knowledge acquired during the initial pre-training phase while adapting the model to your requirements, without losing the original capabilities. The fine-tuning process in Amazon Bedrock is designed to be efficient, scalable, and cost-effective, enabling you to tailor language models to your unique needs, without the need for extensive computational resources or data. In Amazon Bedrock, model fine-tuning can be combined with prompt engineering or the Retrieval-Augmented Generation (RAG) approach to further enhance the performance and capabilities of language models. Model customization can be implemented both for labeled and unlabeled data.

Fine-Tuning with labeled data involves providing labeled training data to improve the model’s performance on specific tasks. The model learns to associate appropriate outputs with certain inputs, adjusting its parameters for better task accuracy. For instance, if you have a dataset of customer reviews labeled as positive or negative, you can fine-tune a pre-trained model within Bedrock on this data to create a sentiment analysis model tailored to your domain. At the AWS New York Summit, we announced Fine-tuning for Anthropic’s Claude 3 Haiku. By providing task-specific training datasets, users can fine-tune and customize Claude 3 Haiku, boosting its accuracy, quality, and consistency for their business applications.

Continued Pre-training with unlabeled data, also known as domain adaptation, allows you to further train the LLMs on your company’s proprietary, unlabeled data. It exposes the model to your domain-specific knowledge and language patterns, enhancing its understanding and performance for specific tasks.

Customization holds the key to unlocking the true power of generative AI

Large language models are revolutionizing AI applications across industries, but tailoring these general models with specialized knowledge is key to unlocking their full business impact. Amazon Bedrock empowers organizations to customize LLMs through Prompt Engineering techniques, such as Prompt Management and Prompt Flows, that help craft effective prompts. Retrieval-Augmented Generation—powered by Amazon Bedrock’s Knowledge Bases—lets you integrate LLMs with proprietary data sources to generate accurate, domain-specific responses. And Model Customization techniques, including fine-tuning with labeled data and continued pre-training with unlabeled data, help optimize LLM behavior for your unique needs. After taking a close look at these three main customization methods, it’s clear that while they may take different approaches, they all share a common goal—to help you address your specific business problems..

Resources

For more information on customization with Amazon Bedrock, check the below resources:

Learn more about Amazon Bedrock
Learn more about Amazon Bedrock Knowledge Bases
Read announcement blog on additional data connectors in Knowledge Bases for Amazon Bedrock
Read blog on advanced chunking and parsing options in Knowledge Bases for Amazon Bedrock
Learn more about Prompt Engineering
Learn more about Prompt Engineering techniques and best practices
Read announcement blog on Prompt Management and Prompt Flows
Learn more about fine-tuning and continued pre-training
Read the announcement blog on fine-tuning Anthropic’s Claude 3 Haiku

About the author

Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.

Use Amazon Bedrock to generate, evaluate, and understand code in your software development pipeline

August 6, 2024

by Ian Lenora Amazon AWS

Generative artificial intelligence (AI) models have opened up new possibilities for automating and enhancing software development workflows. Specifically, the emergent capability for generative models to produce code based on natural language prompts has opened many doors to how developers and DevOps professionals approach their work and improve their efficiency. In this post, we provide an overview of how to take advantage of the advancements of large language models (LLMs) using Amazon Bedrock to assist developers at various stages of the software development lifecycle (SDLC).

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The following process architecture proposes an example SDLC flow that incorporates generative AI in key areas to improve the efficiency and speed of development.

The intent of this post is to focus on how developers can create their own systems to augment, write, and audit code by using models within Amazon Bedrock instead of relying on out-of-the-box coding assistants. We discuss the following topics:

A coding assistant use case to help developers write code faster by providing suggestions
How to use the code understanding capabilities of LLMs to surface insights and recommendations
An automated application generation use case to generate functioning code and automatically deploy changes into a working environment

Considerations

It’s important to consider some technical options when choosing your model and approach to implementing this functionality at each step. One such option is the base model to use for the task. With each model having been trained on a different corpus of data, there will inherently be different task performance per model. Anthropic’s Claude 3 on Amazon Bedrock models write code effectively out of the box in many common coding languages, for example, whereas others may not be able to reach that performance without further customization. Customization, however, is another technical choice to make. For instance, if your use case includes a less common language or framework, customizing the model through fine-tuning or using Retrieval Augmented Generation (RAG) may be necessary to achieve production-quality performance, but involves more complexity and engineering effort to implement effectively.

There is an abundance of literature breaking down these trade-offs; for this post, we are just describing what should be explored in its own right. We are simply laying the context that goes into the builder’s initial steps in implementing their generative AI-powered SDLC journey.

Coding assistant

Coding assistants are a very popular use case, with an abundance of examples from which to choose. AWS offers several services that can be applied to assist developers, either through in-line completion from tools like Amazon CodeWhisperer, or to be interacted with via natural language using Amazon Q. Amazon Q for builders has several implementations of this functionality, such as:

In nearly all the use cases described, there can be an integration with the chat interface and assistants. The use cases here are focused on more direct code generation use cases using natural language prompts. This is not to be confused with in-line generation tools that focus on autocompleting a coding task.

The key benefit of an assistant over in-line generation is that you can start new projects based on simple descriptions. For instance, you can describe that you want a serverless website that will allow users to post in blog fashion, and Amazon Q can start building the project by providing sample code and making recommendations on which frameworks to use to do this. This natural language entry point can give you a template and framework to operate within so you can spend more time on the differentiating logic of your application rather than the setup of repeatable and commoditized components.

Code understanding

It’s common for a company that begins to experiment with generative AI to augment the productivity of their individual developers to then use LLMs to infer meaning and functionality of code to improve the reliability, efficiency, security, and speed of the development process. Code understanding by humans is a central part of the SDLC: creating documentation, performing code reviews, and applying best practices. Onboarding new developers can be a challenge even for mature teams. Instead of a more senior developer taking time to respond to questions, an LLM with awareness of the code base and the team’s coding standards could be used to explain sections of code and design decisions to the new team member. The onboarding developer has everything they need with a rapid response time and the senior developer can focus on building. In addition to user-facing behaviors, this same mechanism can be repurposed to work completely behind the scenes to augment existing continuous integration and continuous delivery (CI/CD) processes as an additional reviewer.

For instance, you can use prompt engineering techniques to guide and automate the application of coding standards, or include the existing code base as referential material to use custom APIs. You can also take proactive measures by prefixing each prompt with a reminder to follow the coding standards and make a call to get them from document storage, passing them to the model as context with the prompt. As a retroactive measure, you can add a step during the review process to check the written code against the standards to enforce adherence, similar to how a team code review would work. For example, let’s say that one of the team’s standards is to reuse components. During the review step, the model can read over a new code submission, note that the component already exists in the code base, and suggest to the reviewer to reuse the existing component instead of recreating it.

The following diagram illustrates this type of workflow.

Application generation

You can extend the concepts from the use cases described in this post to create a full application generation implementation. In the traditional SDLC, a human creates a set of requirements, makes a design for the application, writes some code to implement that design, builds tests, and receives feedback on the system from external sources or people, and then the process repeats. The bottleneck in this cycle typically comes at the implementation and testing phases. An application builder needs to have substantive technical skills to write code effectively, and there are typically numerous iterations required to debug and perfect code—even for the most skilled builders. In addition, a foundational knowledge of a company’s existing code base, APIs, and IP are fundamental to implementing an effective solution, which can take humans a long time to learn. This can slow down the time to innovation for new teammates or teams with technical skills gaps. As mentioned earlier, if models can be used with the capability to both create and interpret code, pipelines can be created that perform the developer iterations of the SDLC by feeding outputs of the model back in as input.

The following diagram illustrates this type of workflow.

For example, you can use natural language to ask a model to write an application that prints all the prime numbers between 1–100. It returns a block of code that can be run with applicable tests defined. If the program doesn’t run or some tests fail, the error and failing code can be fed back into the model, asking it to diagnose the problem and suggest a solution. The next step in the pipeline would be to take the original code, along with the diagnosis and suggested solution, and stitch the code snippets together to form a new program. The SDLC restarts in the testing phase to get new results, and either iterates again or a working application is produced. With this basic framework, an increasing number of components can be added in the same manner as in a traditional human-based workflow. This modular approach can be continuously improved until there is a robust and powerful application generation pipeline that simply takes in a natural language prompt and outputs a functioning application, handling all of the error correction and best practice adherence behind the scenes.

The following diagram illustrates this advanced workflow.

Conclusion

We are at the point in the adoption curve of generative AI that teams are able to get real productivity gains from using the variety of techniques and tools available. In the near future, it will be imperative to take advantage of these productivity gains to stay competitive. One thing we do know is that the landscape will continue to rapidly progress and change, so building a system tolerant of change and flexibility is key. Developing your components in a modular fashion allows for stability in the face of an ever-changing technical landscape while being ready to adopt the latest technology at each step of the way.

For more information about how to get started building with LLMs, see these resources:

About the Authors

Ian Lenora is an experienced software development leader who focuses on building high-quality cloud native software, and exploring the potential of artificial intelligence. He has successfully led teams in delivering complex projects across various industries, optimizing efficiency and scalability. With a strong understanding of the software development lifecycle and a passion for innovation, Ian seeks to leverage AI technologies to solve complex problems and create intelligent, adaptive software solutions that drive business value.

Cody Collins is a New York-based Solutions Architect at Amazon Web Services, where he collaborates with ISV customers to build cutting-edge solutions in the cloud. He has extensive experience in delivering complex projects across diverse industries, optimizing for efficiency and scalability. Cody specializes in AI/ML technologies, enabling customers to develop ML capabilities and integrate AI into their cloud applications.

Samit Kumbhani is an AWS Senior Solutions Architect in the New York City area with over 18 years of experience. He currently collaborates with Independent Software Vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.

Inference AudioCraft MusicGen models using Amazon SageMaker

August 6, 2024

by Pavan Kumar Rao Navule Amazon AWS

Music generation models have emerged as powerful tools that transform natural language text into musical compositions. Originating from advancements in artificial intelligence (AI) and deep learning, these models are designed to understand and translate descriptive text into coherent, aesthetically pleasing music. Their ability to democratize music production allows individuals without formal training to create high-quality music by simply describing their desired outcomes.

Generative AI models are revolutionizing music creation and consumption. Companies can take advantage of this technology to develop new products, streamline processes, and explore untapped potential, yielding significant business impact. Such music generation models enable diverse applications, from personalized soundtracks for multimedia and gaming to educational resources for students exploring musical styles and structures. It assists artists and composers by providing new ideas and compositions, fostering creativity and collaboration.

One prominent example of a music generation model is AudioCraft MusicGen by Meta. MusicGen code is released under MIT, model weights are released under CC-BY-NC 4.0. MusicGen can create music based on text or melody inputs, giving you better control over the output. The following diagram shows how MusicGen, a single stage auto-regressive Transformer model, can generate high-quality music based on text descriptions or audio prompts.

MusicGen uses cutting-edge AI technology to generate diverse musical styles and genres, catering to various creative needs. Unlike traditional methods that include cascading several models, such as hierarchically or upsampling, MusicGen operates as a single language model, which operates over several streams of compressed discrete music representation (tokens). This streamlined approach empowers users with precise control over generating high-quality mono and stereo samples tailored to their preferences, revolutionizing AI-driven music composition.

MusicGen models can be used across education, content creation, and music composition. They can enable students to experiment with diverse musical styles, generate custom soundtracks for multimedia projects, and create personalized music compositions. Additionally, MusicGen can assist musicians and composers, fostering creativity and innovation.

This post demonstrates how to deploy MusicGen, a music generation model on Amazon SageMaker using asynchronous inference. We specifically focus on text conditioned generation of music samples using MusicGen models.

Solution overview

With the ability to generate audio, music, or video, generative AI models can be computationally intensive and time-consuming. Generative AI models with audio, music, and video output can use asynchronous inference that queues incoming requests and process them asynchronously. Our solution involves deploying the AudioCraft MusicGen model on SageMaker using SageMaker endpoints for asynchronous inference. This entails deploying AudioCraft MusicGen models sourced from the Hugging Face Model Hub onto a SageMaker infrastructure.

The following solution architecture diagram shows how a user can generate music using natural language text as an input prompt by using AudioCraft MusicGen models deployed on SageMaker.

The following steps detail the sequence happening in the workflow from the moment the user enters the input to the point where music is generated as output:

The user invokes the SageMaker asynchronous endpoint using an Amazon SageMaker Studio notebook.
The input payload is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket for inference. The payload consists of both the prompt and the music generation parameters. The generated music will be downloaded from the S3 bucket.
The facebook/musicgen-large model is deployed to a SageMaker asynchronous endpoint. This endpoint is used to infer for music generation.
The HuggingFace Inference Containers image is used as a base image. We use an image that supports PyTorch 2.1.0 with a Hugging Face Transformers framework.
The SageMaker HuggingFaceModel is deployed to a SageMaker asynchronous endpoint.
The Hugging Face model (facebook/musicgen-large) is uploaded to Amazon S3 during deployment. Also, during inference, the generated outputs are uploaded to Amazon S3.
We use Amazon Simple Notification Service (Amazon SNS) topics to notify the success and failure as defined as a part of SageMaker asynchronous inference configuration.

Prerequisites

Make sure you have the following prerequisites in place :

Confirm you have access to the AWS Management Console to create and manage resources in SageMaker, AWS Identity and Access Management (IAM), and other AWS services.
If you’re using SageMaker Studio for the first time, create a SageMaker domain. Refer to Quick setup to Amazon SageMaker to create a SageMaker domain with default settings.
Obtain the AWS Deep Learning Containers for Large Model Inference from pre-built HuggingFace Inference Containers.

Deploy the solution

To deploy the AudioCraft MusicGen model to a SageMaker asynchronous inference endpoint, complete the following steps:

Create a model serving package for MusicGen.
Create a Hugging Face model.
Define asynchronous inference configuration.
Deploy the model on SageMaker.

We detail each of the steps and show how we can deploy the MusicGen model onto SageMaker. For sake of brevity, only significant code snippets are included. The full source code for deploying the MusicGen model is available in the GitHub repo.

Create a model serving package for MusicGen

To deploy MusicGen, we first create a model serving package. The model package contains a requirements.txt file that lists the necessary Python packages to be installed to serve the MusicGen model. The model package also contains an inference.py script that holds the logic for serving the MusicGen model.

Let’s look at the key functions used in serving the MusicGen model for inference on SageMaker:

def model_fn(model_dir):
    '''loads model'''
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
    return model

The model_fn function loads the MusicGen model facebook/musicgen-large from the Hugging Face Model Hub. We rely on the MusicgenForConditionalGeneration Transformers module to load the pre-trained MusicGen model.

You can also refer to musicgen-large-load-from-s3/deploy-musicgen-large-from-s3.ipynb, which demonstrates the best practice of downloading the model from the Hugging Face Hub to Amazon S3 and reusing the model artifacts for future deployments. Instead of downloading the model every time from Hugging Face when we deploy or when scaling happens, we download the model to Amazon S3 and reuse it for deployment and during scaling activities. Doing so can improve the download speed, especially for large models, thereby helping prevent the download from happening over the internet from a website outside of AWS. This best practice also maintains consistency, which means the same model from Amazon S3 can be deployed across various staging and production environments.

The predict_fn function uses the data provided during the inference request and the model loaded through model_fn:

texts, generation_params = _process_input(data)
processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
inputs = processor (
    text = texts,
    padding=True,
    return_tensors="pt",
)

Using the information available in the data dictionary, we process the input data to obtain the prompt and generation parameters used to generate the music. We discuss the generation parameters in more detail later in this post.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
audio_values = model.generate(**inputs.to(device),
                                **generation_params)

We load the model to the device and then send the inputs and generation parameters as inputs to the model. This process generates the music in the form of a three-dimensional Torch tensor of shape (batch_size, num_channels, sequence_length).

sampling_rate = model.config.audio_encoder.sampling_rate
disk_wav_locations = _write_wavs_to_disk(sampling_rate, audio_values)
# Upload wavs to S3
result_dict["generated_outputs_s3"] = _upload_wav_files(disk_wav_locations, bucket_name)
# Clean up disk
for wav_on_disk in disk_wav_locations:
    _delete_file_on_disk(wav_on_disk)

We then use the tensor to generate .wav music and upload these files to Amazon S3 and clean up the .wav files saved on disk. We then obtain the S3 URI of the .wav files and send them locations in the response.

We now create the archive of the inference scripts and upload those to the S3 bucket:

musicgen_prefix = 'musicgen_large'
s3_model_key = f'{musicgen_prefix}/model/model.tar.gz'
s3_model_location = f"s3://{sagemaker_session_bucket}/{s3_model_key}"
s3 = boto3.resource("s3")
s3.Bucket(sagemaker_session_bucket).upload_file("model.tar.gz", s3_model_key)

The uploaded URI of this object on Amazon S3 will later be used to create the Hugging Face model.

Create the Hugging Face model

Now we initialize HuggingFaceModel with the necessary arguments. During deployment, the model serving artifacts, stored in s3_model_location, will be deployed. Before the model serving, the MusicGen model will be downloaded from Hugging Face as per the logic in model_fn.

huggingface_model = HuggingFaceModel(
    name=async_endpoint_name,
    model_data=s3_model_location,  # path to your model artifacts 
    role=role,
    env= {
           'TS_MAX_REQUEST_SIZE': '100000000',
           'TS_MAX_RESPONSE_SIZE': '100000000',
           'TS_DEFAULT_RESPONSE_TIMEOUT': '3600'
       },# iam role with permissions to create an Endpoint
    transformers_version="4.37",  # transformers version used
    pytorch_version="2.1",  # pytorch version used
    py_version="py310",  # python version used
)

The env argument accepts a dictionary of parameters such as TS_MAX_REQUEST_SIZE and TS_MAX_RESPONSE_SIZE, which define the byte size values for request and response payloads to the asynchronous inference endpoint. The TS_DEFAULT_RESPONSE_TIMEOUT key in the env dictionary represents the timeout in seconds after which the asynchronous inference endpoint stops responding.

You can run MusicGen with the Hugging Face Transformers library from version 4.31.0 onwards. Here we set transformers_version to 4.37. MusicGen requires at least PyTorch version 2.1 or latest, and we have set pytorch_version to 2.1.

Define asynchronous inference configuration

Music generation using a text prompt as input can be both computationally intensive and time-consuming. Asynchronous inference in SageMaker is designed to address these demands. When working with music generation models, it’s important to note that the process can often take more than 60 seconds to complete.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making it ideal for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near real-time latency requirements. By queuing incoming requests and processing them asynchronously, this capability efficiently handles the extended processing times inherent in music generation tasks. Moreover, asynchronous inference enables seamless auto scaling, making sure that resources are allocated only when needed, leading to cost savings.

Before we proceed with asynchronous inference configuration , we create SNS topics for success and failure that can be used to perform downstream tasks:

from utils.sns_client import SnsClient
import time
sns_client = SnsClient(boto3.client("sns"))
timestamp = time.time_ns()
topic_names = [f"musicgen-large-topic-SuccessTopic-{timestamp}", f"musicgen-large-topic-ErrorTopic-{timestamp}"]

topic_arns = []
for topic_name in topic_names:
    print(f"Creating topic {topic_name}.")
    response = sns_client.create_topic(topic_name)
    topic_arns.append(response.get('TopicArn'))

We now create an asynchronous inference endpoint configuration by specifying the AsyncInferenceConfig object:

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "musicgen_large/async_inference/output"
    ),  # Where our results will be stored
    # Add nofitication SNS if needed
    notification_config={
        "SuccessTopic": topic_arns[0],
        "ErrorTopic": topic_arns[1],
    },  #  Notification configuration
)

The arguments to the AsyncInferenceConfig are detailed as follows:

output_path – The location where the output of the asynchronous inference endpoint will be stored. The files in this location will have an .out extension and will contain the details of the asynchronous inference performed by the MusicGen model.
notification_config – Optionally, you can associate success and error SNS topics. Dependent workflows can poll these topics to make informed decisions based on the inference outcomes.

Deploy the model on SageMaker

With the asynchronous inference configuration defined, we can deploy the Hugging Face model, setting initial_instance_count to 1:

# deploy the endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
)

After successfully deploying, you can optionally configure automatic scaling to the asynchronous endpoint. With asynchronous inference, you can also scale down your asynchronous endpoint’s instances to zero.

We now dive into inferencing the asynchronous endpoint for music generation.

Inference

In this section, we show how to perform inference using an asynchronous inference endpoint with the MusicGen model. For the sake of brevity, only significant code snippets are included. The full source code for inferencing the MusicGen model is available in the GitHub repo. The following diagram explains the sequence of steps to invoke the asynchronous inference endpoint.

We detail the steps to invoke the SageMaker asynchronous inference endpoint for MusicGen by prompting a desired mood in natural language using English. We then demonstrate how to download and play the .wav files generated from the user prompt. Finally, we cover the process of cleaning up the resources created as part of this deployment.

Prepare prompt and instructions

For controlled music generation using MusicGen models, it’s important to understand various generation parameters:

generation_params = { 
    'guidance_scale': 3,
    'max_new_tokens': 1200, 
    'do_sample': True, 
    'temperature': 1 
}

From the preceding code, let’s understand the generation parameters:

guidance_scale – The guidance_scale is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (predicted from the text prompts) and the unconditional logits (predicted from an unconditional or ‘null’ prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting guidance_scale > 1. For best results, use guidance_scale = 3. Our deployment defaults to 3.
max_new_tokens – The max_new_tokens parameter specifies the number of new tokens to generate. Generation is limited by the sinusoidal positional embeddings to 30-second inputs, meaning MusicGen can’t generate more than 30 seconds of audio (1,503 tokens). Our deployment defaults to 256.
do_sample – The model can generate an audio sample conditioned on a text prompt through use of the MusicgenProcessor to preprocess the inputs. The preprocessed inputs can then be passed to the .generate method to generate text-conditional audio samples. Our deployment defaults to True.
temperature – This is the softmax temperature parameter. A higher temperature increases the randomness of the output, making it more diverse. Our deployment defaults to 1.

Let’s look at how to build a prompt to infer the MusicGen model:

data = {
    "texts": [
        "Warm and vibrant weather on a sunny day, feeling the vibes of hip hop and synth",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params
}

The preceding code is the payload, which will be saved as a JSON file and uploaded to an S3 bucket. We then provide the URI of the input payload during the asynchronous inference endpoint invocation along with other arguments as follows.

The texts key accepts an array of texts, which may contain the mood you want to reflect in your generated music. You can include musical instruments in the text prompt to the MusicGen model to generate music featuring those instruments.

The response from the invoke_endpoint_async is a dictionary of various parameters:

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_location,
    ContentType="application/json",
    InvocationTimeoutSeconds=3600
)

OutputLocation in the response metadata represents Amazon S3 URI where the inference response payload is stored.

Asynchronous music generation

As soon as the response metadata is sent to the client, the asynchronous inference begins the music generation. The music generation happens on the instance chosen during the deployment of the MusicGen model on the SageMaker asynchronous Inference endpoint , as detailed in the deployment section.

Continuous polling and obtaining music files

While the music generation is in progress, we continuously poll for the response metadata parameter OutputLocation:

from utils.inference_utils import get_output
output = get_output(sm_session, response.get('OutputLocation'))

The get_output function keeps polling for the presence of OutputLocation and returns the S3 URI of the .wav music file.

Audio output

Lastly, we download the files from Amazon S3 and play the output using the following logic:

from utils.inference_utils import play_output_audios
music_files = []
for s3_url in output.get('generated_outputs_s3'):
    if s3_url is not None:
        music_files.append(download_from_s3(s3_url))
play_output_audios(music_files, data.get('texts'))

You now have access to the .wav files and can try changing the generation parameters to experiment with various text prompts.

Audio-File-1

The following is another music sample based on the following generation parameters:

generation_params = { 'guidance_scale': 5, 'max_new_tokens': 1503, 'do_sample': True, 'temperature': 0.9 }
data = {
    "texts": [
        "Catchy funky beats with drums and bass, synthesized pop for an upbeat pop game",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params
}

Audio-File-2

Clean up

To avoid incurring unnecessary charges, you can clean up using the following code:

import boto3
sagemaker_runtime = boto3.client('sagemaker-runtime')

cleanup = False # < - Set this to True to clean up resources.
endpoint_name = <Endpoint_Name>

sm_client = boto3.client('sagemaker')
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_config_name = endpoint['EndpointConfigName']
endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = endpoint_config['ProductionVariants'][0]['ModelName']
notification_config = endpoint_config['AsyncInferenceConfig']['OutputConfig'].get('NotificationConfig', None)
print(f"""
About to delete the following sagemaker resources:
Endpoint: {endpoint_name}
Endpoint Config: {endpoint_config_name}
Model: {model_name}
""")
for k,v in notification_config.items():
    print(f'About to delete SNS topics for {k} with ARN: {v}')

if cleanup:
    # delete endpoint
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    # delete endpoint config
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    # delete model
    sm_client.delete_model(ModelName=model_name)
    print('deleted model, config and endpoint')

The aforementioned cleanup routine will delete the SageMaker endpoint, endpoint configurations, and models associated with MusicGen model, so that you avoid incurring unnecessary charges. Make sure to set cleanup variable to True, and replace <Endpoint_Name> with the actual endpoint name of the MusicGen model deployed on SageMaker. Alternatively, you can use the console to delete the endpoints and its associated resources that were created while running the code mentioned in the post.

Conclusion

In this post, we learned how to use SageMaker asynchronous inference to deploy the AudioCraft MusicGen model. We started by exploring how the MusicGen models work and covered various use cases for deploying MusicGen models. We also explored how you can benefit from capabilities such as auto scaling and the integration of asynchronous endpoints with Amazon SNS to power downstream tasks. We then took a deep dive into the deployment and inference workflow of MusicGen models on SageMaker, using the AWS Deep Learning Containers for HuggingFace inference and the MusicGen model sourced from the Hugging Face Hub.

Get started with generating music using your creative prompts by signing up for AWS. The full source code is available on the official GitHub repository.

References

About the Authors

Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on the AWS platform. He is specialized in architecting AI/ML and generative AI services at AWS. Pavan is a published author for the book “Getting Started with V Programming.” In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.

David John Chakram is a Principal Solutions Architect at AWS. He specializes in building data platforms and architecting seamless data ecosystems. With a profound passion for databases, data analytics, and machine learning, he excels at transforming complex data challenges into innovative solutions and driving businesses forward with data-driven insights.

Sudhanshu Hate is a principal AI/ML specialist with AWS and works with clients to advise them on their MLOps and generative AI journey. In his previous role before Amazon, he conceptualized, created, and led teams to build ground-up open source-based AI and gamification platforms, and successfully commercialized it with over 100 clients. Sudhanshu has to his credit a couple of patents, has written two books and several papers and blogs, and has presented his points of view in various technical forums. He has been a thought leader and speaker, and has been in the industry for nearly 25 years. He has worked with Fortune 1000 clients across the globe and most recently with digital native clients in India.

Rupesh Bajaj is a Solutions Architect at Amazon Web Services, where he collaborates with ISVs in India to help them leverage AWS for innovation. He specializes in providing guidance on cloud adoption through well-architected solutions and holds seven AWS certifications. With 5 years of AWS experience, Rupesh is also a Gen AI Ambassador. In his free time, he enjoys playing chess.

Build an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation

August 5, 2024

by Sandeep Singh Amazon AWS

Retrieval Augmented Generation (RAG) is a state-of-the-art approach to building question answering systems that combines the strengths of retrieval and foundation models (FMs). RAG models first retrieve relevant information from a large corpus of text and then use a FM to synthesize an answer based on the retrieved information.

An end-to-end RAG solution involves several components, including a knowledge base, a retrieval system, and a generation system. Building and deploying these components can be complex and error-prone, especially when dealing with large-scale data and models.

This post demonstrates how to seamlessly automate the deployment of an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation, enabling organizations to quickly and effortlessly set up a powerful RAG system.

Solution overview

The solution provides an automated end-to-end deployment of a RAG workflow using Knowledge Bases for Amazon Bedrock. We use AWS CloudFormation to set up the necessary resources, including :

An AWS Identity and Access Management (IAM) role
An Amazon OpenSearch Serverless collection and index
A knowledge base with its associated data source

The RAG workflow enables you to use your document data stored in an Amazon Simple Storage Service (Amazon S3) bucket and integrate it with the powerful natural language processing capabilities of FMs provided in Amazon Bedrock. The solution simplifies the setup process, allowing you to quickly deploy and start querying your data using the selected FM.

Prerequisites

To implement the solution provided in this post, you should have the following:

An active AWS account and familiarity with FMs, Amazon Bedrock, and OpenSearch Serverless.
An S3 bucket where your documents are stored in a supported format (.txt, .md, .html, .doc/docx, .csv, .xls/.xlsx, .pdf).
The Amazon Titan Embeddings G1-Text model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If the Amazon Titan Embeddings G1-Text model is enabled, the access status will show as Access granted, as shown in the following screenshot.

Set up the solution

When the prerequisite steps are complete, you’re ready to set up the solution:

Clone the GitHub repository containing the solution files:

git clone https://github.com/aws-samples/amazon-bedrock-samples.git

Navigate to the solution directory:

cd knowledge-bases/features-examples/04-infrastructure/e2e-rag-deployment-using-bedrock-kb-cfn

Run the sh script, which will create the deployment bucket, prepare the CloudFormation templates, and upload the ready CloudFormation templates and required artifacts to the deployment bucket:

bash deploy.sh

While running deploy.sh, if you provide a bucket name as an argument to the script, it will create a deployment bucket with the specified name. Otherwise, it will use the default name format: e2e-rag-deployment-${ACCOUNT_ID}-${AWS_REGION}

As shown in the following screenshot, if you complete the preceding steps in an Amazon SageMaker notebook instance, you can run the bash deploy.sh at the terminal, which creates the deployment bucket in your account (account number has been redacted).

After the script is complete, note the S3 URL of the main-template-out.yml.

On the AWS CloudFormation console, create a new stack.
For Template source, select Amazon S3 URL and enter the URL you copied earlier.
Choose Next.

Provide a stack name and specify the RAG workflow details according to your use case and then choose Next.

Leave everything else as default and choose Next on the following pages.

Review the stack details and select the acknowledgement check boxes.

Choose Submit to start the deployment process.

You can monitor the stack deployment progress on the AWS CloudFormation console.

Test the solution

When the deployment is successful (which may take 7–10 minutes to complete), you can start testing the solution.

On the Amazon Bedrock console, navigate to the created knowledge base.
Choose Sync to initiate the data ingestion job.

After data synchronization is complete, select the desired FM to use for retrieval and generation (it requires model access to be granted to this FM in Amazon Bedrock before using).

Start querying your data using natural language queries.

That’s it! You can now interact with your documents using the RAG workflow powered by Amazon Bedrock.

Clean up

To avoid incurring future charges, delete the resources used in this solution:

On the Amazon S3 console, manually delete the contents inside the bucket you created for template deployment, then delete the bucket.
On the AWS CloudFormation console, choose Stacks in the navigation pane, select the main stack, and choose Delete.

Your created knowledge base will be deleted when you delete the stack.

Conclusion

In this post, we introduced an automated solution for deploying an end-to-end RAG workflow using Knowledge Bases for Amazon Bedrock and AWS CloudFormation. By using the power of AWS services and the preconfigured CloudFormation templates, you can quickly set up a powerful question answering system without the complexities of building and deploying individual components for RAG applications. This automated deployment approach not only saves time and effort, but also provides a consistent and reproducible setup, enabling you to focus on utilizing the RAG workflow to extract valuable insights from your data.

Try it out and see firsthand how it can streamline your RAG workflow deployment and enhance efficiency. Please share your feedback to us!

About the Authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. With a keen interest in exploring new frontiers in the field, she continuously strives to push boundaries. Outside of work, she loves traveling, working out, and exploring new things.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Faster LLMs with speculative decoding and AWS Inferentia2

August 5, 2024

by Syl Taylor Amazon AWS

In recent years, we have seen a big increase in the size of large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models with more parameters, which are in the order of hundreds of billions at the time of writing, tend to produce better results. For example, Llama-3-70B, scores better than its smaller 8B parameters version on metrics like reading comprehension (SQuAD 85.6 compared to 76.4). Thus, customers often experiment with larger and newer models to build ML-based products that bring value.

However, the larger the model, the more computationally demanding it is, and the higher the cost to deploy. For example, on AWS Trainium, Llama-3-70B has a median per-token latency of 21.4 ms, while Llama-3-8B takes 4.7 ms. Similarly, Llama-2-70B has a median per-token latency of 20.6 ms, while Llama-2-7B takes 3.7 ms. Customers have to consider performance to ensure they meet their users’ needs. In this blog post, we will explore how speculative sampling can help make large language model inference more compute efficient and cost-effective on AWS Inferentia and Trainium. This technique improves LLM inference throughput and output token latency (TPOT).

Introduction

Modern language models are based on the transformer architecture. The input prompts are processed first using a technique called context encoding, which runs fast because it is parallelizable. Next, we perform auto-regressive token generation where the output tokens are generated sequentially. Note that we cannot generate the next token until we know the previous one, as depicted in Figure 1. Therefore, to generate N output tokens we need N serial runs through the decoder. A run takes longer through a larger model, like Llama-3-70B, than through a smaller model, like Llama-3-8B.

AWS Neuron speculative decoding - Sequential token generation in LLMs

Figure 1: Sequential token generation in LLMs

From a computational perspective, token generation in LLMs is a memory bandwidth-bound process. The larger the model, the more likely it is that we will wait on memory transfers. This results in underutilizing the compute units and not fully benefiting from the floating-point operations (FLOPS) available.

Speculative sampling

Speculative sampling is a technique that improves the computational efficiency for running inference with LLMs, while maintaining accuracy. It works by using a smaller, faster draft model to generate multiple tokens, which are then verified by a larger, slower target model. This verification step processes multiple tokens in a single pass rather than sequentially and is more compute efficient than processing tokens sequentially. Increasing the number of tokens processed in parallel increases the compute intensity because a larger number of tokens can be multiplied with the same weight tensor. This provides better performance compared with the non-speculative run, which is usually memory bandwidth-bound, and thus leads to better hardware resource utilization.

The speculative process involves an adjustable window k, where the target model provides one guaranteed correct token, and the draft model speculates on the next k-1 tokens. If the draft model’s tokens are accepted, the process speeds up. If not, the target model takes over, ensuring accuracy.

AWS Neuron speculative decoding - Case when all speculated tokens are accepted

Figure 2: Case when all speculated tokens are accepted

Figure 2 illustrates a case where all speculated tokens are accepted, resulting in faster processing. The target model provides a guaranteed output token, and the draft model runs multiple times to produce a sequence of possible output tokens. These are verified by the target model and subsequently accepted by a probabilistic method.

AWS Neuron speculative decoding - Case when some speculated tokens are rejected

Figure 3: Case when some speculated tokens are rejected

On the other hand, Figure 3 shows a case where some of the tokens are rejected. The time it takes to run this speculative sampling loop is the same as in Figure 2, but we obtain fewer output tokens. This means we will be repeating this process more times to complete the response, resulting in slower overall processing.

By adjusting the window size k and understanding when the draft and target models are likely to produce similar results, we can maximize the benefits of speculative sampling.

A Llama-2-70B/7B demonstration

We will show how speculative sampling works on Inferentia2-powered Amazon EC2 Inf2 instances and Trainium-powered EC2 Trn1 instances. We will be using a sample where we generate text faster with Llama-2-70B by using a Llama-2-7B model as a draft model. The example walk-through is based on Llama-2 models, but you can follow a similar process for Llama-3 models as well.

Loading models

You can load the Llama-2 models using data type bfloat16. The draft model needs to be loaded in a standard way like in the example below. The parameter n_positions is adjustable and represents the maximum sequence length you want to allow for generation. The only batch_size we support for speculative sampling at the time of writing is 1. We will explain tp_degree later in this section.

draft_model = LlamaForSampling.from_pretrained('Llama-2-7b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')

The target model should be loaded in a similar way, but with speculative sampling functionality enabled. The value k was described previously.

target_model = LlamaForSampling.from_pretrained('Llama-2-70b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')
target_model.enable_speculative_decoder(k)

Combined, the two models need almost 200 GB of device memory for the weights with additional memory in the order of GBs needed for key-value (KV) caches. If you prefer to use the models with float32 parameters, they will need around 360 GB of device memory. Note that the KV caches grow linearly with sequence length (input tokens + tokens yet to be generated). Use neuron-top to see the memory utilization live. To accommodate for these memory requirements, we’ll need either the largest Inf2 instance (inf2.48xlarge) or largest Trn1 instance (trn1.32xlarge).

Because of the size of the models, their weights need to be distributed amongst the NeuronCores using a technique called tensor parallelism. Notice that in the sample provided, tp_degree is used per model to specify how many NeuronCores that model should use. This, in turn, affects the memory bandwidth utilization, which is critical for token generation performance. A higher tp_degree can lead to better bandwidth utilization and improved throughput. The topology for Trn1 requires that tp_degree is set to 1, 2, 8, 16 or a multiple of 32. For Inf2, it needs to be 1 or multiples of 2.

The order in which you load the models also matters. After a set of NeuronCores has been initialized and allocated for one model, you cannot use the same NeuronCores for another model unless it’s the exact same set. If you try to use only some of the NeuronCores that were previously initialized, you will get an nrt_load_collectives - global nec_comm is already init'd error.

Let’s go through two examples on trn1.32xlarge (32 NeuronCores) to understand this better. We will calculate how many NeuronCores we need per model. The formula used is the observed model size in memory, using neuron-top, divided by 16GB which is the device memory per NeuronCore.

If we run the models using bfloat16, we need more than 10 NeuronCores for Llama-2-70B and more than 2 NeuronCores for Llama-2-7B. Because of topology constraints, it means we need at least tp_degree=16 for Llama-2-70B. We can use the remaining 16 NeuronCores for Llama-2-7B. However, because both models fit in memory across 32 NeuronCores, we should set tp_degree=32 for both, to speed-up the model inference for each.
If we run the models using float32, we need more than 18 NeuronCores for Llama-2-70B and more than 3 NeuronCores for Llama-2-7B. Because of topology constraints, we have to set tp_degree=32 for Llama-2-70B. That means Llama-2-7B needs to re-use the same set of NeuronCores, so you need to set tp_degree=32 for Llama-2-7B too.

Walkthrough

The decoder we’ll use from transformers-neuronx is LlamaForSampling, which is suitable for loading and running Llama models. You can also use NeuronAutoModelForCausalLM which will attempt to auto-detect which decoder to use. To perform speculative sampling, we need to create a speculative generator first which takes two models and the value k described previously.

spec_gen = SpeculativeGenerator(draft_model, target_model, k)

We invoke the inferencing process by calling the following function:

spec_gen.sample(input_ids=input_token_ids, sequence_length=total_output_length)

During sampling, there are several hyper-parameters (for example: temperature, top_p, and top_k) that affect if the output is deterministic across multiple runs. At the time of writing, the speculative sampling implementation sets default values for these hyper-parameters. With these values, expect randomness in results when you run a model multiple times, even if it’s with the same prompt. This is normal intended behavior for LLMs because it improves their qualitative responses.

When you run the sample, you will use the default token acceptor, based on the DeepMind paper which introduced speculative sampling, which uses a probabilistic method to accept tokens. However, you can also implement a custom token acceptor, which you can pass as part of the acceptor parameter when you initialize the SpeculativeGenerator. You would do this if you wanted more deterministic responses, for example. See the implementation of the DefaultTokenAcceptor class in transformers-neuronx to understand how to write your own.

Conclusion

As more developers look to incorporate LLMs into their applications, they’re faced with a choice of using larger, more costly, and slower models that will deliver higher quality results. Or they can use smaller, less expensive and faster models that might reduce quality of answers. Now, with AWS artificial intelligence (AI) chips and speculative sampling, developers don’t have to make that choice. They can take advantage of the high-quality outputs of larger models and the speed and responsiveness of smaller models.

In this blog post, we have shown that we can accelerate the inference of large models, such as Llama-2-70B, by using a new feature called speculative sampling.

To try it yourself, check out the speculative sampling example, and tweak the input prompt and k parameter to see the results you get. For more advanced use cases, you can develop your own token acceptor implementation. To learn more about running your models on Inferentia and Trainium instances, see the AWS Neuron documentation. You can also visit repost.aws AWS Neuron channel to discuss your experimentations with the AWS Neuron community and share ideas.

About the Authors

Syl Taylor is a Specialist Solutions Architect for Efficient Compute. She advises customers across EMEA on Amazon EC2 cost optimization and improving application performance using AWS-designed chips. Syl previously worked in software development and AI/ML for AWS Professional Services, designing and implementing cloud native solutions. She’s based in the UK and loves spending time in nature.

Emir Ayar is a Senior Tech Lead Solutions Architect with the AWS Prototyping team. He specializes in assisting customers with building ML and generative AI solutions, and implementing architectural best practices. He supports customers in experimenting with solution architectures to achieve their business objectives, emphasizing agile innovation and prototyping. He lives in Luxembourg and enjoys playing synthesizers.

Catalog, query, and search audio programs with Amazon Transcribe and Knowledge Bases for Amazon Bedrock

August 5, 2024

by Nolan Chen Amazon AWS

Information retrieval systems have powered the information age through their ability to crawl and sift through massive amounts of data and quickly return accurate and relevant results. These systems, such as search engines and databases, typically work by indexing on keywords and fields contained in data files.

However, much of our data in the digital age also comes in non-text format, such as audio and video files. Finding relevant content usually requires searching through text-based metadata such as timestamps, which need to be manually added to these files. This can be hard to scale as the volume of unstructured audio and video files continues to grow.

Fortunately, the rise of artificial intelligence (AI) solutions that can transcribe audio and provide semantic search capabilities now offer more efficient solutions for querying content from audio files at scale. Amazon Transcribe is an AWS AI service that makes it straightforward to convert speech to text. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

In this post, we show how Amazon Transcribe and Amazon Bedrock can streamline the process to catalog, query, and search through audio programs, using an example from the AWS re:Think podcast series.

Solution overview

The following diagram illustrates how you can use AWS services to deploy a solution for cataloging, querying, and searching through content stored in audio files.

In this solution, audio files stored in mp3 format are first uploaded to Amazon Simple Storage Service (Amazon S3) storage. Video files (such as mp4) that contain audio in supported languages can also be uploaded to Amazon S3 as part of this solution. Amazon Transcribe will then transcribe these files and store the entire transcript in JSON format as an object in Amazon S3.

To catalog these files, each JSON file in Amazon S3 should be tagged with the corresponding episode title. This allows us to later retrieve the episode title for each query result.

Next, we use Amazon Bedrock to create numerical representations of the content inside each file. These numerical representations are also called embeddings, and they’re stored as vectors inside a vector database that we can later query.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API. Included with Amazon Bedrock is Knowledge Bases for Amazon Bedrock. As a fully managed service, Knowledge Bases for Amazon Bedrock makes it straightforward to set up a Retrieval Augmented Generation (RAG) workflow.

With Knowledge Bases for Amazon Bedrock, we first set up a vector database on AWS. Knowledge Bases for Amazon Bedrock can then automatically split the data files stored in Amazon S3 into chunks and then create embeddings of each chunk using Amazon Titan on Amazon Bedrock. Amazon Titan is a family of high-performing FMs from Amazon. Included with Amazon Titan is Amazon Titan Text Embeddings, which we use to create the numerical representation of the text inside each chunk and store them in a vector database.

When a user queries the contents of the audio files through a generative AI application or AWS Lambda function, it makes an API call to Knowledge Bases for Amazon Bedrock. Knowledge Bases for Amazon Bedrock will then orchestrate a call to the vector database to perform a semantic search, which returns the most relevant results. Next, Knowledge Bases for Amazon Bedrock augments the user’s original query with these results to a prompt, which is sent to the large language model (LLM). The LLM will return results that are more accurate and relevant to the user query.

Let’s walk through an example of how you can catalog, query, and search through a library of audio files using these AWS AI services. For this post, we use episodes of the re:Think podcast series, which has over 20 episodes. Each episode is an audio program recorded in mp3 format. As we continue to add new episodes, we will want to use AI services to make the task of querying and searching for specific content more scalable without the need to manually add metadata for each episode.

Prerequisites

In addition to having access to AWS services through the AWS Management Console, you need a few other resources to deploy this solution.

First, you need a library of audio files to catalog, query, and search. For this post, we use episodes of the AWS re:Think podcast series.

To make API calls to Amazon Bedrock from our generative AI application, we use Python version 3.11.4 and the AWS SDK for Python (Boto3).

Transcribe audio files

The first task is to transcribe each mp3 file using Amazon Transcribe. For instructions on transcribing with the AWS Management Console or AWS CLI, refer to the Amazon Transcribe Developer guide. Amazon Transcribe can create a transcript for each episode and store it as an S3 object in JSON format.

Catalog audio files using tagging

To catalog each episode, we tag the S3 object for each episode with the corresponding episode title. For instructions on tagging objects in S3, refer to the Amazon Simple Storage Service User Guide. For example, for the S3 object AI-Accelerators.json, we tag it with key = “title” and value = “Episode 20: AI Accelerators in the Cloud.”

The title is the only metadata we need to manually add for each audio file. There is no need to manually add timestamps for each chapter or section in order to later search for specific content.

Set up a vector database using Knowledge Bases for Amazon Bedrock

Next, we set up our fully managed RAG workflow using Knowledge Bases for Amazon Bedrock. For instructions on creating a knowledge base, refer to the Amazon Bedrock User Guide. We begin by specifying a data source. In our case, we choose the S3 bucket location where our transcripts in JSON format are stored.

Next, we select an embedding model. The embedding model will convert each chunk of our transcript into embeddings. Embeddings are numbers, and the meaning of each embedding depends on the model. In our example, we select Titan Text Embeddings v2 with a dimension size of 1024.

The embeddings are stored as vectors in a vector database. You can either specify an existing vector database you have already created or have Knowledge Bases for Amazon Bedrock create one for you. For our example, we have Knowledge Bases for Amazon Bedrock create a vector database using Amazon OpenSearch Serverless.

Before you can query the vector database, you must first sync it with the data source. During each sync operation, Knowledge Bases for Amazon Bedrock will split the data source into chunks and then use the selected embedding model to embed each chunk as a vector. Knowledge Bases for Amazon Bedrock will then store these vectors in the vector database.

The sync operation as well as other Amazon Bedrock operations described so far can be performed either using the console or API calls.

Query the audio files

Now we’re ready to query and search for specific content from our library of podcast episodes. In episode 20, titled “AI Accelerators in the Cloud,” our guest Matthew McClean, a senior manager from AWS’s Annapurna team, shared why AWS decided to buy Annapurna Labs in 2015. For our first query, we ask, “Why did AWS acquire Annapurna Labs?”

We entered this query into Knowledge Bases for Amazon Bedrock using Anthropic Claude and got the following response:

“AWS acquired Annapurna Labs in 2015 because Annapurna was providing AWS with nitro cards that offloaded virtualization, security, networking and storage from EC2 instances to free up CPU resources.”

This is an exact quote from Matthew McClean in the podcast episode. You wouldn’t get this quote if you had entered the same prompt into other publicly available generative AI chatbots because they don’t have the vector database with embeddings of the podcast transcript to provide more relevant context.

Retrieve an episode title

Now let’s suppose that in addition to getting more relevant responses, we also want to retrieve the correct podcast episode title that was relevant to this query from our catalog of podcast episodes.

To retrieve the episode title, we first use the most relevant data chunk from the query. Whenever Knowledge Bases for Amazon Bedrock responds to a query, it also provides one or more chunks of data that it retrieved from the vector database that were most relevant to the query in order of relevance. We can take the first chunk that was returned. These chunks are returned as JSON documents. Nested inside the JSON is the S3 location of the transcript object. In our example, the S3 location is s3://rethinkpodcast/text/transcripts/AI-Accelerators.json.

The first words in the chunk text are: “Yeah, sure. So maybe I can start with the history of Annapurna…”

Because we have already tagged this transcript object in Amazon S3 with the episode title, we can retrieve the title by retrieving the value of the tag where key = “title”. In this case, the title is “Episode 20: AI Accelerators in the Cloud.”

Search the start time

What if we also want to search and find the start time inside the episode where the relevant content begins? We want to do so without having to manually read through the transcript or listen to the episode from the beginning, and without manually adding timestamps for every chapter.

We can find the start time much faster by having our generative AI application make a few more API calls. We start by treating the chunk text as a substring of the entire transcript. We then search for the start time of the first word in the chunk text.

In our example, the first words returned were “Yeah, sure. So maybe I can start with the history of Annapurna…” We now need to search the entire transcript for the start time of the word “Yeah.”

Amazon Transcribe outputs the start time of every word in the transcript. However, any word can appear more than once. The word “Yeah” occurs 28 times in the transcript, and each occurrence has its own start time. So how do we determine the correct start time for “Yeah” in our example?

There are multiple approaches an application developer can use to find the correct start time. For our example, we use the Python string find() method to find the position of the chunk text within the entire transcript.

For the chunk text that begins with “Yeah, sure. So maybe I can start with the history of Annapurna…” the find() method returned the position as 2047. If we treat the transcript as one long text string, the chunk “Yeah, sure. So maybe…” starts at character position 2047.

Finding the start time now becomes a matter of counting the character position of each word in the transcript and using it to look up the correct start time from the transcript file generated by Amazon Transcribe. This may be tedious for a person to do manually, but trivial for a computer.

In our example Python code, we loop through an array that contains the start time for each token while counting the number of the character position that each token starts at. Because we’re looping through the tokens, we can build a new array that stores the start time for each character position.

In this example query, the start time for the word “Yeah” at position 2047 is 160 seconds, or 2 minutes and 40 seconds into the podcast. You can check the recording starting at 2 minutes 40 seconds.

Clean up

This solution incurs charges based on the services you use:

Amazon Transcribe operates under a pay-as-you-go pricing model. For more details, see Amazon Transcribe Pricing.
Amazon Bedrock uses an on-demand quota, so you only pay for what you use. For more information, refer to Amazon Bedrock pricing.
With OpenSearch Serverless, you only pay for the resources consumed by your workload.
If you’re using Knowledge Bases for Amazon Bedrock with other vector databases besides OpenSearch Serverless, you may continue to incur charges even when not running any queries. It is recommended you delete your knowledge base and its associated vector store along with audio files stored in Amazon S3 to avoid unnecessary costs when you’re done testing this solution.

Conclusion

Cataloging, querying, and searching through large volumes of audio files can be difficult to scale. In this post, we showed how Amazon Transcribe and Knowledge Bases for Amazon Bedrock can help automate and make the process of retrieving relevant information from audio files more scalable.

You can begin transcribing your own library of audio files with Amazon Transcribe. To learn more on how Knowledge Bases for Amazon Bedrock can then orchestrate a RAG workflow for your transcripts with vector stores, refer to Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock.

With the help of these AI services, we can now expand the frontiers of our knowledge bases.

About the Author

Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor’s degree in Mechanical Engineering from Princeton University.

Cepsa Química improves the efficiency and accuracy of product stewardship using Amazon Bedrock

August 2, 2024

by Vicente Cruz Mínguez Amazon AWS

This is a guest post co-written with Vicente Cruz Mínguez, Head of Data and Advanced Analytics at Cepsa Química, and Marcos Fernández Díaz, Senior Data Scientist at Keepler.

Generative artificial intelligence (AI) is rapidly emerging as a transformative force, poised to disrupt and reshape businesses of all sizes and across industries. Generative AI empowers organizations to combine their data with the power of machine learning (ML) algorithms to generate human-like content, streamline processes, and unlock innovation. As with all other industries, the energy sector is impacted by the generative AI paradigm shift, unlocking opportunities for innovation and efficiency. One of the areas where generative AI is rapidly showing its value is the streamlining of operational processes, reducing costs, and enhancing overall productivity.

In this post, we explain how Cepsa Química and partner Keepler have implemented a generative AI assistant to increase the efficiency of the product stewardship team when answering compliance queries related to the chemical products they market. To accelerate development, they used Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy and safety.

Cepsa Química, a world leader in the manufacturing of linear alkylbenzene (LAB) and ranking second in the production of phenol, is a company aligned with Cepsa’s Positive Motion strategy for 2030, contributing to the decarbonization and sustainability of its processes through the use of renewable raw materials, development of products with less carbon, and use of waste as raw materials.

At Cepsa’s Digital, IT, Transformation & Operational Excellence (DITEX) department, we work on democratizing the use of AI within our business areas so that it becomes another lever for generating value. Within this context, we identified product stewardship as one of the areas with more potential for value creation through generative AI. We partnered with Keepler, a cloud-centered data services consulting company specialized in the design, construction, deployment, and operation of advanced public cloud analytics custom-made solutions for large organizations, in the creation of the first generative AI solution for one of our corporate teams.

The Safety, Sustainability & Energy Transition team

The Safety, Sustainability & Energy Transition area of Cepsa Química is responsible for all human health, safety, and environmental aspects related to the products manufactured by the company and the associated raw materials, among others. In this field, its areas of action are product safety, regulatory compliance, sustainability, and customer service around safety and compliance.

One of the responsibilities of the Safety, Sustainability & Energy Transition team is product stewardship, which takes care of regulatory compliance of the marketed products. The Product Stewardship department is responsible for managing a large collection of regulatory compliance documents. Their duty involves determining which regulations apply to each specific product in the company’s portfolio, compiling a list of all the applicable regulations for a given product, and supporting other internal teams that might have questions related to these products and regulations. Example questions might be “What are the restrictions for CMR substances?”, “How long do I need to keep the documents related to a toluene sale?”, or “What is the reach characterization ratio and how do I calculate it?” The regulatory content required to answer these questions varies over time, introducing new clauses and repealing others. This work used to consume a significant percentage of the team’s time, so they identified an opportunity to generate value by reducing the search time for regulatory consultations.

The DITEX department engaged with the Safety, Sustainability & Energy Transition team for a preliminary analysis of their pain points and deemed it feasible to use generative AI techniques to speed up the resolution of compliance queries faster. The analysis was conducted for queries based on both unstructured (regulatory documents and product specs sheets) and structured (product catalog) data.

An approach to product stewardship with generative AI

Large language models (LLMs) are trained with vast amounts of information crawled from the internet, capturing considerable knowledge from multiple domains. However, their knowledge is static and tied to the data used during the pre-training phase.

To overcome this limitation and provide dynamism and adaptability to knowledge base changes, we decided to follow a Retrieval Augmented Generation (RAG) approach, in which the LLMs are presented with relevant information extracted from external data sources to provide up-to-date data without the need to retrain the models. This approach is a great fit for a scenario where regulatory information is updated at a fast pace, with frequent derogations, amendments, and new regulations being published.

Additionally, the RAG-based approach enables rapid prototyping of document search use cases, allowing us to craft a solution based on regulatory information about chemical substances in a few weeks.

The solution we built is based on four main functional blocks:

Input processing – Input regulatory PDF documents are preprocessed to extract the relevant information. Each document is divided into chunks to ease the indexing and retrieval processes based on semantic meaning.
Embeddings generation – An embeddings model is used to encode the semantic information of each chunk into an embeddings vector, which is stored in a vector database, enabling similarity search of user queries.
LLM chain service – This service orchestrates the solution by invoking the LLM models with a fitting prompt and creating the response that is returned to the user.
User interface – A conversational chatbot enables interaction with users.

We divided the solution into two independent modules: one to batch process input documents and another one to answer user queries by running inference.

Batch ingestion module

The batch ingestion module performs the initial processing of the raw compliance documents and product catalog and generates the embeddings that will be later used to answer user queries. The following diagram illustrates this architecture.

The batch ingestion module performs the following tasks:

AWS Glue, a serverless data integration service, is used to run periodical extract, transform, and load (ETL) jobs that read input raw documents and the product catalog from Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance.
The AWS Glue job calls Amazon Textract, an ML service that automatically extracts text, handwriting, layout elements, and data from scanned documents, to process the input PDF documents. After data is extracted, the job performs document chunking, data cleanup, and postprocessing.
The AWS Glue job uses Amazon Bedrock to generate vector embeddings for each document chunk using the Amazon Titan Text Embeddings
Amazon Aurora PostgreSQL-Compatible Edition, a fully managed, PostgreSQL-compatible, and ACID-compliant relational database engine to store the extracted embeddings, is used with the pgvector extension enabled for efficient similarity searches.

Inference module

The inference module transforms user queries into embeddings, retrieves relevant document chunks from the knowledge base using similarity search, and prompts an LLM with the query and retrieved chunks to generate a contextual response. The following diagram illustrates this architecture.

The inference module implements the following steps:

Users interact through a web portal, which consists of a static website stored in Amazon S3, served through Amazon CloudFront, a content delivery network (CDN), and secured with AWS Cognito, a customer identity and access management platform.
Queries are sent to the backend using a REST API defined in Amazon API Gateway, a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale, and implemented through an API Gateway private integration. The backend is implemented by an LLM chain service running on AWS Fargate, a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. This service orchestrates the interaction with the different LLMs using the LangChain
The LLM chain service invokes Amazon Titan Text Embeddings on Amazon Bedrock to generate the embeddings for the user query.
Based on the query embeddings, the relevant documents are retrieved from the embeddings database using similarity search.
The service composes a prompt that includes the user query and the documents extracted from the knowledge base. The prompt is sent to Anthropic Claude 2.0 on Amazon Bedrock, and the model answer is sent back to the user.

Note on the RAG implementation

The product stewardship chatbot was built before Knowledge Bases for Amazon Bedrock was generally available. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows. Knowledge Bases manages the initial vector store set up, handles the embedding and querying, and provides source attribution and short-term memory needed for production RAG applications.

With Knowledge Bases for Amazon Bedrock, the implementation of steps 3–4 of the Batch Ingestion and Inference modules can be significantly simplified.

Challenges and solutions

In this section, we discuss the challenges we encountered during the development of the system and the decisions we made to overcome those challenges.

Data preprocessing and chunking strategy

We discovered that the input documents contained a variety of structural complexities, which posed a challenge in the processing stage. For instance, some tables contain large amounts of information with minimal context except for the header, which is displayed at the top of the table. This can make it complex to obtain the right answers to user queries, because the retrieval process might lack context.

Additionally, some document annexes are linked to other sections of the document or even other documents, leading to incomplete data retrieval and generation of inaccurate answers.

To address these challenges, we implemented three mitigation strategies:

Data chunking – We decided to use larger chunk sizes with significant overlaps to provide maximum context for each chunk during ingestion. However, we set an upper limit to avoid losing the semantic meaning of the chunk.
Model selection – We selected a model with a large context window to generate responses that take a larger context into account. Anthropic Claude 2.0 on Amazon Bedrock, with a 100 K context window, provided the most accurate results. (The system was built before Anthropic Claude 2.1 or the Anthropic Claude 3 model family were available on Amazon Bedrock).
Query variants – Prior to retrieving documents from the database, multiple variants of the user query are generated using an LLM. Documents for all variants are retrieved and deduplicated before being provided as context for the LLM query.

These three strategies significantly enhanced the retrieval and response accuracy of the RAG system.

Evaluation of results and process refinement

Evaluating the responses from the LLM models is another challenge that is not found in traditional AI use cases. Because of the free text nature of the output, it’s difficult to assess and compare different responses in terms of a metric or KPI, leading to a manual review in most cases. However, a manual process is time-consuming and not scalable.

To minimize the drawbacks, we created a benchmarking dataset with the help of seasoned users, containing the following information:

Representative questions that require data combined from different documents
Ground truth answers for each question
References to the source documents, pages, and line numbers where the right answers are found

Then we implemented an automatic evaluation system with Anthropic Claude 2.0 on Amazon Bedrock, with different prompting strategies to evaluate document retrieval and response formation. This approach allowed for adjustment of different parameters in a fast and automated manner:

Preprocessing – Tried different values for chunk size and overlap size
Retrieval – Tested several retrieval techniques of incremental complexity
Querying – Ran the tests with different LLMs hosted on Amazon Bedrock:
- Amazon Titan Text Premier
- Cohere Command v1.4
- Anthropic Claude Instant
- Anthropic Claude 2.0

The final solution consists of three chains: one for translating the user query into English, one for generating variations of the input question, and one for composing the final response.

Achieved improvements and next steps

We built a conversational interface for the Safety, Sustainability & Energy Transition team that helps the product stewardship team be more efficient and obtain answers to compliance queries faster. Furthermore, the answers contain references to the input documents used by the LLM to generate the reply, so the team can double-check the response and find additional context if it’s needed. The following screenshot shows an example of the conversational interface.

Some of the qualitative and quantitative improvements identified by the product stewardship team through the use of the solution are:

Query times – The following table summarizes the search time saved by query complexity and user seniority (considering all search times have been reduced to less than 1 minute).

Complexity	Time saved (minutes)
Complexity	Junior user	Senior user
Low	3.3	2
Medium	9.25	4
High	28	10

Answer quality – The implemented system offers additional context and document references that are used by the users to improve the quality of the answer.
Operational efficiency – The implemented system has accelerated the regulatory query process, directly enhancing the department operational efficiency.

From the DITEX department, we’re currently working with other business areas at Cepsa Química to identify similar use cases to help create a corporate-wide tool that reuses components from this first initiative and generalizes the use of generative AI across business functions.

Conclusion

In this post, we shared how Cepsa Química and partner Keepler have implemented a generative AI assistant that uses Amazon Bedrock and RAG techniques to process, store, and query the corpus of knowledge related to product stewardship. As a result, users save up to 25 percent of their time when they use the assistant to solve compliance queries.

If you want your business to get started with generative AI, visit Generative AI on AWS and connect with a specialist, or quickly build a generative AI application in PartyRock.

About the authors

Vicente Cruz Mínguez is the Head of Data & Advanced Analytics at Cepsa Química. He has more than 8 years of experience with big data and machine learning projects in financial, retail, energy, and chemical industries. He is currently leading the Data, Advanced Analytics & Cloud Development team in the Digital, IT, Transformation & Operational Excellence department at Cepsa Química, with a focus in feeding the corporate data lake and democratizing data for analysis, machine learning projects, and business analytics. Since 2023, he has also been working on scaling the use of generative AI in all departments.

Marcos Fernández Díaz is a Senior Data Scientist at Keepler, with 10 years of experience developing end-to-end machine learning solutions for different clients and domains, including predictive maintenance, time series forecasting, image classification, object detection, industrial process optimization, and federated machine learning. His main interests include natural language processing and generative AI. Outside of work, he is a travel enthusiast.

Guillermo Menéndez Corral is a Sr. Manager, Solutions Architecture at AWS for Energy and Utilities. He has over 18 years of experience designing and building software products and currently helps AWS customers in the energy industry harness the power of the cloud through innovation and modernization.

GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs

August 2, 2024

by Xiang Song Amazon AWS

GraphStorm is a low-code enterprise graph machine learning (GML) framework to build, train, and deploy graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly take into account the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world data, including fraud detection scenarios, recommendations, community detection, and search/retrieval problems.

Today, we are launching GraphStorm 0.3, adding native support for multi-task learning on graphs. Specifically, GraphStorm 0.3 allows you to define multiple training targets on different nodes and edges within a single training loop. In addition, GraphStorm 0.3 adds new APIs to customize GraphStorm pipelines: you now only need 12 lines of code to implement a custom node classification training loop. To help you get started with the new API, we have published two Jupyter notebook examples: one for node classification, and one for a link prediction task. We also released a comprehensive study of co-training language models (LM) and graph neural networks (GNN) for large graphs with rich text features using the Microsoft Academic Graph (MAG) dataset from our KDD 2024 paper. The study showcases the performance and scalability of GraphStorm on text rich graphs and the best practices of configuring GML training loops for better performance and efficiency.

Native support for multi-task learning on graphs

Many enterprise applications have graph data associated with multiple tasks on different nodes and edges. For example, retail organizations want to conduct fraud detection on both sellers and buyers. Scientific publishers want to find more related works to cite in their papers and need to select the right subject for their publication to be discoverable. To better model such applications, customers have asked us to support multi-task learning on graphs.

GraphStorm 0.3 supports multi-task learning on graphs with six most common tasks: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. You can specify the training targets through a YAML configuration file. For example, a scientific publisher can use the following YAML configuration to simultaneously define a paper subject classification task on paper nodes and a link prediction task on paper-citing-paper edges for the scientific publisher use case:

version: 1.0
    gsf:
        basic: # basic settings of the backbone GNN model
            ...
        ...
        multi_task_learning:
            - node_classification:         # define a node classification task for paper subject prediction.
                target_ntype: "paper"      # the paper nodes are the training targets.
                label_field: "label_class" # the node feature "label_class" contains the training labels.
				mask_fields:
                    - "train_mask_class"   # train mask is named as train_mask_class.
                    - "val_mask_class"     # validation mask is named as val_mask_class.
                    - "test_mask_class"    # test mask is named as test_mask_class.
                num_classes: 10            # There are total 10 different classes (subject) to predict.
                task_weight: 1.0           # The task weight is 1.0.
                
            - link_prediction:                # define a link prediction paper citation recommendation.
                num_negative_edges: 4         # Sample 4 negative edges for each positive edge during training
                num_negative_edges_eval: 100  # Sample 100 negative edges for each positive edge during evaluation
                train_negative_sampler: joint # Share the negative edges between positive edges (to speedup training)
                train_etype:
                    - "paper,citing,paper"    # The target edge type for link prediction training is "paper, citing, paper"
                mask_fields:
                    - "train_mask_lp"         # train mask is named as train_mask_lp.
                    - "val_mask_lp"           # validation mask is named as val_mask_lp.
                    - "test_mask_lp"          # test mask is named as test_mask_lp.
                task_weight: 0.5              # The task weight is 0.5.

For more details about how to run graph multi-task learning with GraphStorm, refer to Multi-task Learning in GraphStorm in our documentation.

New APIs to customize GraphStorm pipelines and components

Since GraphStorm’s release in early 2023, customers have mainly used its command line interface (CLI), which abstracts away the complexity of the graph ML pipeline for you to quickly build, train, and deploy models using common recipes. However, customers are telling us that they want an interface that allows them to customize the training and inference pipeline of GraphStorm to their specific requirements more easily. Based on customer feedback for the experimental APIs we released in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML pipeline APIs. With the new APIs, you only need 12 lines of code to define a custom node classification training pipeline, as illustrated by the following example:

import graphstorm as gs
gs.initialize()

acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

train_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']), fanout=[20, 20], batch_size=64)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)
test_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_test_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)

model = RgcnNCModel(g=acm_data.g, num_hid_layers=2, hid_size=128, num_classes=14)
evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)

trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.setup_evaluator(evaluator)

trainer.fit(train_dataloader, val_dataloader, test_dataloader, num_epochs=5)

To help you get started with the new APIs, we also have released new Jupyter notebook examples in our Documentation and Tutorials page.

Comprehensive study of LM+GNN for large graphs with rich text features

Many enterprise applications have graphs with text features. In retail search applications, for example, shopping log data provides insights on how text-rich product descriptions, search queries, and customer behavior are related. Foundational large language models (LLMs) alone are not suitable to model such data because the underlying data distributions and relationships don’t correspond to what LLMs learn from their pre-training data corpuses. GML, on the other hand, is great for modeling related data (graphs) but until now, GML practitioners had to manually combine their GML models with LLMs to model text features and get the best performance for their use cases. Especially when the underlying graph dataset was large, this manual work was challenging and time-consuming.

In GraphStorm 0.2, GraphStorm introduced built-in techniques to train language models (LMs) and GNN models together efficiently at scale on massive text-rich graphs. Since then, customers have been asking us for guidance on how GraphStorm’s LM+GNN techniques should be employed to optimize performance. To address this, with GraphStorm 0.3, we released a LM+GNN benchmark using the large graph dataset, Microsoft Academic Graph (MAG), on two standard graph ML tasks: node classification and link prediction. The graph dataset is a heterogeneous graph, contains hundreds of millions of nodes and billions of edges, and the majority of nodes are attributed with rich text features. The detailed statistics of the datasets are shown in the following table.

Dataset	Num. of nodes	Num. of edges	Num. of node/edge types	Num. of nodes in NC training set	Num. of edges in LP training set	Num. of nodes with text-features
MAG	484,511,504	7,520,311,838	4/4	28,679,392	1,313,781,772	240,955,156

We benchmark two main LM-GNN methods in GraphStorm: pre-trained BERT+GNN, a baseline method that is widely adopted, and fine-tuned BERT+GNN, introduced by GraphStorm developers in 2022. With the pre-trained BERT+GNN method, we first use a pre-trained BERT model to compute embeddings for node text features and then train a GNN model for prediction. With the fine-tuned BERT+GNN method, we initially fine-tune the BERT models on the graph data and use the resulting fine-tuned BERT model to compute embeddings that are then used to train a GNN models for prediction. GraphStorm provides different ways to fine-tune the BERT models, depending on the task types. For node classification, we fine-tune the BERT model on the training set with the node classification tasks; for link prediction, we fine-tune the BERT model with the link prediction tasks. In the experiment, we use 8 r5.24xlarge instances for data processing and use 4 g5.48xlarge instances for model training and inference. The fine-tuned BERT+GNN approach has up to 40% better performance (link prediction on MAG) compared to pre-trained BERT+GNN.

The following table shows the model performance of the two methods and the overall computation time of the whole pipeline starting from data processing and graph construction. NC means node classification and LP means link prediction. LM Time Cost means the time spent on computing BERT embeddings and the time spent on fine-tuning the BERT models for pre-trained BERT+GNN and fine-tuned BERT+GNN, respectively.

Dataset	Task	Data processing time	Target	Pre-trained BERT + GNN			Fine-tuned BERT + GNN
Dataset	Task	Data processing time	Target	LM Time Cost	One epoch time	Metric	LM Time Cost	One epoch time	Metric
MAG	NC	553 min	paper subject	206 min	135 min	Acc:0.572	1423 min	137 min	Acc:0.633
MAG	LP	553 min	cite	198 min	2195 min	Mrr: 0.487	4508 min	2172 min	Mrr: 0.684

We also benchmark GraphStorm on large synthetic graphs to showcase its scalability. We generate three synthetic graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding training set sizes are 8 million, 80 million, and 800 million, respectively. The following table shows the computation time of graph preprocessing, graph partition, and model training. Overall, GraphStorm enables graph construction and model training on 100 billion scale graphs within hours!

Graph Size	Data pre-process		Graph Partition		Model Training
Graph Size	# instances	Time	# instances	Time	# instances	Time
1B	4	19 min	4	8 min	4	1.5 min
10B	8	31 min	8	41 min	8	8 min
100B	16	61 min	16	416 min	16	50 min

More benchmark details and results are available in our KDD 2024 paper.

Conclusion

GraphStorm 0.3 is published under the Apache-2.0 license to help you tackle your large-scale graph ML challenges, and now offers native support for multi-task learning and new APIs to customize pipelines and other components of GraphStorm. Refer to the GraphStorm GitHub repository and documentation to get started.

About the Author

Xiang Song is a senior applied scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in graph database. He is now leading the development of GraphStorm, an open-source graph machine learning framework for enterprise use cases. He received his Ph.D. in computer systems and architecture at the Fudan University, Shanghai, in 2014.

Jian Zhang is a senior applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian lead technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems/robotics scientist – a field in which he holds a phd.

Few-shot prompt engineering and fine-tuning for LLMs in Amazon Bedrock

August 2, 2024

by Sovik Nath Amazon AWS

This blog is part of the series, Generative AI and AI/ML in Capital Markets and Financial Services.

Company earnings calls are crucial events that provide transparency into a company’s financial health and prospects. Earnings reports detail a firm’s financials over a specific period, including revenue, net income, earnings per share, balance sheet, and cash flow statement. Earnings calls are live conferences where executives present an overview of results, discuss achievements and challenges, and provide guidance for upcoming periods.

These disclosures are vitally important for capital markets, significantly impacting stock prices. Investors and analysts closely watch key metrics like revenue growth, earnings per share, margins, cash flow, and projections to assess performance against peers and industry trends. The rate of growth and profit margins influence the premium and multiplier that investors are willing to pay for a company’s stock, ultimately affecting stock returns and price movements.

Earnings calls also allow investors to look for new clues about a company’s future. Companies often release information about new products, cutting-edge technology, mergers and acquisitions, and investments in new market themes and trends during these events. Such details can signal potential growth opportunities for investors, analysts, and portfolio managers.

Traditionally, earnings call scripts have followed similar templates, making it a repeatable task to generate them from scratch each time. On the other hand, generative artificial intelligence (AI) models can learn these templates and produce coherent scripts when fed with quarterly financial data. With generative AI, companies can streamline the process of creating first drafts of earnings call scripts for a new quarter using repeatable templates and information about specific performance and business highlights. The initial draft of a large language model (LLM) generated earnings call script can be then refined and customized using feedback from the company’s executives.

Amazon Bedrock offers a straightforward way to build and scale generative AI applications with foundation models (FMs) and LLMs. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Model customization helps you deliver differentiated and personalized user experiences. To customize models for specific tasks, you can privately fine-tune FMs using your own labeled datasets in just a few quick steps.

In this post, we showcase how to generate the first draft of an earnings call script for the new quarter using LLMs. We demonstrate two methods to generate an earnings call script with LLMs: few-shot learning and fine-tuning. We assess the generated earnings call scripts and the applied methods from different dimensions—comprehensiveness, hallucinations, writing style, ease of use, and cost—and present our findings.

Solution overview

We apply two methods to generate the first draft of an earnings call script for the new quarter using LLMs:

Prompt engineering with few-shot learning – We use examples of the past earnings scripts with Anthropic Claude 3 Sonnet on Amazon Bedrock to generate an earnings call script for a new quarter.
Fine-tuning – We fine-tune Meta Llama 2 70B on Amazon Bedrock using input/output labeled data from the past earnings scripts and use the customized model to generate an earnings call script for a new quarter.

Both methods involve utilizing a consistent dataset of earnings call transcripts across multiple quarters. We use several past years of quarterly earnings calls, with one quarter set aside, which was used as ground truth for testing and comparison.

The process starts by retrieving the earnings call transcripts from the past quarters to the recent quarter. The next step involves selecting multiple scripts from the previous quarters to serve as few-shot learning examples as well as input/output dataset for fine-tuning. The script for the most recent quarter is held out for validation and evaluation of generated scripts. The generated script is evaluated by comparing it with the actual script for the quarter, which was initially kept aside.

The following diagram illustrates the solution architecture and workflow for both methods.

In the following sections, we discuss the workflows of each method in more detail.

Few-shot learning with Anthropic Claude 3 Sonnet on Amazon Bedrock

The prompt engineering for few-shot learning using Anthropic Claude 3 Sonnet is divided into four sections, as shown in the following figure. Three sections have constant instructions to the LLM based on assigning the LLM a role, instructions on style and tone of narrative, and examples for earnings calls from past quarters for few-shot learning. The fourth section has information on financial performance, results, and business highlights for the current quarter for which earnings calls are to be generated by the LLM.

We used Anthropic Claude 3 Sonnet to generate an earnings call for a new quarter using earnings calls from past quarters. The following is an example of our few-shot learning along with prompt instructions:

Section A: Overall prompt instructions (context)

You are the CEO and CFO of Any Company preparing to present the quarterly earnings report to investors. Draft a comprehensive earnings call script that covers the key financial metrics, business highlights, and future outlook for the given quarter. Provide details on revenue, operating income, segment performance, and important strategic initiatives or product launches during the quarter.

Section B: Specific guidance for the earnings script (context)

The earnings script should be written in a formal, investor-friendly tone suitable for a public earnings call. Use clear and concise language to explain financial performance and business developments. Aim to strike a balance between providing sufficient details and keeping the script reasonably concise. Incorporate specific data points and figures but avoid overwhelming with excessive numerical minutiae. The overall structure should flow logically, covering key topics like revenue, operating income, segment highlights, strategic priorities, and forward-looking guidance. Use the following 5 instructions when generating results for the earnings call script.

1. Provide a clear structure by organizing the content into logical sections, such as financial highlights, segment performance, operational metrics, strategic initiatives, and a forward-looking view.
2. Include granular details and insights into the factors impacting performance, such as customer behavior trends, supply chain improvements, cost optimization efforts, and any other relevant context etc.
3. Substantiate your commentary with specific data points and percentages to lend credibility to your statements. 4. Offer a comprehensive forward-looking view by discussing capital investments, preparedness for upcoming events or seasons, and the long-term strategic focus or priorities.
5. Maintain a measured, objective, and analytical tone throughout the content, avoiding overly conversational or casual language.

Section C: Example Scripts from past quarters (for Few Shot/ Chain-of-thought)

The example scripts from past quarters provide a reference for the structure, tone, and level of detail expected in an earnings call script. Use these examples to understand how to present financial data, highlight key business initiatives, and address investor concerns or questions. However, ensure that the script for current specific Quarter is tailored to the specific financial performance and business events of that quarter.
<example>
Amazon Earnings call transcript for Q1 2021 ...

Amazon Earnings call transcript for Q2 2021 ...
<example>

Section D: Financial data for quarter for which script is required (context)

<financial_data>

Provide the actual financial results for the specific quarter, including:
Total revenue and year-over-year growth rate
Revenue breakdown by key segments (e.g. AWS, Online Stores, etc.)
Operating income (total and by segment if available)
Any key operating metrics (e.g. Prime membership, third-party seller metrics, etc.)
Notes on significant factors impacting results (e.g. foreign exchange, product launches, one-time events)
Forward-looking guidance on revenue, operating income for next quarter
Highlight key business developments, product launches or strategic priorities for the quarter :

<financial_data>

Fine-tune Meta Llama 2 70B on Amazon Bedrock

In this section, we present our approach to improving the quality of generated earnings call scripts by fine-tuning an LLM. We chose to adapt the Meta Llama 2 70B model, which is powerful and known for its strong performance across various natural languages tasks, to the specific domain of earnings call scripts.

The following diagram illustrates the workflow for our fine-tuning method.

To prepare the training data, we collected a comprehensive dataset of real earnings call transcripts from Q1 2021 to Q4 2022 for Amazon.com. This focused dataset allows the model to better learn the company’s domain-specific knowledge and terminology. The time span also makes sure the model can learn from recent trends and patterns in earnings communications.

Amazon Bedrock offers a model customization feature that enables you to directly use your own data to customize a wide variety of models. This feature not only helps improve model performance on specific tasks but also allows the model to better understand company-specific domain knowledge and terms, ultimately creating a better user experience.

To fine-tune a text-to-text model, you need to prepare training and optional validation datasets by creating a JSONL file with multiple JSON lines. Each JSON line is a sample containing both a prompt and completion field. In our use case, the prompt contains the prompt template, which includes key financial data for that quarter, and the completion field contains the actual earnings call transcript for that quarter.

We use the following prompt template:

{"prompt": ”Section A: Overall prompt instructions (context)… Section B: Specific guidance for the earnings script (context)… Section D: Financial data for Q1 2021 for which script is required (context) The financial data for {time_period} is:
<financial_data>{Section D}<financial_data> Please generate the earning report for {time_period} to the investors, based on the information provided above. Don't make up any information. ", "completion": ”Real earning call script for that Q1 2021"}

The training data is prepared in JSONL format, with each line representing an earnings call for a quarter:

{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
{"prompt": "<prompt3>", "completion": "<expected generated text>"}

When the dataset is ready, we upload it to Amazon Simple Storage Service (Amazon S3) and set up a customization job in Amazon Bedrock. The training time varies from minutes to hours, depending on the size of the training data and the selected model. After the training job is complete, you must purchase Provisioned Throughput to use the model and generate future earnings call scripts. You can select the No Commitment option for Provisioned Throughput, which is billed on an hourly basis.

For inference, because some language models require a clear separation between the input prompt and expected output during fine-tuning, we need to add a special delimiting key before providing the input to the model. Specifically, for the Meta Llama 2 70B model, we add the key nn Response:n after the input prompt. This delimiter helps the model distinguish where the prompt ends and the expected response should begin, allowing it to generate more accurate outputs. The prompt would look as follows:

Prompt:
{User_Input_Prompt}

Response:

By providing this formatted prompt during inference, the fine-tuned Meta Llama 2 70B model can better understand the input context and generate a more relevant earnings call script as the response.

For better performance, you can use the same prompt template with the current quarter’s financial data (without the few-shot learning examples), format it with the delimiter, and send it to the customized model to generate the final earnings call script for that quarter.

Evaluation of few-shot prompt engineering and fine-tuning

We evaluated the generated earnings call transcripts from both methods (few-shot prompt engineering and fine-tuning) using two different approaches:

Evaluated by a human reviewer
Evaluated by comparing three variations using an LLM (Anthropic Claude 3 Sonnet)

Evaluated by human reviewer

The following table summarizes a human reviewer’s evaluation.

It is imperative to note that two factors contributed to the differences: varying approaches (few-shot learning and fine-tuning) and disparate models (Anthropic Claude 3 and Meta Llama 70B). Consequently, the results cannot be interpreted as a mere comparison of models. It is advisable to explore the approaches with your specific use case and data, and subsequently evaluate the outcomes by discussing with subject matter experts from the relevant business department.

Factor	Fine-Tuned Model	Few-shot Prompt Engineering
Comprehensiveness	The script covers most of the key points provided in the prompts, although it ignored a few details. For example, it misses the point that the growth in advertising was primarily driven by using machine learning models to improve relevancy of ads.	The script covers key points provided in the prompts.
Hallucination	Two instances. (1) “This growth was driven by strong demand for our Prime Day event, which saw record-breaking sales and attracted millions of new Prime members.” (2) “This growth was driven by strong demand in our key markets, including India and Japan.”	Once. (1) “In North America, revenue grew 11% year-over-year to $87.9 billion, fueled by continued robust demand and greater purchase frequency by Prime Members.”
Writing style	(1) This script uses mostly objective and precise language, which is consistent with the real earnings call. Still, it has subjective expressions such as “a huge success,” and imprecise expressions such as “double digit growth.” (2) The language offers less variations. For example, it uses the format of “This ___ was driven by ___” 10 times without variations. (3) The model generated some additional sentences. For example, “Now, let’s turn to our forward guidance. At this time, we’re not providing specific revenue or operating income guidance for the fourth quarter.“	The real earnings call uses precise and objective language, while this script uses more metaphoric expressions such as “laser-focused” and “made further strides,” as well as subjective expressions such as “invest prudently” and “disciplined execution.“
Ease of Use	(1) Fine-tuning a model in Amazon Bedrock gives the option of following steps on the Amazon Bedrock console or apply coding to interact with LLMs on Amazon Bedrock through the API. (2) The fine-tuning process generally takes longer compared to few-shot prompt engineering based on the same documents. (3) Fine-tuning requires preparing data in input/output format (JSON files) for training the selected model. (4) If a new document is added, the whole fine-tuned model needs to be updated by going through the same fine-tuning process.	(1) Amazon Bedrock allows users to give instructions and example data to an LLM as is using both the UI or creating reproducible codes. (2) If a new document is added, the user only needs to add to the prompt an example for few-shot learning or prompt instructions. Overall, few-shot prompt engineering is easier to implement, compared to fine-tuning a model.
Cost	Monthly cost incurred for fine-tuning = Fine-tuning training cost for the model (priced by number of tokens for training data) + custom model storage per month + hourly cost (or Provisioned Throughput cost for time commitment) of custom model inference.	Priced by number of input (few-shot prompts and examples) and output tokens for the model.

The cost comparison can be further evaluated by the frequency of usage, as shown in the following table.

Method	One-Time Cost	Recurring Cost	Inference Cost
Fine-Tuning	Priced by the number of tokens for training data	Custom model storage cost per month	Custom model inference cost (hourly or Provisioned Throughput commitment)
Few-Shot Prompt Engineering	N/A	N/A	Priced by number of input (prompts and examples) and output tokens

Evaluated by comparing three variations using an LLM

We tested the following variations:

Variation A – Earnings call transcript from few-shot learning with Anthropic Claude v3 Sonnet
Variation B – Earnings call transcript with fine-tuned Meta Llama 70B
Variation C – Actual earnings call transcript for the quarter

The following table summarizes the key similarities and differences between the three variations of the Amazon Q3 2023 earnings call transcript. Variation A and Variation B have two main differences – different approaches (few-shot learning vs fine-tuning) and different models (Anthropic Claude 3 vs Meta Llama 70B).

.	Identified Factor	Result Summaries
Similarities	Financial Metrics	All variations report strong financial results, with revenue growth around 11% year-over-year and significant increases in operating income.
	Business Highlights	They highlight the success of Prime Day as a major driver of sales and Prime member growth. The transcripts mention continued growth in third-party seller services, advertising, and AWS.
	Management Focus	There is a focus on improving operational efficiency, cost optimization, and supply chain/delivery improvements.
	Innovation and Partnerships	Generative AI initiatives and partnerships (such as Anthropic, Amazon Bedrock, and Amazon CodeWhisperer) are discussed in relation to AWS.
Dissimilarities	Level of Financial Detail	Variation A provides more detailed financials (exact revenue, operating income figures) than B and C.
	Narrative/ Commentary Style –	Variation B has more personal commentary from “Jeff Bezos” and “Brian Olsavsky” compared to A and C’s more generic and impersonal style.
	Level of Business Detail –	Variation C goes into more specifics on initiatives like regionalization, inventory optimization, and cost reduction efforts. Variation A discusses priorities and forward-looking initiatives in more depth compared to B and C.
	Forward Guidance	Only Variation C mentions actual forward guidance on capital investments for 2023.

Moreover, we can compare the difference between A vs. C and B vs. C to better compare the generated results to the actual earning scripts.

Identified Factor	Difference between A & C	Difference between B & C
Financial Details	A lacks some of the specific financial details and figures present in the actual script.	B is more similar to the actual script in terms of providing segment-wise financial figures and percentages.
Depth of Content	A mentions broad themes and priorities, whereas C dives deeper into operational metrics, cost savings initiatives, and strategic updates.	C provides additional details on topics like free cash flow, capital investments, and strategic initiatives like generative AI.

Overall, although the core financial highlights are similar, there are nuances in the depth of details provided and the narrative and commentary style across the three variations.

Conclusion

Generating high-quality earnings call script drafts using LLMs is a promising approach that can streamline the process for companies. Both the few-shot prompt engineering and fine-tuning methods demonstrated the ability to produce scripts covering key financial metrics, business updates, and forward-looking guidance. Each method has its own nuances. However, there are trade-offs in terms of comprehensiveness, hallucinations, writing style, ease of implementation, and cost that companies must evaluate based on their specific needs and priorities. As language models continue advancing, further research in customizing and refining these models for the financial services and capital markets domain could unlock even more value for financial communications processes.

This blog presents a framework for two different approaches: few-shot prompt engineering and fine-tuning with Large Language Models (LLMs), followed by an evaluation of the results. The findings should not be interpreted as prescriptive recommendations for favoring one approach over the other, as the choice depends on the specific content and prompts. Additionally, the results should not be construed as a direct comparison of LLMs, as the methodologies employed with each LLM differ, making it an apples-to-oranges comparison. As LLMs continue to advance, we anticipate further improvements in their output quality.

As next steps, you can use Amazon Bedrock to explore your own data and use cases. You can engage in few-shot prompt engineering and fine-tuning methods with different LLMs on Amazon Bedrock, using your specific data securely and privately. Furthermore, you can evaluate the results of these methods by collaborating with subject matter experts or using evaluation frameworks, enabling you to assess the performance and suitability of the methods and LLMs on Amazon Bedrock for your particular use case. You can try out and compare the results, and either use prompt engineering or deploy your own fine-tuned model to generate the earnings calls tied to your company. You can also evaluate both approaches for any related use case.

Refer to Prompt engineering guidelines and Custom models for more information about these two methods. To learn more about applying generative AI for investment research, please refer to AI-powered assistants for investment research with multi-modal data: An application of Agents for Amazon Bedrock.

Refer to this blog to find out more about, empowering analysts to perform financial statement analysis, hypothesis testing, and cause-effect analysis with Amazon Bedrock, Anthropic Claude 3 Sonnet, and prompt engineering

About the Authors

Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers leverage GenAI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a Ph.D. degree in Electrical Engineering. Outside of work, she loves traveling, working out and exploring new things.

Jia (Vivian) Li is a Senior Solutions Architect in AWS, with specialization in AI/ML. She currently supports customers in financial industry. Prior to joining AWS in 2022, she had 7 years of experience supporting enterprise customers use AI/ML in the cloud to drive business results. Vivian has a BS from Peking University and a PhD from University of Southern California. In her spare time, she enjoys all the water activities, and hiking in the beautiful mountains in her home state, Colorado.