Enabling complex generative AI applications with Amazon Bedrock Agents

In June, I started a series of posts that highlight the key factors that are driving customers to choose Amazon Bedrock. The first covered building generative AI apps securely with Amazon Bedrock, while the second explored building custom generative AI applications with Amazon Bedrock. Now I’d like to take a closer look at Amazon Bedrock Agents, which empowers our customers to build intelligent and context-aware generative AI applications, streamlining complex workflows and delivering natural, conversational user experiences. The advent of large language models (LLMs) has enabled humans to interact with computers using natural language. However, many real-world scenarios demand more than just language comprehension. They involve executing complex multi-step workflows, integrating external data sources, or seamlessly orchestrating diverse AI capabilities and data workflows. In these real-world scenarios, agents can be a game changer, delivering more customized generative AI applications—and transforming the way we interact with and use LLMs.

Answering more complex queries

Amazon Bedrock Agents enables a developer to take a holistic approach in improving scalability, latency, and performance when building generative AI applications. Generative AI solutions that use Amazon Bedrock Agents can handle complex tasks by combining an LLM with other tools. For example, imagine that you are trying to create a generative AI-enabled assistant that helps people plan their vacations. You want it to be able to handle simple questions like “What’s the weather like in Paris next week?” or “How much does it cost to fly to Tokyo in July?” A basic virtual assistant might be able to answer those questions drawing from preprogrammed responses or by searching the Internet. But what if someone asks a more complicated question, like “I want to plan a trip to three countries next summer. Can you suggest a travel itinerary that includes visiting historic landmarks, trying local cuisine, and staying within a budget of $3,000?” That is a harder question because it involves planning, budgeting, and finding information about different destinations.

Using Amazon Bedrock Agents, a developer can quickly build a generative assistant to help answer this more complicated question by combining the LLM’s reasoning with additional tools and resources, such as natively integrated knowledge bases to propose personalized itineraries. It could search for flights, hotels, and tourist attractions by querying travel APIs, and use private data, public information for destinations, and weather—while keeping track of the budget and the traveler’s preferences. To build this agent, you would need an LLM to understand and respond to questions. But you would also need other modules for planning, budgeting, and accessing travel information.

Agents in action

Our customers are using Amazon Bedrock Agents to build agents—and agent-driven applications—quickly and effectively. Consider Rocket, the fintech company that helps people achieve home ownership and financial freedom:

“Rocket is poised to revolutionize the homeownership journey with AI technology, and agentic AI frameworks are key to our mission. By collaborating with AWS and leveraging Amazon Bedrock Agents, we are enhancing the speed, accuracy, and personalization of our technology-driven communication with clients. This integration, powered by Rocket’s 10 petabytes of data and industry expertise, ensures our clients can navigate complex financial moments with confidence.”

– Shawn Malhotra, CTO of Rocket Companies.

A closer look at how agents work

Unlike LLMs that provide simple lookup or content-generation capabilities, agents integrate various components with an LLM to create an intelligent orchestrator capable of handling sophisticated use cases with nuanced context and specific domain expertise. The following figure outlines the key components of Amazon Bedrock Agents:

The process starts with two parts—the LLM and the orchestration prompt. The LLM—often a model from the Anthropic Claude or Meta Llama families—provides the basic reasoning capabilities. The orchestration prompt is a set of prompts or instructions that guides the LLM as it drives the decision-making process.

In the following sections, we discuss the key components of Amazon Bedrock Agents in depth:

Planning: A path from task to goals

The planning component for LLMs entails comprehending tasks and devising multi-step strategies to address a problem and fulfill the user’s need. In Amazon Bedrock Agents, we use chain-of-thought prompting in combination with ReAct in the orchestration prompt to improve an agent’s ability to solve multi-step tasks. In task decomposition, the agent must understand the intricacies of an abstract request. Continuing to explore our travel scenario, if a user wants to book a trip, the agent must recognize that it encompasses transportation, accommodation, reservations for sightseeing attractions, and restaurants. This ability to split up an abstract request, such as planning a trip, into detailed, executable actions, is the essence of planning. However, planning extends beyond the initial formulation of a plan, because during execution, the plan may get dynamically updated. For example, when the agent has completed arranging transportation and progresses to booking accommodation, it may encounter circumstances where no suitable lodging options align with the original arrival date. In such scenarios, the agent must determine whether to broaden the hotel search or revisit alternative booking dates, adapting the plan as it evolves.
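To make this concrete, the following minimal Python sketch (an illustration of the ReAct pattern, not the internal Amazon Bedrock Agents implementation) shows how an orchestration loop alternates between model reasoning, tool calls, and observations. The call_llm stub and the two tools are hypothetical placeholders:

# Illustrative sketch only -- not the internal Amazon Bedrock Agents implementation.
# It shows the ReAct pattern an orchestration prompt encodes: the model alternates
# between a Thought (reasoning), an Action (tool call), and an Observation (tool
# result) until it can emit a final answer. call_llm and the tools are stubs.

def call_llm(prompt: str) -> str:
    """Stub for an LLM invocation (for example, via the Amazon Bedrock runtime API)."""
    return "Thought: I need flight prices first.\nAction: search_flights[Paris, July]"

TOOLS = {
    "search_flights": lambda query: "Cheapest round trip: $620",
    "search_hotels": lambda query: "3-star hotel near the Louvre: $140/night",
}

def react_loop(user_request: str, max_steps: int = 5) -> str:
    transcript = f"Task: {user_request}\n"
    for _ in range(max_steps):
        output = call_llm(transcript)          # model produces Thought + Action (or a final answer)
        transcript += output + "\n"
        if output.startswith("Final Answer:"):
            return output
        if "Action:" in output:
            name, _, arg = output.split("Action:")[1].strip().partition("[")
            observation = TOOLS[name.strip()](arg.rstrip("]"))   # execute the chosen tool
            transcript += f"Observation: {observation}\n"        # feed the result back to the model
    return "Final Answer: unable to complete within the step budget"

print(react_loop("Plan a 3-country summer trip under $3,000"))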

Memory: Home for critical information

Agents have both long-term and short-term memory. Short-term memory is detailed and exact. It is relevant to the current conversation and resets when the conversation is over. Long-term memory is episodic and remembers important facts and details in the form of saved summaries. These summaries serve as the memory synopses of previous dialogues. The agent uses this information from the memory store to better solve the current task. The memory store is separate from the LLM, with a dedicated storage and a retrieval component. Developers have the option to customize and control which information is stored (or excluded) in memory. An identity management feature, which associates memory with specific end-users, gives developers the freedom to identify and manage end-users—and enable further development on top of Amazon Bedrock agents’ memory capabilities. The industry-leading memory retention functionality of Amazon Bedrock—launched at the recent AWS New York Summit—allows agents to learn and adapt to each user’s preferences over time, enabling more personalized and efficient experiences across multiple sessions for the same user. It is straightforward to use, allowing users to get started in a single click.
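As an example, the following sketch shows how a developer might enable memory when creating an agent with the boto3 bedrock-agent APIs and then reuse it across sessions for the same end user. The role ARN, alias ID, and memory identifiers are placeholders, and the parameter shapes should be checked against the current API reference:

import boto3

agent_client = boto3.client("bedrock-agent")

agent = agent_client.create_agent(
    agentName="travel-planner",
    foundationModel="anthropic.claude-3-sonnet-20240229-v1:0",
    instruction="You are a travel assistant. Plan itineraries within the user's budget.",
    agentResourceRoleArn="arn:aws:iam::111122223333:role/BedrockAgentRole",  # placeholder
    memoryConfiguration={
        "enabledMemoryTypes": ["SESSION_SUMMARY"],  # store summaries of past sessions
        "storageDays": 30,                          # how long summaries are retained
    },
)

# After preparing the agent and creating an alias, pass a stable memoryId per end
# user at invocation time so the agent can recall preferences across sessions.
runtime = boto3.client("bedrock-agent-runtime")
response = runtime.invoke_agent(
    agentId=agent["agent"]["agentId"],
    agentAliasId="TSTALIASID",        # placeholder alias
    sessionId="session-001",
    memoryId="user-42",               # ties memory to a specific end user
    inputText="Find me a hotel for the same dates as my last trip.",
)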

Communication: Using multiple agents for greater efficiency and effectiveness

Drawing from the powerful combination of the capabilities we’ve described, Amazon Bedrock Agents makes it effortless to build agents that transform one-shot query responders into sophisticated orchestrators capable of tackling complex, multifaceted use cases with remarkable efficiency and adaptability. But what about using multiple agents? LLM-based AI agents can collaborate with one another to improve efficiency in solving complex questions. Today, Amazon Bedrock makes it straightforward for developers to connect them through LangGraph, part of LangChain, the popular open source tool set. The integration of LangGraph into Amazon Bedrock empowers users to take advantage of the strengths of multiple agents seamlessly, fostering a collaborative environment that enhances the overall efficiency and effectiveness of LLM-based systems.
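As a sketch of what this can look like, the following example wires two cooperating agents into a LangGraph workflow backed by an Amazon Bedrock-hosted model through LangChain's ChatBedrock wrapper. The state fields, prompts, and model ID are illustrative choices rather than a prescribed pattern:

from typing import TypedDict
from langchain_aws import ChatBedrock          # LangChain's Amazon Bedrock chat wrapper
from langgraph.graph import StateGraph, START, END

class TripState(TypedDict):
    request: str
    itinerary: str
    budget_report: str

llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

def planner_agent(state: TripState) -> dict:
    msg = llm.invoke(f"Draft a 3-country itinerary for: {state['request']}")
    return {"itinerary": msg.content}

def budget_agent(state: TripState) -> dict:
    msg = llm.invoke(f"Check this itinerary fits a $3,000 budget:\n{state['itinerary']}")
    return {"budget_report": msg.content}

graph = StateGraph(TripState)
graph.add_node("planner", planner_agent)
graph.add_node("budget", budget_agent)
graph.add_edge(START, "planner")
graph.add_edge("planner", "budget")   # the planner's output flows to the budget reviewer
graph.add_edge("budget", END)
app = graph.compile()

result = app.invoke({"request": "Summer trip to Japan, Italy, and Portugal"})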

Tool Integration: New tools mean new capabilities

New generations of models, such as Anthropic Claude 3.5 Sonnet, Meta Llama 3.1, or Amazon Titan Text Premier, are better equipped to use resources. Using these resources requires that developers keep up with ongoing updates and changes, writing new prompts every time. To reduce this burden, Amazon Bedrock simplifies interfacing with different models, making it effortless to take advantage of all the features a model has to offer. For example, the new code interpretation capability announced at the recent AWS New York Summit allows Amazon Bedrock agents to dynamically generate and run code snippets within a secure, sandboxed environment to address complex tasks like data analysis, visualization, text processing, and equation solving. It also enables agents to process input files in various formats—including CSV, Excel, and JSON—and generate charts from data.
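For example, a developer might attach the managed code interpreter to an existing draft agent roughly as follows. This is a sketch using the boto3 bedrock-agent API; the agent ID is a placeholder, and the parameters should be verified against the current documentation:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Enable the managed code interpreter for an existing draft agent by adding the
# built-in action group signature. "<YOUR-AGENT-ID>" is a placeholder.
bedrock_agent.create_agent_action_group(
    agentId="<YOUR-AGENT-ID>",
    agentVersion="DRAFT",
    actionGroupName="CodeInterpreterAction",
    parentActionGroupSignature="AMAZON.CodeInterpreter",
    actionGroupState="ENABLED",
)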

Guardrails: Building securely

Accuracy is critical when dealing with complex queries. Developers can enable Amazon Bedrock Guardrails to help reduce inaccuracies. Guardrails improve the behavior of the applications you’re building, increase accuracy, and help you build responsibly. They can prevent both malicious intent from users and potentially toxic content generated by AI, providing a higher level of safety and privacy protection.
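As an illustrative sketch (names and messages are examples, and the policy shapes should be checked against the current Amazon Bedrock API reference), a developer might define a guardrail like the following and then associate it with an agent:

import boto3

bedrock = boto3.client("bedrock")

# Define a guardrail that blocks an off-limits topic and filters harmful content.
guardrail = bedrock.create_guardrail(
    name="travel-assistant-guardrail",
    description="Keeps the travel assistant on topic and blocks harmful content",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "InvestmentAdvice",
                "definition": "Recommendations about specific financial investments.",
                "type": "DENY",
            }
        ]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that information.",
)
print(guardrail["guardrailId"], guardrail["version"])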

Amplifying and extending the capabilities of generative AI with Amazon Bedrock Agents

Enterprises, startups, ISVs, and systems integrators can take advantage of Amazon Bedrock Agents today because it provides development teams with a comprehensive solution for building and deploying AI applications that can handle complex queries, use private data sources, and adhere to responsible AI practices. Developers can start with tested examples—so-called golden utterances (input prompts) and golden responses (expected outputs). You can continuously evolve agents to fit your top use cases and kickstart your generative AI application development. Agents unlock significant new opportunities to build generative AI applications to truly transform your business. It will be fascinating to see the solutions—and results—that Amazon Bedrock Agents inspires.

Resources

For more information on customization with Amazon Bedrock, see the following resources:


About the author

Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.

Genomics England uses Amazon SageMaker to predict cancer subtypes and patient survival from multi-modal data

This post is co-written with Francisco Azuaje from Genomics England.

Genomics England analyzes sequenced genomes for The National Health Service (NHS) in the United Kingdom, and then equips researchers to use data to advance biological research. As part of its goal to help people live longer, healthier lives, Genomics England is interested in facilitating more accurate identification of cancer subtypes and severity, using machine learning (ML). To explore whether such ML models can perform at higher accuracy when using multiple modalities, such as genomic and imaging data, Genomics England has launched a multi-modal program aimed at enhancing its dataset and also partnered with the AWS Global Health and Non-profit Go-to-Market (GHN-GTM) Data Science and AWS Professional Services teams to create an automatic cancer sub-typing and survival detection pipeline and explore its accuracy on publicly available data.

In this post, we detail our collaboration in creating two proof of concept (PoC) exercises around multi-modal machine learning for survival analysis and cancer sub-typing, using genomic (gene expression, mutation and copy number variant data) and imaging (histopathology slides) data. We provide insights on interpretability, robustness, and best practices of architecting complex ML workflows on AWS with Amazon SageMaker. These multi-modal pipelines are being used on the Genomics England cancer cohort to enhance our understanding of cancer biomarkers and biology.

1. Data

The PoCs have used the publicly available cancer research data from The Cancer Genome Atlas (TCGA), which contain paired high-throughput genome analysis and diagnostic whole slide images with ground-truth survival outcome and histologic grade labels. Specifically, the PoCs focus on whole slide histopathology images of tissue samples as well as gene expression, copy number variations, and the presence of deleterious genetic variants to perform analysis on two cancer types: Breast cancer (BRCA) and gastrointestinal cancer types (Pan-GI). Table 1 shows the sample sizes for each cancer type.

Table 1. Overview of input data sizes across the different cancer types investigated.

2. Multi-modal machine learning frameworks

The ML pipelines tackling multi-modal subtyping and survival prediction have been built in three phases throughout the PoC exercises. First, a state-of-the-art framework, namely Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE) (Chen et al., 2022) was implemented (Section 2.1). This was followed by the proposal, development, and implementation of a novel architecture based on Hierarchical Extremum Encoding (HEEC) (Section 2.2) by AWS, which aimed to mitigate the limitations of PORPOISE. The final phase improved on the results of HEEC and PORPOISE—both of which have been trained in a supervised fashion—using a foundation model trained in a self-supervised manner, namely Hierarchical Image Pyramid Transformer (HIPT) (Chen et al., 2023).

2.1 Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE)

PORPOISE (Chen et al., 2022) is a multi-modal ML framework that consists of three sub-network components (see Figure 1 at Chen et al., 2022):

  1. A CLAM component: an attention-based multiple-instance learning network trained on pre-processed whole slide image (WSI) inputs (CLAM, Lu et al., 2021). CLAM extracts features from image patches of size 256×256 using a pre-trained ResNet50.
  2. A self-normalizing network component for extracting deep molecular features.
  3. A multi-modal fusion layer for integrating feature representations from 1) and 2) by modelling their pairwise interactions. The joint representations obtained from 3) are then used for undertaking the downstream tasks such as survival analysis and cancer-subtyping.

Despite being performant, PORPOISE was observed to yield lower multi-modal performance than the best single modality (imaging) alone when gene expression data was excluded from the genomic features while performing survival analysis on the Pan-GI data (Figure 2). A possible explanation is that the model has difficulty dealing with the extremely high-dimensional, sparse genomic data without overfitting.

2.2. Hierarchical Extremum Encoding (HEEC): A novel supervised multi-modal ML framework

To mitigate the limitations of PORPOISE, AWS has developed a novel model structure, HEEC, which is based on three ideas:

  1. Using tree ensembles (LightGBM) to mitigate the sparsity and overfitting issue observed when training PORPOISE (as observed by Grinsztajn et al., 2022, tree-based models tend to overfit less when confronted with high-dimensional data with many largely uninformative features).
  2. Representation construction using a novel encoding scheme (extremum encoding) that preserves spatial relationships and thus interpretability.
  3. Hierarchical learning to allow representations at multiple spatial scales.

Figure 1. Hierarchical Extremum Encoding (HEEC) of pathomic representations.

Figure 1 summarizes the HEEC architecture, starting from the bottom and moving clockwise: every input WSI is cut into patches of size 4096×4096 and 256×256 pixels in a hierarchical manner, and all stacks of patches are fed through ResNet50 to obtain embedding vectors. Additionally, nucleus-level representations (of size 64×64 pixels) are extracted by a graph neural network (GNN), allowing local nucleus neighborhoods and their spatial relationships to be taken into account. This is followed by filtering for redundancy: important patch embeddings are selected using positive-unlabeled learning, and GNN importance filtering is used to retain the top nuclei features. The resulting hierarchical embeddings are coded using extremum encoding: the maxima and minima across the embeddings are taken in each vector entry, resulting in a single vector of maxima and minima per WSI. This encoding scheme keeps exact track of spatial relationships for each entry in the resulting representation vectors, because the model can backtrack each vector entry to a specific patch, and thus to a specific coordinate in the image.
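As a toy illustration of the extremum-encoding idea (the shapes, names, and values below are made up for the example and are not the authors' code), the per-WSI vector is simply the per-dimension maxima and minima of the patch embeddings, with the argmax retained so each entry can be traced back to a slide coordinate:

import numpy as np

# Given per-patch embeddings for one WSI, keep the per-dimension max and min, and
# remember which patch produced each extremum so it can be traced back to image
# coordinates.
rng = np.random.default_rng(0)
patch_embeddings = rng.normal(size=(500, 2048))         # 500 patches x 2048 ResNet50 features
patch_coords = rng.integers(0, 40_000, size=(500, 2))   # (x, y) position of each patch in the slide

max_per_dim = patch_embeddings.max(axis=0)
min_per_dim = patch_embeddings.min(axis=0)
wsi_vector = np.concatenate([max_per_dim, min_per_dim])  # one 4096-d vector per WSI

# Backtracking: dimension j of the max half came from this patch / slide location.
argmax_patch = patch_embeddings.argmax(axis=0)
j = 17
print("Feature", j, "traces back to the patch at", patch_coords[argmax_patch[j]])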

On the genomics side, importance filtering is applied based on excluding features that don’t correlate with the prediction target. The remaining features are horizontally appended to the pathology features, and a gradient boosted decision tree classifier (LightGBM) is applied to achieve predictive analysis.
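The following toy example (illustrative shapes and threshold, not the PoC code) sketches this late-fusion step with a simple correlation filter and a LightGBM classifier:

import numpy as np
import lightgbm as lgb

# Filter genomic features by correlation with the label, append them to the
# pathology vectors, and fit a gradient-boosted tree classifier.
rng = np.random.default_rng(1)
n = 200
pathology = rng.normal(size=(n, 4096))      # extremum-encoded WSI vectors
genomics = rng.normal(size=(n, 2000))       # sparse, high-dimensional omics features
y = rng.integers(0, 2, size=n)              # for example, a cancer subtype label

# Keep genomic features whose absolute correlation with the target exceeds a threshold.
corr = np.array([abs(np.corrcoef(genomics[:, j], y)[0, 1]) for j in range(genomics.shape[1])])
genomics_kept = genomics[:, corr > 0.1]

X = np.hstack([pathology, genomics_kept])   # horizontal (late) fusion of the two modalities
clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X, y)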

The HEEC architecture is interpretable out of the box, because HEEC embeddings possess implicit spatial information and the LightGBM model supports feature importance, allowing the most important features for accurate prediction to be identified and backtracked to their location of origin. This location can be visually highlighted on the histology slide and presented to expert pathologists for verification. Table 2 and Figure 2 show the performance of PORPOISE and HEEC; HEEC is the only algorithm that outperforms the best-performing single modality by combining multiple modalities.

Table 2. Classification and survival prediction performance of the two implemented multi-modal ML models on TCGA data. *Although Chen et al., 2022 provide some interpretability, the proposed attention visualization heatmaps have been deemed difficult to interpret from the pathologist point of view by Genomics England domain experts.

Figure 2. Comparison of performance (AUC) across individual modalities for survival analysis, when excluding the gene expression data. This matches the setting encountered by Genomics England (GEL) in practice, because GEL's internal dataset has no gene expression data.

2.3. Improvements using foundation models

Despite yielding promising results, PORPOISE and HEEC algorithms use backbone architectures trained using supervised learning (for example, ImageNet pre-trained ResNet50). To further improve performance, a self-supervised learning-based approach, namely Hierarchical Image Pyramid Transformer (HIPT) (Chen et al., 2023), has been investigated in the final stage of the PoC exercises. Note that HIPT is currently limited to the hierarchical self-supervised learning of the imaging modality (WSIs) and further work includes expansion of self-supervised learning for the genomic modality.

HIPT starts by defining a hierarchy of patches composed of non-overlapping regions of size 16×16, 256×256, and 4096×4096 pixels (see Figure 2 at Chen et al., 2023). The lowest-layer features are extracted from the smallest patches (16×16) using a self-supervised learning algorithm based on DINO with a Vision Transformer (ViT) backbone. For each 256×256 region, the lowest-layer features are then aggregated using a global pooling layer. The aggregated features constitute the (new input) features for the middle-level in the hierarchy, where the process of self-supervised learning followed by global pooling is repeated and the aggregated output features form the input features belonging to the 4096×4096 region. These input features go through self-supervised learning one last time, and the final embeddings are obtained using global attention pooling. After pre-training is completed, fine-tuning is employed only on the final layer of the hierarchy (acting on 4096×4096 regions) using multiple instance learning.

Genomics England investigated whether using HIPT embeddings would be better than using the ImageNet-pretrained ResNet50 encoder, and initial experiments have shown a gain in Harrell's C-index of approximately 0.05 per cancer type in survival analysis. The embeddings offer other benefits as well, such as being smaller, which means that models train faster and the features have a smaller storage footprint.

3. Architecture on AWS

As part of the PoCs, we built a reference architecture (illustrated in Figure 3) for multi-modal ML using SageMaker, a platform for building, training, and deploying ML models with fully managed infrastructure, tools, and workflows. We aimed to demonstrate some general, reusable patterns that are independent of the specific algorithms:

  • Decouple data pre-processing and feature computation from model training: In our use case, we process the pathology images into numerical feature representations once, store the resulting feature vectors in Amazon Simple Storage Service (Amazon S3), and reuse them to train different models. Analogously, a second processing branch processes and extracts features from the genomic data.
  • Decouple model training from inference: As we experiment with different model structures and hyperparameters, we keep track of model versions, hyperparameters, and metrics in SageMaker model registry. We refer to the registry to review our experiments and choose which models to deploy for inference.
  • Wrap long-running computations inside containers and delegate their execution to SageMaker: Any long-running computation benefits from this pattern, whether it’s for data processing, model training, or batch inference. In this way, there’s no need to manage the underlying compute resources for running the containers. Cost is reduced through a pay-as-you-go model (resources are destroyed after a container has finished running) and the architecture is easily scalable to run multiple jobs in parallel.
  • Orchestrate multiple containerized jobs into SageMaker pipelines: We build a pipeline once and run it multiple times with different parametrization. Hence, pipeline invocations can be referred to at a higher level of abstraction, without having to constantly monitor the status of their long-running constituent jobs (a condensed sketch follows this list).
  • Delegate hyperparameter tuning to SageMaker using a hyperparameter tuning job: A tuning job is a family of related training jobs (all managed by SageMaker) that efficiently explore the hyperparameter space. These training jobs take the same input data for training and validation, but each one is run with different hyperparameters for the learning algorithm. SageMaker automatically chooses which hyperparameter values to explore at each iteration.
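The following condensed sketch, referenced in the list above, illustrates several of these patterns with the SageMaker Python SDK. The processing script name, training image URI, and instance types are placeholders rather than the exact jobs used in the PoCs:

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingOutput
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Containerized feature extraction, decoupled from model training.
processor = SKLearnProcessor(
    framework_version="1.2-1", role=role,
    instance_type="ml.m5.xlarge", instance_count=1,
)
extract_step = ProcessingStep(
    name="ExtractFeatures",
    processor=processor,
    code="extract_features.py",  # hypothetical feature-extraction script
    outputs=[ProcessingOutput(output_name="features", source="/opt/ml/processing/output")],
)

# Training job that consumes the features produced by the step above.
estimator = Estimator(
    image_uri="<YOUR-TRAINING-IMAGE-URI>",  # placeholder training container
    role=role, instance_type="ml.g5.2xlarge", instance_count=1,
    output_path=f"s3://{bucket}/models",
)
train_step = TrainingStep(
    name="TrainMultiModalModel",
    estimator=estimator,
    inputs={"train": TrainingInput(
        extract_step.properties.ProcessingOutputConfig.Outputs["features"].S3Output.S3Uri)},
)

# Orchestrate both jobs as a reusable pipeline that can be run with different parameters.
pipeline = Pipeline(name="multimodal-poc-pipeline", steps=[extract_step, train_step])
# pipeline.upsert(role_arn=role); pipeline.start()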

3.1 Separation between development and production environments

In general, we advise doing all development work outside of a production environment, because this minimizes the risk of leakage and corruption of sensitive production data and keeps the production environment free of intermediate data and software artifacts that obfuscate lineage tracking. If data scientists require access to production data during development, for tasks such as exploratory analysis and modelling work, there are numerous strategies that can be employed to minimize risk. One effective strategy is to employ data masking or synthetic data generation techniques in the testing environment to simulate real-world scenarios without compromising sensitive data. Furthermore, production-level data can be securely moved into an independent environment for analysis. Access controls and permissions can be implemented to restrict the flow of data between environments, maintaining separation and ensuring minimal access rights.

Genomics England has created two separate ML environments for testing and production level interaction with data. Each environment sits in its own isolated AWS account. The test environment mimics the production environment in its data storage strategy, but contains synthetic data void of personally identifiable information (PII) or protected health information (PHI), instead of production-level data. This test environment is used for developing essential infrastructure components and refining best practices in a controlled setting, which can be tested with synthetic data before deploying to production. Strict access controls, including role-based permissions employing principles of least privilege, are implemented in all environments to ensure that only authorized personnel can interact with sensitive data or modify deployed resources.

3.2 Automation with CI/CD pipelines

On a related note, we advise ML developers to use infrastructure as code to describe the resources that are deployed in their AWS accounts and to use continuous integration and delivery (CI/CD) pipelines to automate code quality checks, unit testing, and the creation of artifacts, such as container images. You can then configure the CI/CD pipelines to automatically deploy the created artifacts into the target AWS accounts, whether those are for development or production. These well-established automation techniques minimize errors related to manual deployments and maximize the reproducibility between development and production environments.

Genomics England has investigated the use of CI/CD pipelines for automated deployment of platform resources, as well as automated testing of code.

Figure 3. Overview of the AWS reference architecture employed for multi-modal ML in the cloud

4. Conclusion

Genomics England has a long history of working with genomics data; however, the inclusion of imaging data adds additional complexity and potential. The two PoCs outlined in this post have been essential in launching Genomics England's efforts to create a multi-modal environment that facilitates ML development for the purpose of tackling cancer. The implementation of state-of-the-art models in Genomics England's multi-modal environment and assistance in developing robust practices will ensure that users are maximally enabled in their research.

“At Genomics England, our mission is to realize the enormous potential of genomic and multi-modal information to further precision medicine and push the boundaries to realize the enormous potential of AWS cloud computing in its success.”

– Dr Prabhu Arumugam, Director of Clinical Data and Imaging, Genomics England

Acknowledgements

The results published in this blog post are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.


About the Authors

Cemre Zor, PhD, is a senior healthcare data scientist at Amazon Web Services. Cemre holds a PhD in theoretical machine learning and postdoctoral experiences on machine learning for computer vision and healthcare. She works with healthcare and life sciences customers globally to support them with machine learning modelling and advanced analytics approaches while tackling real-world healthcare problems.

Tamas Madl, PhD, is a former senior healthcare data scientist and business development lead at Amazon Web Services, with academic as well as industry experience at the intersection between healthcare and machine learning. Tamas helped customers in the Healthcare and Life Science vertical to innovate through the adoption of Machine Learning. He received his PhD in Computer Science from the University of Manchester.

Epameinondas Fritzilas, PhD, is a senior consultant at Amazon Web Services. He works hands-on with customers to design and build solutions for data analytics and AI applications in healthcare. He holds a PhD in bioinformatics and fifteen years of industry experience in the biotech and healthcare sectors.

Lou Warnett is a healthcare data scientist at Amazon Web Services. He assists healthcare and life sciences customers from across the world in tackling some of their most pressing challenges using data science, machine learning and AI, with a particular emphasis more recently on generative AI. Prior to joining AWS, Lou received a master’s in Mathematics and Computing at Imperial College London.

Sam Price is a Professional Services consultant specializing in AI/ML and data analytics at Amazon Web Services. He works closely with public sector customers in healthcare and life sciences to solve challenging problems. When not doing this, Sam enjoys playing guitar and tennis, and seeing his favorite indie bands.

Shreya Ruparelia is a data & AI consultant at Amazon Web Services, specialising in data science and machine learning, with a focus on developing GenAI applications. She collaborates with public sector healthcare organisations to create innovative AI-driven solutions. In her free time, Shreya enjoys activities such as playing tennis, swimming, exploring new countries and taking walks with the family dog, Buddy.

Pablo Nicolas Nunez Polcher, MSc, is a senior solutions architect working for the Public Sector team with Amazon Web Services. Pablo focuses on helping healthcare public sector customers build new, innovative products on AWS in accordance with best practices. He received his M.Sc. in Biological Sciences from Universidad de Buenos Aires. In his spare time, he enjoys cycling and tinkering with ML-enabled embedded devices.

Matthew Howard is the head of Healthcare Data Science and part of the Global Health and Non-Profits team in Amazon Web Services. He focuses on how data, machine learning and artificial intelligence can transform health systems and improve patient outcomes. He leads a team of applied data scientists who work with customers to develop AI-based healthcare solutions. Matthew holds a PhD in Biological Sciences from Imperial College London.

Tom Dyer is a Senior Product Manager at Genomics England. He was previously an Applied Machine Learning Engineer working within the Multimodal squad, where his work focussed on building multimodal learning frameworks that allow users to rapidly scale research in the cloud. He also works on developing ML tooling to organise pathology image datasets and explain model predictions at a cohort level.

Samuel Barnett is an applied machine learning engineer with Genomics England working on improving healthcare with machine learning. He is embedded with the Multimodal squad and is part of an ongoing effort to show the value of combining genomic, imaging, and text-based data in machine learning models.

Prabhu Arumugam is the former Director of Clinical Data and Imaging at Genomics England, having joined the organization in 2019. Prabhu trained in medicine at St. Bartholomew’s and the Royal London, trained in Histopathology, and completed his PhD at The Barts Cancer Institute on pancreatic pathology.

Francisco Azuaje, PhD, is the director of bioinformatics at Genomics England, where he provides cross-cutting leadership in strategy and research with a focus on data science and AI. With a career covering academia, the pharmaceutical industry, and the public sector, he has wide experience leading multidisciplinary teams in solving challenges involving diverse data sources and computational modelling approaches. With his expertise in bioinformatics and applied AI, Dr. Azuaje enables the translation of complex data into insights that can improve patient outcomes.

Align Meta Llama 3 to human preferences with DPO, Amazon SageMaker Studio, and Amazon SageMaker Ground Truth

Large language models (LLMs) have remarkable capabilities. Nevertheless, using them in customer-facing applications often requires tailoring their responses to align with your organization’s values and brand identity. In this post, we demonstrate how to use direct preference optimization (DPO), a technique that allows you to fine-tune an LLM with human preference data, together with Amazon SageMaker Studio and Amazon SageMaker Ground Truth to align the Meta Llama 3 8B Instruct model responses to your organization’s values.

Using SageMaker Studio and SageMaker Ground Truth for DPO

With DPO, you can fine-tune an LLM with human preference data such as ratings or rankings so that it generates outputs that align to end-user expectations. DPO is computationally efficient and helps enhance a model’s helpfulness, honesty, and harmlessness, divert the LLM from addressing specific subjects, and mitigate biases. In this technique, you typically start with selecting an existing or training a new supervised fine-tuned (SFT) model. You use the model to generate responses and you gather human feedback on these responses. After that, you use this feedback to perform DPO fine-tuning and align the model to human preferences.

Whether you are fine-tuning a pre-trained LLM with supervised fine-tuning (SFT) or loading an existing fine-tuned model for DPO, you typically need powerful GPUs, and the same applies during DPO fine-tuning itself. With Amazon SageMaker, you can get started and experiment rapidly by using managed Jupyter notebooks equipped with GPU instances: create a JupyterLab space in SageMaker Studio, the integrated development environment (IDE) purpose-built for machine learning (ML), and launch a JupyterLab application that runs on a GPU instance.

Orchestrating the end-to-end data collection workflow and developing an application for annotators to rate or rank model responses for DPO fine-tuning can be time-consuming. SageMaker Ground Truth offers human-in-the-loop capabilities that help you set up workflows, manage annotators, and collect consistent, high-quality feedback.

This post walks you through the steps of using DPO to align an SFT model’s responses to the values of a fictional digital bank called Example Bank. Your notebook runs in a JupyterLab space in SageMaker Studio powered by a single ml.g5.48xlarge instance (8 A10G GPUs). Optionally, you can choose to run this notebook inside a smaller instance type such as ml.g5.12xlarge (4 A10G GPUs) or ml.g6.12xlarge (4 L4 GPUs) with bitsandbytes quantization. You use Meta Llama 3 8B Instruct (the Meta Llama 3 instruction tuned model optimized for dialogue use cases from the Hugging Face Hub) to generate responses, SageMaker Ground Truth to collect preference data, and the DPOTrainer from the HuggingFace TRL library for DPO fine-tuning together with Parameter-Efficient Fine-Tuning (PEFT). You also deploy the aligned model to a SageMaker endpoint for real-time inference. You can use the same approach with other models.

Solution overview

The following diagram illustrates the approach.

The workflow contains the following key steps:

  1. Load the Meta Llama 3 8B Instruct model into SageMaker Studio and generate responses for a curated set of common and toxic questions. The dataset serves as the initial benchmark for the model’s performance.
  2. The generated question-answer pairs are stored in Amazon Simple Storage Service (Amazon S3). These will be presented to the human annotators later so they can rank the model responses.
  3. Create a workflow in SageMaker Ground Truth to gather human preference data for the responses. This involves creating a work team, designing a UI for feedback collection, and setting up a labeling job.
  4. Human annotators interact with the labeling portal to evaluate and rank the model’s responses based on their alignment to the organization’s values.
  5. The collected data is processed to adhere to the DPOTrainer expected format.
  6. Using the Hugging Face TRL library and the DPOTrainer, fine-tune the Llama 3 model using the processed data from the previous step.
  7. Test the fine-tuned model on a holdout evaluation dataset to assess its performance and verify it meets the desired standards.
  8. When you’re satisfied with the model performance, you can deploy it to a SageMaker endpoint for real-time inference at scale.

Prerequisites

To run the solution described in this post, you must have an AWS account set up, along with an AWS Identity and Access Management (IAM) role that grants you the necessary permissions to create and access the solution resources. If you are new to AWS and haven’t created an account yet, refer to Create a standalone AWS account.

To use SageMaker Studio, you need to have a SageMaker domain set up with a user profile that has the necessary permissions to launch the SageMaker Studio application. If you’re new to SageMaker Studio, the Quick Studio setup is the fastest way to get started. With a single click, SageMaker provisions the required domain with default presets, including setting up the user profile, IAM role, IAM authentication, and public internet access. The notebook associated with this post assumes the use of an ml.g5.48xlarge instance type. To review or increase your quota limits, navigate to the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for Studio JupyterLab Apps running on ml.g5.48xlarge instances.

Request an increase in quota value greater than or equal to 1 for experimentation.

Meta Llama 3 8B Instruct is available under the Llama 3 license. To download the model from Hugging Face, you need an access token. If you don’t already have one, navigate to the Settings page on the Hugging Face website to obtain it.

Make sure that the SageMaker Studio role has the necessary permissions for SageMaker Ground Truth and Amazon S3 access. When you’re working in SageMaker Studio, you’re already using an IAM role, which you’ll need to modify for launching SageMaker Ground Truth labeling jobs. To enable SageMaker Ground Truth functionality, you should attach the AWS managed policy AmazonSageMakerGroundTruthExecution to your SageMaker Studio role. This policy provides the essential permissions for creating and managing labeling jobs.

For Amazon S3 access, scoping permissions to specific buckets and actions enhances security and aligns with best practices. This approach adheres to the principle of least privilege, reducing potential risks associated with overly permissive policies. The following is an example of a restricted Amazon S3 policy that grants only the necessary permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET-NAME>",
                "arn:aws:s3:::<YOUR-BUCKET-NAME>/*"
            ]
        }
    ]
}

To add these policies to your SageMaker Studio role, complete the following steps:

  1. On the IAM console, find and choose your SageMaker Studio role (it usually starts with AmazonSageMaker-ExecutionRole-).
  2. On the Permissions tab, choose Add permissions and then Attach policies.
  3. Search for and attach AmazonSageMakerGroundTruthExecution.
  4. Create and attach the custom Amazon S3 inline policy as shown in the preceding example, if needed.

Remember to follow the principle of least privilege, granting only the permissions necessary for your specific use case. Regularly review your IAM roles and policies to validate their alignment with your security requirements. For more details on IAM policies for SageMaker Ground Truth, refer to Use IAM Managed Policies with Ground Truth.

Set up the notebook and environment

To get started, open SageMaker Studio and create a JupyterLab space. For Instance, choose ml.g5.48xlarge. Run the space, open JupyterLab, and clone the code in the following GitHub repository. You can configure the JupyterLab space to use up to 100 GB in your Amazon Elastic Block Store (Amazon EBS) volume. In addition, the ml.g5 instance family comes with NVMe SSD local storage, which you can use in the JupyterLab application. The NVMe instance store directory is mounted to the application container in /mnt/sagemaker-nvme. For this post, you use the NVMe storage available in the ml.g5.48xlarge instance.

When your space is ready, clone the GitHub repo and open the notebook llama3/rlhf-genai-studio/RLHF-with-Llama3-on-Studio-DPO.ipynb, which contains the solution code. In the pop-up, make sure that the Python 3 kernel is selected.

Let’s go through the notebook. First, install the necessary Python libraries:

import torch
import os
import sagemaker
import boto3
import datetime
from transformers import pipeline
import json
import asyncio
import aiofiles
from datasets import Dataset, load_dataset
from peft import (
    get_peft_model,
    LoraConfig,
    prepare_model_for_kbit_training,
)
import bitsandbytes as bnb
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForSequenceClassification
)
from IPython.display import display, HTML

The following line sets the default path where you store temporary artifacts to the location in the NVMe storage:

cache_dir = "/mnt/sagemaker-nvme"

This is local storage, which means that your data will be lost when the JupyterLab application is deleted, restarted, or patched. Alternatively, you can increase your EBS volume of your SageMaker Studio space to greater than or equal to 100 GB to provide sufficient storage for the Meta Llama 3 base model, PEFT adapter, and new merged fine-tuned model.

Load Meta Llama 3 8B Instruct in the notebook

After you have imported the necessary libraries, you can download the Meta Llama 3 8B Instruct model and its associated tokenizers from Hugging Face:

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=hf_access_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    cache_dir=cache_dir
)

model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=hf_access_token,
    cache_dir=cache_dir
)

Collect initial model responses for common and toxic questions

The example_bank_questions.txt file contains a list of common questions received by call centers in financial organizations combined with a list of toxic and off-topic questions.

Before you ask the model to generate answers to these questions, you need to specify the brand and core values of Example Bank. You will include these values in the prompt as context later so the model has the appropriate information it needs to respond.

company_context = """Example Bank is a next-generation digital bank on a mission to revolutionize the banking experience. Founded in 2020, we are committed to leveraging cutting-edge technology to make banking simple, accessible, and transparent for everyone. In Example Bank, we believe that banking should be seamless, intuitive, and tailored to the needs of modern consumers. Our founders, seasoned professionals from the tech and finance industries, set out to create a bank that puts people first, empowering them to take control of their finances with ease. At Example Bank, we envision a world where banking is no longer a chore but a delightful experience. We are dedicated to breaking down barriers and democratizing access to financial services. Our goal is to empower individuals and businesses alike by providing them with the tools and resources they need to thrive in an increasingly digital landscape.
Our values:
- Innovation: We embrace cutting-edge technologies and continuously seek out innovative solutions to deliver the best possible banking experience. We are a digital-only bank, which means we don't have any physical branches. Instead, we offer all of our services online or through our mobile app. This allows us to keep our costs low and pass the savings on to our customers.
- Transparency: We are committed to being direct and honest with our customers. We believe that transparency is key to building trust, and we want our customers to feel confident that they are making informed decisions about their money. That's why we provide clear and concise information about our products and services, and we are always available to answer any questions our customers may have.
- Accessibility: Our services are designed to be inclusive and user-friendly, catering to a diverse range of customers, regardless of their financial backgrounds.
- Security: We prioritize the safety and security of our customers' data and assets, employing state-of-the-art encryption and cybersecurity measures.
In addition to our core values, Example Bank offers a range of innovative financial products and services:
- Loans: Whether you’re looking to buy a home, start a business, or finance a major purchase, our flexible loan options are designed to meet your needs. With competitive interest rates and a simple application process, obtaining a loan has never been easier.
- Credit Cards: Our credit cards come with a host of benefits including cashback rewards, low-interest rates, and no annual fees. Manage your spending effortlessly with real-time notifications and intuitive budgeting tools.
- Mobile Apps: Our user-friendly apps on the Google Play Store and Apple App Store offer a seamless banking experience. From checking balances to transferring funds, our apps ensure you have complete control of your finances at your fingertips.
- Savings and Investments: Grow your wealth with our high-yield savings accounts and a variety of investment options. Our financial advisors are available to help you make informed decisions tailored to your financial goals.
- Customer Support: We provide 24/7 customer support to assist with any inquiries or issues. Our dedicated team is always ready to help, ensuring you receive the best possible service at all times.
At Example Bank, we are committed to enhancing your financial well-being through innovation, transparency, and unparalleled service. Join us today and experience the future of banking.
"""

Now you’re ready to invoke the model. For each question in the file, you construct a prompt that contains the context and the actual question. You send the prompt to the model four times to generate four different outputs and save the results in the llm_responses.json file.

questions = 'example_bank_questions.txt'
llm_responses = os.path.join(sample_files_path, 'llm_responses.json')

from timeit import default_timer as timer
import tqdm.asyncio

async def invoke_model(question, context):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [
        {"role": "user", "content": f"{context}: {question}"}
    ]

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    response = pipe(
        messages, 
        max_new_tokens=120, 
        do_sample=True,
        temperature=gl_temperature, 
        top_p=gl_top_p, 
        eos_token_id=terminators
    )[0]['generated_text'][-1]
    return response['content']

async def process_lines(file_path):
    results = []
    context = f"""{company_context} You are a customer service agent for {company_name} Sometimes you are smart with your answers. Answer the following customer question in one or two sentences:
    """
    async with aiofiles.open(file_path, 'r') as file:
        lines = [line async for line in file]
        for line in tqdm.asyncio.tqdm(lines, desc="Processing Question Bank"):
            start = timer()
            responses = await asyncio.gather(*[invoke_model(line, context) for _ in range(4)])
            result = {
                'context': context,
                'question': line.strip(),
                'responses': responses
            }
            end = timer()
            results.append(result)
    return results

results = await process_lines(questions)

with open(llm_responses, 'w') as file:
    json.dump(
        results, 
        file, 
        indent=4
    )

The following is an example entry from llm_responses.json.

Set up the SageMaker Ground Truth labeling job and human preference data

To fine-tune the model using DPO, you need to gather human preference data for the generated responses. SageMaker Ground Truth helps orchestrate the data collection process. It offers customizable labeling workflows and robust workforce management features for ranking tasks. This section shows you how to set up a SageMaker Ground Truth labeling job and invite a human workforce with requisite expertise to review the LLM responses and rank them.

Set up the workforce

A private workforce in SageMaker Ground Truth consists of individuals who are specifically invited to perform data labeling tasks. These individuals can be employees or contractors who have the required expertise to evaluate the model’s responses. Setting up a private workforce helps achieve data security and quality by limiting access to trusted individuals for data labeling.

For this use case, the workforce consists of the group of people who will rank the model responses. You can set up a private workforce using the SageMaker console by creating a private team and inviting members through email. For detailed instructions, refer to Create a Private Workforce (Amazon SageMaker Console).

Create the instruction template

With the instruction template, you can manage the UI and guide human annotators in reviewing model outputs. It needs to clearly present the model responses and provide a straightforward way for the annotators to rank them. Here, you use the text ranking template. This template allows you to display the instructions for the human reviewer and the prompts with the pregenerated LLM responses. The annotator reviews the prompt and responses and ranks the latter based on their alignment to the organization’s brand.

The definition of the template is as follows. The template shows a pane on the left with instructions from the job requester, a prompt at the top, and the generated LLM responses in the main body. The right side of the UI is where the annotator ranks the responses from most to least preferable.

  <html>
  <head>
    <meta charset="UTF-8" />
    <link rel="stylesheet" href="https://assets.crowd.aws/css/gen-ai-components.css" />
    <link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2290%22>🥇</text></svg>" />
    <title>Text Ranking Tool</title>
    <script src="https://assets.crowd.aws/gen-ai-components.js"></script>
  </head>

  <body>
    <div>
      <crowd-text-ranking
        crowd-form-element-id="crowd-form-submit"
        instructions='Rank the following responses from a language model according to their alignment to the organisation&#39;s brand.'
        ordinal-ranking-dimensions='[{"name":"BrandValue","allowTie":true}]'
        text='{{ task.input.source }}'
        responses='{{ task.input.responses | to_json }}' />
    </div>
    <crowd-form id="crowd-form-submit" style="display: none"></crowd-form>
    <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
  </body>
</html>

The template is saved locally on your Studio JupyterLab space EBS volume as instructions.template in a temporary directory. Then you upload this template file to your designated S3 bucket using s3.upload_file(), placing it in the specified bucket and prefix. This Amazon S3 hosted template will be referenced when you create the SageMaker Ground Truth labeling job, so workers see the correct interface for the text ranking task.
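A minimal sketch of that upload step follows; the bucket name and prefix are placeholders:

import boto3

s3 = boto3.client("s3")
bucket, prefix = "<YOUR-BUCKET-NAME>", "ground-truth/templates"   # placeholders
s3.upload_file("instructions.template", bucket, f"{prefix}/instructions.template")
UI_TEMPLATE_S3_URI = f"s3://{bucket}/{prefix}/instructions.template"   # referenced by the labeling job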

Preprocess the input data

Before you create the labeling job, verify that the input data matches the format expected by SageMaker Ground Truth and is saved as a JSON file in Amazon S3. You can use the prompts and responses in the llm_responses.json file to create the manifest file inp-manifest-trank.json. Each row in the manifest file contains a JSON object (source-responses pair). The previous entry now looks like the following code.
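For illustration only (real entries come from your llm_responses.json file), a single manifest line pairs a source field holding the prompt with a responses field holding the candidate answers, matching the task.input.source and task.input.responses references in the template:

import json

# One illustrative line of inp-manifest-trank.json; the text shown here is made up.
manifest_entry = {
    "source": "Example Bank context ... : How do I report a lost credit card?",
    "responses": [
        "You can freeze and replace your card instantly in the mobile app.",
        "Call our 24/7 support line and we will block the card for you.",
        "Log in online, go to Cards, and select 'Report lost or stolen'.",
        "Visit any branch to request a replacement.",   # off-brand: Example Bank has no branches
    ],
}
with open("inp-manifest-trank.json", "a") as f:
    f.write(json.dumps(manifest_entry) + "\n")   # one JSON object per row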

Upload the structured data to the S3 bucket so that it can be ingested by SageMaker Ground Truth.

Create the labeling job

Now you’re ready to configure and launch the labeling job using the SageMaker API from within the notebook. This involves specifying the work team, UI template, and data stored in the S3 bucket. By setting appropriate parameters such as task time limits and the number of workers per data object, you can run jobs efficiently and effectively. The following code shows how to start the labeling job:

sm_client.create_labeling_job(
    LabelingJobName=labeling_job_name,
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': model_responses_s3_uri
            }
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://{}/{}/output/'.format(bucket,prefix) #Enter S3 URI of Output folder
    },
    RoleArn=role, 
    HumanTaskConfig={
        'WorkteamArn': WORKTEAM_ARN,
        'UiConfig':{
            'UiTemplateS3Uri': UI_TEMPLATE_S3_URI
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-PassThrough',
        'TaskKeywords': [
            'QnA',
        ],
        'TaskTitle': 'Rank LLM responses',
        'TaskDescription': "Rank the responses provided by the LLM",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 60*30,
        'TaskAvailabilityLifetimeInSeconds': 60*60*24*10,
        'MaxConcurrentTaskCount': 100,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-PassThrough'
        } 
    }
)

As the job is launched, it’s essential to monitor its progress closely, making sure tasks are being distributed and completed as expected.

Gather human feedback through the labeling portal

When the job setup is complete, annotators can log in to the labeling portal and start ranking the model responses.

Workers can first consult the Instructions pane to understand the task, then use the main interface to evaluate and rank the model’s responses according to the given criteria. The following screenshot illustrates the UI.

The human feedback is collected and stored in an S3 bucket. This feedback will be the basis for DPO. With this data, you will fine-tune the Meta Llama 3 model and align its responses with the organization’s values, improving its overall performance.

Align Meta Llama 3 8B Instruct with the DPOTrainer

In this section, we show how to use the preference dataset that you prepared using SageMaker Ground Truth to fine-tune the model using DPO. DPO explicitly optimizes the model’s output based on human evaluations. It aligns the model’s behavior more closely with human expectations and improves its performance on tasks requiring nuanced understanding and contextual appropriateness. By integrating human preferences, DPO enhances the model’s relevance, coherence, and overall effectiveness in generating desired responses.

DPO makes it more straightforward to preference-tune a model in comparison to other popular techniques such as Proximal Policy Optimization (PPO). DPO eliminates the necessity for a separate rewards model, thereby avoiding the cost associated with training it. Additionally, DPO requires significantly less data to achieve performance comparable to PPO.

Fine-tuning a language model using DPO consists of two steps:

  1. Gather a preference dataset with positive and negative selected pairs of generation, given a prompt.
  2. Maximize the log-likelihood of the DPO loss directly.

To learn more about the DPO algorithm, refer to the original DPO whitepaper, Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (Rafailov et al., 2023).
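For reference, the DPO objective from that paper maximizes the log-likelihood of preferring the chosen response $y_w$ over the rejected response $y_l$ relative to a frozen reference policy $\pi_{\mathrm{ref}}$, with $\beta$ (the same beta hyperparameter you pass to the trainer later) controlling how strongly the preference signal is weighted:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$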

Expected data format

The DPO trainer expects a very specific format for the dataset, which contains sentence pairs where one sentence is a chosen response and the other is a rejected response. This is represented as a Python dictionary with three keys:

  • prompt – Consists of the context prompt given to a model at inference time for text generation
  • chosen – Contains the preferred generated response to the corresponding prompt
  • rejected – Contains the response that is not preferred or should not be the sampled response for the given prompt

The following function definition illustrates how to process the data stored in Amazon S3 to create a DPO dataset with sample pairs and a prompt:

def return_prompt_and_responses(samples, index):
    # Build the prompt from the stored context and question
    prompt_text = f"{samples['context']}\n\n{samples['question']}"
    # Rank 1 is the most preferred response, rank 4 the least preferred
    chosen_index = response_rankings[index]["responseRankings"].index(1)
    rejected_index = response_rankings[index]["responseRankings"].index(4)

    prompt_messages = [{"role": "user", "content": prompt_text}]

    chosen_messages = [
        {"role": "assistant", "content": samples["responses"][chosen_index]},
    ]
    rejected_messages = [
        {"role": "assistant", "content": samples["responses"][rejected_index]},
    ]

    return {
        "prompt": tokenizer.apply_chat_template(prompt_messages, tokenize=False),
        "chosen": tokenizer.apply_chat_template(chosen_messages, tokenize=False).replace('<|begin_of_text|>', ''),
        "rejected": tokenizer.apply_chat_template(rejected_messages, tokenize=False).replace('<|begin_of_text|>', '')
    }

Here is an example sentence pair:

You split the DPO trainer dataset into train and test samples using an 80/20 split and tokenize the dataset in preparation for DPO fine-tuning:

dataset = prepared_dataset.train_test_split(test_size=0.2)

dataset["train"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "train_dataset.json"),
    orient="records",
    index=False
)

dataset["test"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "test_dataset.json"),
    orient="records",
    index=False
)

Supervised fine-tuning using DPO

Now that the dataset is formatted for the DPO trainer, you can use the train and test datasets prepared earlier to initiate the DPO model fine-tuning. Meta Llama 3 8B is a relatively small language model, but even it barely fits on a SageMaker ML instance like ml.g5.48xlarge in fp16 or fp32, leaving little room for full fine-tuning. You can use PEFT with DPO to fine-tune Meta Llama 3 8B’s responses based on human preferences. PEFT is a fine-tuning method that trains only a subset of the pre-trained model’s parameters: it identifies the parameters most important for the new task and updates only those during training. This significantly reduces the computation required for fine-tuning. See the following code:

from peft import LoraConfig

# configure PEFT module
peft_config = LoraConfig(
    r=512,                       # LoRA rank
    lora_alpha=1024,             # LoRA scaling factor
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

For a full list of LoraConfig training arguments, refer to LoRA. At a high level, you need to initialize the DPOTrainer with the following components: the model you want to train, a reference model (ref_model) used to calculate the implicit rewards of the preferred and rejected responses, the beta hyperparameter that controls the balance between the implicit rewards assigned to the preferred and rejected responses, and a dataset containing prompt, chosen, and rejected responses. If ref_model=None, the trainer will create a reference model with the same architecture as the input model to be optimized. See the following code:

from trl import DPOConfig, DPOTrainer

dpo_model_dir = "/path/to/save/dpo/model"

args = DPOConfig(
    output_dir=dpo_model_dir,               # directory to save and repository id
    num_train_epochs=5,                     # number of training epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim = "adamw_torch_fused",            # use fused adamw optimizer
    learning_rate=1e-5,                     # learning rate for DPO fine-tuning
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.1,                       # warmup ratio based on QLoRA paper
    lr_scheduler_type="cosine",             # use cosine learning rate scheduler
    logging_steps=10,                       
    save_steps=10,                         # when to save checkpoint
    evaluation_strategy="steps",            
    eval_steps=100,
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    push_to_hub=False,                      # don't push the model to the Hugging Face Hub
    report_to='tensorboard',
    remove_unused_columns=False
)

dpo_args = {
    "beta": 0.1,                            # The beta factor in DPO loss. Higher beta means less divergence
    "loss_type": "sigmoid"                  # The loss type for DPO.
}

trainer = DPOTrainer(
    model,
    ref_model=None,
    peft_config=peft_config,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=max_seq_length,
    max_prompt_length=prompt_length,
    beta=dpo_args["beta"],
    loss_type=dpo_args["loss_type"],
)

# kick-off model training
trainer.train()

Once you start the training, you can see the status in the notebook:

When model fine-tuning is complete, save the PEFT adapter model to disk and merge it with the base model to create a newly tuned model. You can use the saved model for local inference and validation or deploy it as a SageMaker endpoint after you have gained sufficient confidence in the model’s responses.

peft_output_dir = "/path/to/save/tuned/model/"
print(f"saving peft model to: {peft_output_dir}")
trainer.save_model(output_dir=peft_output_dir)
...
...
merged_model = model.merge_and_unload()
...
...
merged_model.save_pretrained(
    new_dpo_output_dir,
    safe_serialization=True,
    max_shard_size="9GB"
)

Evaluate the fine-tuned model inside a SageMaker Studio notebook

Before you host your model for inference, verify that its response optimization aligns with user preferences. You can collect the model’s response both before and after DPO fine-tuning and compare them side by side, as shown in the following table.

The DPO Model Response column shows the DPO-aligned model’s response after fine-tuning, and the Rejected Model Response column shows the model’s response to the input prompt before fine-tuning.
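The following is a minimal sketch of how such a side-by-side comparison could be produced inside the notebook. It assumes base_model is a freshly loaded copy of the original Meta Llama 3 8B Instruct model and reuses the merged_model and tokenizer from the previous step; the prompt is a hypothetical example.

eval_prompts = [
    "Explain the company's refund policy in a friendly tone.",  # hypothetical prompt
]

def generate_response(model, prompt, max_new_tokens=256):
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

for prompt in eval_prompts:
    print("Rejected (base) model response:", generate_response(base_model, prompt))
    print("DPO model response:", generate_response(merged_model, prompt))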

Deploy the model to a SageMaker endpoint

After you have gained sufficient confidence in your model, you can deploy it to a SageMaker endpoint for real-time inference. SageMaker endpoints are fully managed and provide auto scaling capabilities. For this post, we use DJL Serving to host the fine-tuned, DPO-aligned Meta Llama 3 8B model. To learn more about hosting your LLM using DJL Serving, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

To deploy an LLM directly from your SageMaker Studio notebook using DJL Serving, complete the following steps:

  1. Upload model weights and other model artifacts to Amazon S3.
  2. Create a meta-model definition file called serving.properties. This definition file dictates how the DJL Serving container is configured for inference.

engine = DeepSpeed
option.tensor_parallel_degree = 1
option.s3url = s3://<MY-TEST-BUCKET>/llama3-dpo-ft/modelweights
option.hf_access_token=hf_xx1234

  3. Create a custom inference file called model.py, which defines the custom inference logic:
%%writefile llama3-serving-model/model.py

from djl_python import Input, Output
...

predictor = None


def get_model(properties):

    ...
    return generator


def handle(inputs: Input) -> None:
    ...
    outputs = predictor(message, **generation_kwargs)[0]['generated_text'][-1]
    result = {"outputs": outputs['content']}
    return Output().add(result)
  4. Deploy the DPO fine-tuned model as a SageMaker endpoint:
from sagemaker import image_uris
from sagemaker.model import Model
from datetime import datetime

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=region,
    version="0.23.0"
)

...

dpo_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=f"ep-{dpo_model.name}",
    container_startup_health_check_timeout=900,
    wait=False, # <-- Set to True, if you would prefer to wait 6-8 minutes for the endpoint to spin up
)
  5. Invoke the hosted model for inference using the sagemaker.Predictor class:
dpo_ft_predictor = sagemaker.Predictor(
    endpoint_name="my_custom_dpo_endpoint",
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)
...
# invoke inference
response = dpo_ft_predictor.predict(
    {
        "inputs": content,
        "parameters": parameters
    }
)

Clean up

After you complete your tasks in the SageMaker Studio notebook, remember to stop your JupyterLab workspace to prevent incurring additional charges. You can do this by choosing Stop next to your JupyterLab space. Additionally, you have the option to set up lifecycle configuration scripts that will automatically shut down resources when they’re not in use.

If you deployed the model to a SageMaker endpoint, run the following code at the end of the notebook to delete the endpoint:

#delete your endpoint
sm_client.delete_endpoint(EndpointName=tg_sm_model.endpoint_name)

Conclusion

Amazon SageMaker offers tools to streamline the process of fine-tuning LLMs to align with human preferences. With SageMaker Studio, you can experiment interactively with different models, questions, and fine-tuning techniques. With SageMaker Ground Truth, you can set up workflows, manage teams, and collect consistent, high-quality human feedback.

In this post, we showed how to enhance the performance of Meta Llama 3 8B Instruct by fine-tuning it using DPO on data collected with SageMaker Ground Truth. To get started, launch SageMaker Studio and run the notebook available in the following GitHub repo. Share your thoughts in the comments section!


About the Authors

Anastasia Tzeveleka is a GenAI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers build foundation models and create scalable generative AI and machine learning solutions using AWS services.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.

Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers build scalable and cost-efficient AI/ML pipelines with Human in the Loop services. In his free time, Sundar loves traveling, sports and enjoying outdoor activities with his family.

Read More

Amazon EC2 P5e instances are generally available

Amazon EC2 P5e instances are generally available

State-of-the-art generative AI models and high performance computing (HPC) applications are driving the need for unprecedented levels of compute. Customers are pushing the boundaries of these technologies to bring higher fidelity products and experiences to market across industries.

The size of large language models (LLMs), as measured by the number of parameters, has grown exponentially in recent years, reflecting a significant trend in the field of AI. Model sizes have increased from billions of parameters to hundreds of billions of parameters within a span of 5 years. As LLMs have grown larger, their performance on a wide range of natural language processing tasks has also improved significantly, but the increased size of LLMs has led to significant computational and resource challenges. Training and deploying these models requires vast amounts of computing power, memory, and storage.

The size of an LLM has a significant impact on the choice of compute needed for inference. Larger LLMs require more GPU memory to store the model parameters and intermediate computations, as well as greater computational power to perform the matrix multiplications and other operations needed for inference. Large LLMs take longer to perform a single inference pass due to this increased computational complexity. This increased compute requirement can lead to higher inference latency, which is a critical factor for applications that require real-time or near real-time responses.
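As a rough illustration of why model size drives GPU memory requirements, the weights alone of an LLM served in 16-bit precision occupy about 2 bytes per parameter. The following back-of-the-envelope sketch ignores activations and the KV cache, which add further memory on top of this figure.

def weight_memory_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    # fp16/bf16 weights use 2 bytes per parameter; Int8 would use 1 byte per parameter
    return num_params_billions * bytes_per_param

print(weight_memory_gb(70))   # ~140 GB of GPU memory just for 70B fp16 weights
print(weight_memory_gb(405))  # ~810 GB of GPU memory just for 405B fp16 weights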

HPC customers exhibit similar trends. With the fidelity of HPC customer data collection increasing and datasets reaching exabyte scale, customers are looking for ways to enable faster time to solution across increasingly complex applications.

To address customer needs for high performance and scalability in deep learning, generative AI, and HPC workloads, we are happy to announce the general availability of Amazon Elastic Compute Cloud (Amazon EC2) P5e instances, powered by NVIDIA H200 Tensor Core GPUs. AWS is the first leading cloud provider to offer the H200 GPU in production. Additionally, we are announcing that P5en instances, a network optimized variant of P5e instances, are coming soon.

In this post, we discuss the core capabilities of these instances and the use cases they’re well-suited for, and walk you through an example of how to get started with these instances and carry out inference deployment of Meta Llama 3.1 70B and 405B models on them.

EC2 P5e instances overview

P5e instances are powered by NVIDIA H200 GPUs with 1.7 times more GPU memory capacity and 1.5 times faster GPU memory bandwidth as compared to NVIDIA H100 Tensor Core GPUs featured in P5 instances.

P5e instances incorporate 8 NVIDIA H200 GPUs with 1128 GB of high bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TiB of system memory, and 30 TB of local NVMe storage. P5e instances also provide 3,200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, enabling lower latency and efficient scale-out performance by bypassing the CPU for internode communication.

The following summarizes the details of the p5e.48xlarge instance.

  • vCPUs: 192
  • Instance memory: 2 TiB
  • GPU: 8 x NVIDIA H200
  • GPU memory: 1,128 GB HBM3e
  • Network bandwidth: 3,200 Gbps EFA
  • GPUDirect RDMA: Yes
  • GPU peer to peer: 900 GB/s NVSwitch
  • Instance storage: 8 x 3.84 TB NVMe SSD
  • EBS bandwidth: 80 Gbps

EC2 P5en instances coming soon

One of the bottlenecks in GPU-accelerated computing may lie in the communication between CPUs and GPUs. The transfer of data between these two components can be time-consuming, especially for large datasets or workloads that require frequent data exchanges. This challenge can impact a wide range of GPU-accelerated applications such as deep learning, high-performance computing, and real-time data processing. The need to move data between the CPU and GPU can introduce latency and reduce overall efficiency. Additionally, network latency can become an issue for ML workloads on distributed systems, because data needs to be transferred between multiple machines.

EC2 P5en instances, coming soon in 2024, can help solve these challenges. P5en instances pair the NVIDIA H200 GPUs with custom 4th Generation Intel Xeon Scalable processors, enabling PCIe Gen 5 between CPU and GPU. These instances will provide up to four times the bandwidth between CPU and GPU and lower network latency, thereby improving workload performance.

P5e use cases

P5e instances are ideal for training, fine-tuning, and running inference for increasingly complex LLMs and multimodal foundation models (FMs) behind the most demanding and compute-intensive generative AI applications, including question answering, code generation, video and image generation, speech recognition, and more.

Customers deploying LLMs for inference can benefit from using P5e instances, which offer several key advantages that make them an excellent choice for these workloads.

Firstly, the higher memory bandwidth of the H200 GPUs in the P5e instances allows the GPU to fetch and process data from memory more quickly. This translates to reduced inference latency, which is critical for real-time applications like conversational AI systems where users expect near-instant responses. The higher memory bandwidth also enables higher throughput, allowing the GPU to process more inferences per second. Customers deploying the 70-billion-parameter Meta Llama 3.1 model on P5e instances can expect up to 1.87 times higher throughput and up to 40% lower cost compared to using comparable P5 instances (measured with an input sequence length of 121, output sequence length of 5,000, batch size of 10, and the vLLM framework).

Secondly, the massive scale of modern LLMs, with hundreds of billions of parameters, requires an immense amount of memory to store the model and intermediate computations during inference. On the standard P5 instances, this would likely necessitate the use of multiple instances to accommodate the memory requirements. However, the P5e instances’ 1.76 times higher GPU memory capacity enables you to scale up by using a single instance to fit the entire model. This avoids the complexity and overhead associated with distributed inference systems, such as data synchronization, communication, and load balancing. Customers deploying the 405-billion-parameter Meta Llama 3.1 model on a single P5e instance can expect up to 1.72 times higher throughput and up to 69% lower cost compared to using two P5 instances (measured with an input sequence length of 121, output sequence length of 50, batch size of 10, and the vLLM framework).

Finally, the higher GPU memory of the P5e instances also enables the use of larger batch sizes during inference for better GPU utilization, resulting in faster inference times and higher overall throughput. This additional memory can be particularly beneficial for customers with high-volume inference requirements.

When optimizing inference throughput and cost, consider adjusting batch size, input/output sequence length, and quantization level, because these parameters can have a substantial impact. Experiment with different configurations to find the optimal balance between performance and cost for your specific use case.
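Because the preceding benchmarks were run with the vLLM framework, the following is a minimal sketch of serving Meta Llama 3.1 70B Instruct with vLLM on a single P5e instance, spreading the model across all 8 H200 GPUs. The model ID, context length, and sampling parameters are assumptions you would adjust for your workload.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed Hugging Face model ID
    tensor_parallel_size=8,                      # one shard per H200 GPU
    gpu_memory_utilization=0.90,                 # leave headroom for the KV cache
    max_model_len=8192,
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain the benefits of higher GPU memory bandwidth."], sampling)
print(outputs[0].outputs[0].text)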

In summary, the combination of higher memory bandwidth, increased GPU memory capacity, and support for larger batch sizes make the P5e instances an excellent choice for customers deploying LLM inference workloads. These instances can deliver significant performance improvements, cost savings, and operational simplicity compared to alternative options.

P5e instances are also well-suited for memory-intensive HPC applications like simulations, pharmaceutical discovery, seismic analysis, weather forecasting, and financial modeling. Customers using dynamic programming (DP) algorithms for applications like genome sequencing or accelerated data analytics can also see further benefit from P5e through support for the DPX instruction set.

Get started with P5e instances

You can launch P5e instances using AWS Deep Learning AMIs (DLAMI), which provide ML practitioners and researchers with the infrastructure and tools to quickly build scalable, secure, distributed ML applications in preconfigured environments. You can also run containerized applications on P5e instances with AWS Deep Learning Containers using libraries for Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).
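For example, the following sketch launches a p5e.48xlarge instance from a DLAMI with the AWS SDK for Python (Boto3). The AMI ID, key pair, and capacity reservation ID are placeholders; because P5e capacity is purchased through Amazon EC2 Capacity Blocks for ML, the request references your reserved block.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # US East (Ohio)

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: your DLAMI ID
    InstanceType="p5e.48xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",             # placeholder key pair
    InstanceMarketOptions={"MarketType": "capacity-block"},
    CapacityReservationSpecification={
        "CapacityReservationTarget": {"CapacityReservationId": "cr-0123456789abcdef0"}
    },
)
print(response["Instances"][0]["InstanceId"])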

P5e instances now available

EC2 P5e instances are now available in the US East (Ohio) AWS Region in the p5e.48xlarge size through Amazon EC2 Capacity Blocks for ML. For more information, refer to Amazon EC2 P5 Instances.


About the authors

Avi Kulkarni is a Senior Specialist focusing on worldwide business development and go-to-market for ML and HPC workloads across both commercial and public sector customers. Previously, he managed partnerships at AWS and led product management for automotive customers at Honeywell, covering electrified, autonomous, and traditional vehicles.

Karthik Venna is a Principal Product Manager at AWS. He leads development of EC2 instances for a wide variety of workloads including deep learning and generative AI.

Khaled Rawashdeh is a Senior Product Manager at AWS. He defines and creates Amazon EC2 accelerated computing instances for the most demanding AI/machine learning workloads. Before joining AWS, he worked for leading companies focusing on creating data center software and systems for enterprise customers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Pavel Belevich is a Senior Applied Scientist on the ML Frameworks team at Amazon Web Services. He applies his research in distributed training and inference of large models to real-life customer needs. Before joining AWS, Pavel worked on the PyTorch Distributed team on various distributed training techniques such as FSDP and pipeline parallelism.

Dr. Maxime Hugues is a Principal WW Specialist Solutions Architect GenAI at AWS, which he joined in 2020. He holds an M.E. from the French National Engineer School “ISEN-Toulon”, an M.S. degree from the University of Science, and a Ph.D. in Computer Science (2011) from the University of Lille 1. His research mainly focused on programming paradigms, innovative hardware for extreme-scale computers, and HPC/machine learning performance. Prior to joining AWS, he worked as an HPC research scientist and tech lead at TotalEnergies.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Read More

Exploring data using AI chat at Domo with Amazon Bedrock

Exploring data using AI chat at Domo with Amazon Bedrock

This post is co-written with Joe Clark from Domo.

Data insights are crucial for businesses to enable data-driven decisions, identify trends, and optimize operations. Traditionally, gaining these insights required skilled analysts using specialized tools, which can make the process slow and less accessible.

Generative artificial intelligence (AI) has revolutionized this by allowing users to interact with data through natural language queries, providing instant insights and visualizations without needing technical expertise. This can democratize data access and speed up analysis.

However, companies can face challenges when using generative AI for data insights, including maintaining data quality, addressing privacy concerns, managing model biases, and integrating AI systems with existing workflows.

Domo is a cloud-centered data experiences innovator that empowers users to make data-driven decisions. Powered by AI and data science, Domo’s user-friendly dashboards and apps make data actionable, driving exponential business impact. Domo connects, transforms, visualizes, and automates data through simple integrations and intelligent automation, strengthening the entire data journey.

In this post, we share how Domo uses Amazon Bedrock to provide a flexible and powerful AI solution.

Domo’s purpose of using generative AI

The Domo enterprise data environment caters to a diverse customer base with varying data-driven requirements. Domo works with organizations that place a strong emphasis on deriving actionable insights from their data assets. Domo’s existing solution already enables these organizations to extract valuable insights through data visualization and analysis. The next step is to provide them with a more intuitive and conversational interface to interact with their data, empowering them to generate meaningful visualizations and reports through natural language interactions.

Domo.AI powered by Amazon Bedrock

Domo.AI simplifies data exploration and analysis by intelligently guiding you at every turn, from data preparation to forecasting to automation. It does this with natural language conversation, contextual and personalized insights with narrative and visual responses, and robust security and governance for a guided risk control experience.

Domo’s AI Service Layer is the foundation of the Domo.AI experience. Domo uses the Domo AI Service Layer with Amazon Bedrock to provide customers with a flexible and powerful AI solution. The AI Service Layer allows Domo to switch between different models provided by Amazon Bedrock for individual tasks and track their performance across key metrics like accuracy, latency, and cost. This enables Domo to optimize model performance through prompt engineering, preprocessing, and postprocessing, and provide contextual information and examples to the AI system. The AI Service Layer and its integration with Amazon Bedrock empower Domo to offer their customers the tools they need to harness AI throughout their organization, from data exploration using natural language-driven AI chat to custom applications and automations powered by a variety of AI models.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can quickly experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that orchestrate tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you’re already familiar with.

Solution overview

The following diagram illustrates the solution architecture and data flow.

The workflow includes the following steps:

  1. End-users interact with Domo.AI either through their website or mobile app. The end-user request first goes through an AI chat agent. The AI chat agent uses the capability of large language models (LLMs) to interpret user input, determine how to solve the user question or request using available tools, and form a final response. The request goes through guardrails, which are mechanisms and strategies to enforce the responsible, ethical, and safe use of the AI model. This helps make sure the responses generated by the AI chat agent are aligned with the organization’s responsible AI policies and don’t contain inappropriate or harmful content. Domo uses custom business logic to implement safeguards in their generative AI applications that are customized to their customers’ use cases and responsible AI policies.
  2. The Agent Planner component is responsible for orchestrating the various tasks required to fulfill the end-user request. It calls the Amazon Bedrock service through an API to create an execution plan, which involves selecting the appropriate tools and models to retrieve relevant information or perform custom actions. The tools refer to the various capabilities or actions that the AI chat agent can use to gather information and perform tasks. The tools provide the agent with access to data and functionality beyond what is available in the underlying LLM.
  3. The Tool Execution component is the process of invoking the selected tools and integrating their outputs to generate the final response. This allows the agent to go beyond the knowledge contained in the LLM and incorporate up-to-date information or perform domain-specific operations.
  4. As tools are run, user input is used to find semantically relevant information using vector search or to query private data from sources such as Amazon Redshift using Domo Cloud Amplifier, which is a native integration with cross-cloud systems to unlock data products at the speed businesses need them. Vector search is a technique used to find semantically relevant information from unstructured data sources, such as knowledge base articles or other documents. By creating vector embeddings of the content, the AI chat agent can efficiently search for and retrieve information that is most relevant to the user’s query, even if the exact phrasing isn’t present in the source material (see the illustrative sketch after this list).
  5. Information such as search or query results from Step 3 is returned to the AI chat agent, where either the agent solver component can aggregate results to formulate a final response, or, in the case of more complex queries, the agent can run another tool.
  6. Each of the components in the solution use the Domo AI Service Layer with Amazon Bedrock for planning and reasoning capabilities, converting user questions into SQL queries, or creating embeddings for vector search, and returning results in a natural language answer to the user’s question grounded by private customer data.
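To make the vector search step more concrete, the following is an illustrative sketch (not Domo’s implementation) that embeds a user question and a small set of documents with an Amazon Bedrock embeddings model, then ranks the documents by cosine similarity. The model ID and sample documents are assumptions.

import json
import math
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    # Call an Amazon Bedrock embeddings model (assumed model ID) for a single string
    body = json.dumps({"inputText": text})
    resp = bedrock_runtime.invoke_model(modelId="amazon.titan-embed-text-v1", body=body)
    return json.loads(resp["body"].read())["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = [
    "Q3 revenue grew 12% quarter over quarter.",      # hypothetical documents
    "Onboarding guide for new analysts.",
]
query_vec = embed("How did revenue change last quarter?")
ranked = sorted(docs, key=lambda d: cosine(embed(d), query_vec), reverse=True)
print(ranked[0])  # most semantically relevant document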

The following video of Domo.AI provides a more detailed overview of the product’s key features and capabilities.

Why Domo chose Amazon Bedrock

Domo chose Amazon Bedrock for the following benefits and features:

  • Model choice – Amazon Bedrock provides Domo with access to a wide range of models, including best-in-class options and those from various providers such as Anthropic, AI21 Labs, Cohere, Meta, and Stability AI. This variety allows Domo to extensively test their services using different models, enabling them to select the most suitable option for each specific use case. As a result, Domo can accelerate their development process and deliver value to their customers more rapidly by taking advantage of this flexibility in model selection and experimentation.
  • Security, compliance, and global infrastructure – Amazon Bedrock addresses crucial security and compliance concerns for Domo and their customers. With Amazon Bedrock, Domo makes sure that data remains within the AWS hosting environment, helping prevent model providers from accessing or training on customer data. With encryption in transit and at rest, along with restricted access to model deployment accounts, Amazon Bedrock provides robust data protection. Additionally, Domo has implemented multiple guardrails with varied control combinations to suit different applications and use cases. Amazon Bedrock offers a single API for inference, which facilitates secure communication between users and the FM. Additionally, the global infrastructure and compliance features of Amazon Bedrock enable Domo to scale and deploy their generative AI applications worldwide while adhering to data privacy laws and best practices.
  • Cost – By using Amazon Bedrock, Domo has achieved significant cost savings, reporting a 50% reduction compared to similar models from other providers. The serverless access to high-quality LLMs eliminates the need for substantial upfront infrastructure investments typically associated with LLMs. This cost-effective approach allows Domo to experiment with and test various models without incurring the hefty expenses usually linked to LLM implementation and maintenance, thereby optimizing their resource allocation and improving overall operational efficiency.

In the following video, Joe Clark, Software Architect at Domo, shares how AWS has been instrumental for Domo in the generative AI space.

Getting started with Amazon Bedrock

With Amazon Bedrock, teams and individuals can immediately start using FMs without having to worry about provisioning infrastructure or setting up and configuring ML frameworks.

Before you get started, verify that your user or role has permission to create or modify Amazon Bedrock resources. For details, see Identity-based policy examples for Amazon Bedrock.

To access the models in Amazon Bedrock, on the Amazon Bedrock console, choose Model access in the navigation pane. Review the EULA and enable the FMs you’d like in your account.

You can start interacting with the FMs through the following methods:
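For example, one common method is calling the Amazon Bedrock runtime API from the AWS SDK for Python (Boto3). The following is a minimal sketch that assumes you have enabled an Anthropic Claude model in your account and in the chosen Region; substitute any model ID you have access to.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID
    messages=[{"role": "user", "content": [{"text": "Summarize last quarter's sales trends."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])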

Conclusion

Amazon Bedrock has been instrumental in enhancing data insights and visualization capabilities at Domo through generative AI. By providing flexibility in FM selection, secure access, and a fully managed experience, Amazon Bedrock has enabled Domo to deliver more value to their customers while reducing costs. The service’s security and compliance features have also allowed Domo to serve customers in highly regulated industries. By using Amazon Bedrock, Domo has seen a 50% reduction in cost compared to a similarly performing model from another provider.

If you’re ready to start building your own FM innovation with Amazon Bedrock, refer to Getting started with Amazon Bedrock. To learn more about other intriguing Amazon Bedrock applications, see the Amazon Bedrock section of the AWS Machine Learning Blog.


About the Authors

Joe Clark is a software architect for the Domo Labs team and the lead architect for Domo’s AI Service Layer, AI Chat, and Model Management. At Domo, Joe has also led development of features including Jupyter Workspaces, Sandbox, and Code Engine. With 15 years of professional software development experience, he has previously worked on IoT and smart city initiatives.

Aman Tiwari is a General Solutions Architect working with independent software vendors in the data and generative AI vertical at AWS. He helps them design innovative, resilient, and cost-effective solutions using AWS services. He holds a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.

Sindhu Jambunathan is a Senior Solutions Architect at AWS, specializing in supporting ISV customers in the data and generative AI vertical to build scalable, reliable, secure, and cost-effective solutions on AWS. With over 13 years of industry experience, she joined AWS in May 2021 after a successful tenure as a Senior Software Engineer at Microsoft. Sindhu’s diverse background includes engineering roles at Qualcomm and Rockwell Collins, complemented by a Master’s of Science in Computer Engineering from the University of Florida. Her technical expertise is balanced by a passion for culinary exploration, travel, and outdoor activities.

Mohammad Tahsin is an AI/ML Specialist Solutions Architect at Amazon Web Services. He lives for staying up to date with the latest technologies in AI/ML and helping guide customers to deploy bespoke solutions on AWS. Outside of work, he loves all things gaming, digital art, and cooking.

Read More

How Vidmob is using generative AI to transform its creative data landscape

How Vidmob is using generative AI to transform its creative data landscape

This post was co-written with Mickey Alon from Vidmob.

Generative artificial intelligence (AI) can be vital for marketing because it enables the creation of personalized content and optimizes ad targeting with predictive analytics. Specifically, such data analysis can result in predicting trends and public sentiment while also personalizing customer journeys, ultimately leading to more effective marketing and driving business. For example, insights from creative data (advertising analytics) using campaign performance can not only uncover which creative works best but also help you understand the reasons behind its success.

In this post, we illustrate how Vidmob, a creative data company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to uncover meaningful insights at scale within creative data using Amazon Bedrock. The collaboration involved the following steps:

  • Use natural language to analyze and generate insights on performance data through different channels (such as TikTok, Meta, and Pinterest)
  • Generate research information for context such as the value proposition, competitive differentiators, and brand identity of a specific client

Vidmob background

Vidmob is the Creative Data company that uses creative analytics and scoring software to make creative and media decisions for marketers and agencies as they strive to drive business results through improved creative effectiveness. Vidmob’s influence lies in its partnerships and native integrations across the digital ad landscape, its dozens of proprietary models, and operating a reinforcement learning with human feedback (RLHF) model for creativity.

Vidmob’s AI journey

Vidmob uses AI to not only enhance its creative data capabilities, but also pioneer advancements in the field of RLHF for creativity. By seamlessly integrating AI models such as Amazon Rekognition into its innovative stack, Vidmob has continually evolved to stay at the forefront of the creative data landscape.

This journey extends beyond the mere adoption of AI; Vidmob has consistently recognized the importance of curating a differentiated dataset to maximize the potential of its AI-driven solutions. Understanding the intrinsic value of data network effects, Vidmob constructed a product and operational system architecture designed to be the industry’s most comprehensive RLHF solution for marketing creatives.

Use case overview

Vidmob aims to revolutionize its analytics landscape with generative AI. The central goal is to empower customers to directly query and analyze their creative performance data through a chat interface. Over the past 8 years, Vidmob has amassed a wealth of data that provides deep insights into the value of creatives in ad campaigns and strategies for enhancing performance. Vidmob envisions making it effortless for customers to utilize this data to generate insights and make informed decisions about their creative strategies.

Currently, Vidmob and its customers rely on creative strategists to address these questions at the brand level, complemented by machine-generated normative insights at the industry or environment level. This process can take creative strategists many hours. To enhance the customer experience, Vidmob decided to partner with AWS GenAIIC to deliver these insights more quickly and automatically.

Vidmob partnered with AWS GenAIIC to analyze ad data to help Vidmob creative strategists understand the performance of customer ads. Vidmob’s ad data consists of tags created from Amazon Rekognition and other internal models. The chatbot built by AWS GenAIIC would take in this tag data and retrieve insights.

The following were key success criteria for the collaboration:

  • Analyze and generate insights in a natural language based on performance data and other metadata
  • Generate client company information to be used as initial research for a creative
  • Create a scalable solution using Amazon Bedrock that can be integrated with Vidmob’s performance data

However, there were a few challenges in achieving these goals:

  • Large language models (LLMs) are limited in the volume of data they can analyze to generate insights without hallucination. They are designed to predict and summarize text-based information and are less optimized for computing creative data at a terabyte scale.
  • LLMs don’t have straightforward automatic evaluation techniques. Therefore, human evaluation was required for insights generated by the LLM.
  • There are 50–100 creative questions that creative strategists would normally analyze, which means an asynchronous mechanism was needed that would queue up these prompts, aggregate them, and provide the top-most meaningful insights.

Solution overview

The AWS team worked with Vidmob to build a serverless architecture for handling incoming questions from customers. They used the following services in the solution:

The following diagram illustrates the high-level workflow of the current solution:

Architecture Diagram for Vidmob

The workflow consists of the following steps:

  1. The user navigates to Vidmob and asks a creative-related query.
  2. Amazon DynamoDB stores the query and the session ID, which are then passed to a Lambda function as a DynamoDB event notification.
  3. The Lambda function calls Amazon Bedrock, obtains an output from the user query, and sends it back to the Streamlit application for the user to view.
  4. The Lambda function updates the status after it receives the completed output from Amazon Bedrock.
In the following sections, we explore the details of the workflow, the dataset, and the results Vidmob achieved.

Workflow details

After the user inputs a query, a prompt is automatically created and fed into a QA chatbot, which returns a response. The main aspects of the LLM prompt (assembled as in the sketch after this list) include:

  • Client description – Background information about the client, including the value proposition, brand identity, and competitive differentiators, generated by Anthropic’s Claude v2 on Amazon Bedrock.
  • Aperture – Important aspects to take into account for a user question. For example, for a branding question such as “What is the best way to incorporate branding for my Meta creative,” the aperture might identify elements such as a logo, tagline, and sincere tone.
  • Context – The filtered dataset of ad performance referenced by the QA bot.
  • Question – The user query.
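The following is a minimal sketch (assumed structure, not the exact production prompt) of how these four components could be combined into a single prompt before it is sent to the model:

def build_prompt(client_description: str, aperture: str, context: str, question: str) -> str:
    # Assemble the prompt sections in the order described above
    return (
        f"Client background:\n{client_description}\n\n"
        f"Aspects to consider for this question:\n{aperture}\n\n"
        f"Ad performance data:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with the elements that best address the question, why each element matters, "
        "and its corresponding percent lift."
    )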

The following screenshot shows the UI where the user can input the client and their ad-related question.

On the backend, a router determines the context (the ad-related dataset) that is used as a reference to answer the question. This depends on the question and the client, and is handled in the following steps (a simplified sketch follows the list):

  1. Determine whether the question should reference the objective dataset (general for an entire channel like TikTok, Meta, Pinterest) or placement dataset (specific sub-channels like Facebook Reels). For example, “What is the best way to incorporate branding in my Meta creative” is objective-based, whereas “What is the best way to incorporate branding for Facebook News Feed” is placement-based because it references a specific part of the Meta creative.
  2. Obtain the corresponding objective dataset for the client if the query is objective-based. If it’s placement-based, first filter the placement dataset to only columns that are relevant to the query and then pass in the resulting dataset.
  3. Pass the completed prompt to Anthropic’s Claude v2 model on Amazon Bedrock and display the outputs.
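The routing decision in step 1 can be sketched as a simple keyword check. This is an illustrative simplification rather than Vidmob’s production router, and the sub-channel keywords and placeholder datasets are assumed examples.

PLACEMENT_KEYWORDS = ["reels", "news feed", "instream", "stories"]  # assumed sub-channels

def route_context(question: str, objective_data, placement_data):
    # Placement-based questions reference a specific sub-channel; otherwise fall back
    # to the channel-level (objective) dataset
    q = question.lower()
    if any(keyword in q for keyword in PLACEMENT_KEYWORDS):
        return placement_data
    return objective_data

context_data = route_context(
    "What are insights for Facebook News Feed?",
    objective_data="objective dataset placeholder",
    placement_data="placement dataset placeholder",
)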

The outputs are displayed as shown in the following screenshot.

Specifically, the outputs include the elements that best answer the question, why this element may be important, and its corresponding percent lift for the creative.

Dataset

The dataset includes a set of ad-related data corresponding to a specific client. Specifically, Vidmob analyzes the client ad campaigns and extracts information related to the ads using various machine learning (ML) models and AWS services. The information about each campaign is collated into a single dataset (creative data). It notes how each element of a given creative performs under a certain metric; for example, how the CTA affects the view-through rate of the ad. The following two datasets were utilized:

  • Creative strategist filtered performance data for each question – The dataset given was filtered by Vidmob creative strategists for their analysis. The filtered datasets include an element (such as logo or bright colors for a creative) as well as its corresponding average, percent lift (of a particular metric such as view-through rate), creative count, and impressions for each sub-channel (Facebook Explore, Reels, and so on).
  • Unfiltered raw datasets – This dataset included objective-based and placement-based data for each client.

As we discussed earlier, there are two types of datasets for a particular client: objective-based and placement-based data. Objective data is used for answering generic user queries about ads for channels such as TikTok, Meta, or Pinterest, whereas placement data is used for answering specific questions about ads for sub-channels within Meta such as Facebook Reels, Instream, and News Feed. Therefore, questions such as “What are creative insights in my Meta creative” are more general and therefore reference the objective data, and questions such as “What are insights for Facebook News Feed” reference the News Feed statistics in the placement data.

The objective dataset includes elements and their corresponding average percent lift, creative count, p-values, and many more for an entire channel, whereas placement data includes these same statistics for each sub-channel.

Results

A set of questions were evaluated by the strategists for Vidmob, primarily for the following metrics:

  • Accuracy – How correct the overall answer is with what you expect to be
  • Relevancy – How relevant the LLM-generated output is to the question (or, in this case, to the background information for the client)
  • Clarity – How clear and understandable the outputs from the performance data and their insights are, and whether the LLM is fabricating information

The client background information for the prompt and a set of questions for the filtered and unfiltered data were evaluated.

Overall, the client background generated by Anthropic’s Claude captured the value proposition, brand identity, and competitive differentiators for a given client. The accuracy and clarity were perfect, and relevancy was perfect for most samples. Perfect is defined as a score of 9/10 or 10/10 on the specific metric from subject matter experts.

When answering a set of questions, the responses generally had high clarity, and AWS GenAIIC was able to incrementally improve the QA chatbot’s accuracy and relevancy by 10% and 5%, respectively, by adding extra tag information to filter the data. Overall, Vidmob expects to reduce the time to generate insights for creative campaigns from hours to minutes.

Conclusion

In this post, we shared how the AWS GenAIIC team used Anthropic’s Claude on Amazon Bedrock to extract and summarize insights from Vidmob’s performance data using zero-shot prompt engineering. With these services, creative strategists were able to understand client information through inherent knowledge of the LLM as well as answer user queries through added client background information and tag types such as messaging and branding. Such insights can be retrieved at scale and utilized for enhancing effective ad campaigns.

The success of this engagement allowed Vidmob an opportunity to use generative AI to create more valuable insights for customers in reduced time, allowing for a more scalable solution.

This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases today. If you’re interested in working with the AWS Generative AI Innovation Center, reach out to AWS GenAIIC.


About the Authors

Mickey Alon is a serial entrepreneur and co-author of ‘Mastering Product-Led Growth.’ He co-founded Gainsight PX (Vista) and Insightera (Adobe), a real-time personalization engine. He previously led the global product development team at Marketo (Adobe) and currently serves as the CPTO at Vidmob, a leading creative intelligence platform powered by GenAI.

Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using Large Language Models, primarily through Amazon Bedrock and other AWS Cloud services.

Gaurav Rele is a Senior Data Scientist at the Generative AI Innovation Center, where he works with AWS customers across different verticals to accelerate their use of generative AI and AWS Cloud services to solve their business challenges.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

Fine-tune Llama 3 for text generation on Amazon SageMaker JumpStart

Fine-tune Llama 3 for text generation on Amazon SageMaker JumpStart

Generative artificial intelligence (AI) models have become increasingly popular and powerful, enabling a wide range of applications such as text generation, summarization, question answering, and code generation. However, despite their impressive capabilities, these models often struggle with domain-specific tasks or use cases due to their general training data. To address this challenge, fine-tuning these models on specific data is crucial for achieving optimal performance in specialized domains.

In this post, we demonstrate how to fine-tune the recently released Llama 3 models from Meta, specifically the llama-3-8b and llama-3-70b variants, using Amazon SageMaker JumpStart. The fine-tuning process is based on the scripts provided in the llama-recipes repo from Meta, utilizing techniques like PyTorch FSDP, PEFT/LoRA, and Int8 quantization for efficient fine-tuning of these large models on domain-specific datasets.

By fine-tuning the Meta Llama 3 models with SageMaker JumpStart, you can harness their improved reasoning, code generation, and instruction following capabilities tailored to your specific use cases.

Meta Llama 3 overview

Meta Llama 3 comes in two parameter sizes—8B and 70B with an 8,000-token context length—that can support a broad range of use cases with improvements in reasoning, code generation, and instruction following. Meta Llama 3 uses a decoder-only transformer architecture and a new tokenizer with a 128,000-token vocabulary that provides improved model performance. In addition, Meta improved post-training procedures that substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. You can now derive the combined advantages of Meta Llama 3 performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines and Amazon SageMaker Debugger. In addition, the model will be deployed in an AWS secure environment under your virtual private cloud (VPC) controls, helping provide data security.

SageMaker JumpStart

SageMaker JumpStart is a powerful feature within the SageMaker machine learning (ML) environment that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment.

Prerequisites

To try out this solution using SageMaker JumpStart, you’ll need the following prerequisites:

Fine-tune Meta Llama 3 models

In this section, we discuss the steps to fine-tune Meta Llama 3 models. We’ll cover two approaches: using the SageMaker Studio UI for a no-code solution, and utilizing the SageMaker Python SDK.

No-code fine-tuning through the SageMaker Studio UI

SageMaker JumpStart provides access to publicly available and proprietary foundation models from third-party and proprietary providers. Data scientists and developers can quickly prototype and experiment with various ML use cases, accelerating the development and deployment of ML applications. It helps reduce the time and effort required to build ML models from scratch, allowing teams to focus on fine-tuning and customizing the models for their specific use cases. These models are released under different licenses designated by their respective sources. It’s essential to review and adhere to the applicable license terms before downloading or using these models to make sure they’re suitable for your intended use case.

You can access the Meta Llama 3 FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we cover how to discover these models in SageMaker Studio.

SageMaker Studio is an IDE that offers a web-based visual interface for performing the ML development steps, from data preparation to model building, training, and deployment. For instructions on getting started and setting up SageMaker Studio, refer to Amazon SageMaker Studio.

When you’re in SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

SageMaker Studio Home

In the JumpStart view, you’re presented with the list of public models offered by SageMaker. You can explore other models from other providers in this view. To start using the Meta Llama 3 models, under Providers, choose Meta.

Public Models

You’re presented with a list of the models available. Choose the Meta-Llama-3-8B-Instruct model.

Meta Model Provider

Here you can view the model details, as well as train, deploy, optimize, and evaluate the model. For this demonstration, we choose Train.

Meta Llama 3 8B Instruct Details

On this page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can configure deployment configuration, hyperparameters, and security settings for fine-tuning. Choose Submit to start the training job on a SageMaker ML instance.

Fine-tune model

Deploy the model

After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

Finetuning Finished Screen

You can also deploy the model from this view. You can configure endpoint settings such as the instance type, number of instances, and endpoint name. You will need to accept the End User License Agreement (EULA) before you can deploy the model.

Deploy Model Screen

Fine-tune using the SageMaker Python SDK

You can also fine-tune Meta Llama 3 models using the SageMaker Python SDK. A sample notebook with the full instructions can be found on GitHub. The following code example demonstrates how to fine-tune the Meta Llama 3 8B model:

import os
import boto3
from sagemaker.session import Session
from sagemaker.jumpstart.estimator import JumpStartEstimator

# To fine-tune the Llama 3 70B model available on JumpStart, please change model_id to `meta-textgeneration-llama-3-70b`.
model_id = "meta-textgeneration-llama-3-8b"
accept_eula = "true"
estimator = JumpStartEstimator(
    model_id=model_id, environment={"accept_eula": accept_eula}
)

# By default, instruction tuning is set to false. To use an instruction tuning dataset, set instruction_tuned="True"
estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
# train_data_location is the S3 URI of the training dataset, defined earlier in the notebook
estimator.fit({"training": train_data_location})

The code sets up a SageMaker JumpStart estimator for fine-tuning the Meta Llama 3 large language model (LLM) on a custom training dataset. It configures the estimator with the desired model ID, accepts the EULA, enables instruction tuning by setting instruction_tuned="True", sets the number of training epochs, and initiates the fine-tuning process.

When the fine-tuning job is complete, you can deploy the fine-tuned model directly from the estimator, as shown in the following code. As part of the deploy settings, you can define the instance type you want to deploy the model on. For the full list of deployment parameters, refer to the deploy parameters in the SageMaker SDK documentation.

# for Llama 3 70B models, you can deploy to ml.g5.12xlarge instance type or it will default to ml.p4d.24xlarge
finetuned_predictor = estimator.deploy(instance_type='ml.g5.12xlarge')

After the endpoint is up and running, you can perform an inference request against it using the predictor object as follows:

prompt = "Your prompt goes here"
payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256},
    }
response = finetuned_predictor.predict(payload)
response.get('generated_text')

For the full list of predictor parameters, refer to the predictor object in the SageMaker SDK documentation.

Fine-tuning technique

Language models such as Meta Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly higher CUDA memory. Furthermore, training these models can be very slow due to their size. Therefore, for efficient fine-tuning, we use the following optimizations (a configuration sketch follows the list):

  • Low-Rank Adaptation (LoRA) – This is a type of parameter-efficient fine-tuning (PEFT) for large models. With this approach, we freeze the whole model and add only a small set of adjustable parameters or layers to it. For instance, instead of training all 8 billion parameters of Llama 3 8B, we can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because we only need to store gradients, optimizer states, and other training-related information for that 1% of the parameters, and it also reduces both training time and cost. For more details on this method, refer to LoRA: Low-Rank Adaptation of Large Language Models. A conceptual sketch follows this list.
  • Int8 quantization – Even with optimizations such as LoRA, models like Meta Llama 70B require significant computational resources for training. To reduce the memory footprint during training, we can employ Int8 quantization. Quantization reduces the precision of the floating-point data types, which decreases the memory required to store model weights but can potentially degrade performance due to loss of information. In practice, although Int8 quantization uses only a quarter of the bits of full-precision training, it doesn’t incur significant degradation in performance: instead of simply dropping bits, it rounds the data from one type to another, preserving the essential information while optimizing memory usage. To learn about Int8 quantization, refer to LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
  • Fully Sharded Data Parallel (FSDP) – This is a type of data parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance by overlapping communication and computation during training.
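To make the LoRA idea more concrete, the following is a minimal conceptual sketch in PyTorch of wrapping a frozen linear layer with trainable low-rank adapters. This is only an illustration of the technique, not the code that SageMaker JumpStart runs internally, and the class and attribute names are invented for this example:

import torch.nn as nn

class LoRALinear(nn.Module):
    """Conceptual LoRA wrapper: output = W x + (alpha / r) * B(A(x)), with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)      # low-rank down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)     # low-rank up-projection B
        nn.init.zeros_(self.lora_b.weight)                            # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

Only lora_a and lora_b receive gradients, so the gradient and optimizer-state memory shrinks to roughly the size of the adapters. The lora_r and lora_alpha hyperparameters listed later in this post correspond to r and alpha in this sketch.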

The following table compares different methods with the two Meta Llama 3 models.

Model | Default Instance Type | Supported Instance Types with Default Configuration | Default Setting | LoRA + FSDP | LoRA + No FSDP | Int8 Quantization + LoRA + No FSDP
Llama 3 8B | ml.g5.12xlarge | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge | LoRA + FSDP | Yes | Yes | Yes
Llama 3 70B | ml.g5.48xlarge | ml.g5.48xlarge | Int8 + LoRA + No FSDP | No | No | Yes

Fine-tuning of Meta Llama models is based on scripts provided by the GitHub repo.

Training dataset format

SageMaker JumpStart currently supports datasets in both domain adaptation format and instruction tuning format. In this section, we specify an example dataset in both formats. For more details, refer to the Dataset formatting section in the appendix.

Domain adaptation format

The Meta Llama 3 text generation model can be fine-tuned on domain-specific datasets, enabling it to generate relevant text and tackle various natural language processing (NLP) tasks within a particular domain using few-shot prompting. This fine-tuning process involves providing the model with a dataset specific to the target domain. The dataset can be in various formats, such as CSV, JSON, or TXT files. For example, if you want to fine-tune the model for the domain of financial reports and filings, you could provide it with a text file containing SEC filings from a company like Amazon. The following is an excerpt from such a filing:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions.
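To use such a dataset for training, you save the text to a file and stage it in Amazon S3, then pass the S3 location to the estimator as the training channel. The following is a minimal sketch, assuming the excerpt is saved locally as sec_filing.txt; the file name and S3 prefix are placeholders:

import sagemaker
from sagemaker.s3 import S3Uploader

session = sagemaker.Session()
# Upload the local domain adaptation file (placeholder name) to the session's default bucket
train_data_location = S3Uploader.upload(
    local_path="sec_filing.txt",
    desired_s3_uri=f"s3://{session.default_bucket()}/llama3-domain-adaptation/train",
)
# The estimator then trains on this channel: estimator.fit({"training": train_data_location})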

Instruction tuning format

In instruction fine-tuning, the model is fine-tuned on a set of NLP tasks described using instructions. This helps improve the model’s performance on unseen tasks with zero-shot prompts. With the instruction tuning dataset format, you provide a template.json file describing the input and output formats and a train.jsonl file with one training example per line.

The template.json file always has the following JSON format:

{
  "prompt": "<<Prompt goes here along with question or context or instruction>>",
  "completion": "<<completion goes here depending on the activity, for ex: answer for Q&A or summary for Summarization task>>"
}

For instance, the following shows the template.json file and an example train.jsonl record for the Dolly and Dialogsum datasets.

Dolly (use case: question answering)

template.json:

{
  "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
  "completion": " {response}"
}

Example record from train.jsonl:

{
  "instruction": "Who painted the Two Monkeys",
  "context": "Two Monkeys or Two Chained Monkeys is a 1562 painting by Dutch and Flemish Renaissance artist Pieter Bruegel the Elder. The work is now in the Gemäldegalerie (Painting Gallery) of the Berlin State Museums.",
  "response": "The two Monkeys or Two Chained Monkeys is a 1562 painting by Dutch and Flemish Renaissance artist Pieter Bruegel the Elder. The work is now in the Gemaeldegalerie (Painting Gallery) of the Berlin State Museums."
}

Dialogsum (use case: text summarization)

template.json:

{
  "prompt": "Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.\n\n### Instruction:\n{dialogue}\n\n",
  "completion": " {summary}"
}

Example record from train.jsonl:

{
  "dialogue": "#Person1#: Where do these flower vases come from? \n#Person2#: They are made a town nearby. The flower vases are made of porcelain and covered with tiny bamboo sticks. \n#Person1#: Are they breakable? \n#Person2#: No. They are not only ornmamental, but also useful. \n#Person1#: No wonder it’s so expensive. ",
  "summary": "#Person2# explains the flower vases’ materials and advantages and #Person1# understands why they’re expensive."
}
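As an illustration, the following sketch writes a Dolly-style template.json and a one-line train.jsonl to the working directory and uploads both files under the same Amazon S3 prefix, which you then pass as the training channel. The record contents and the S3 prefix are examples only; adapt them to your own dataset:

import json

import sagemaker
from sagemaker.s3 import S3Uploader

template = {
    "prompt": (
        "Below is an instruction that describes a task, paired with an input that provides "
        "further context. Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
    ),
    "completion": " {response}",
}
example = {
    "instruction": "Who painted the Two Monkeys",
    "context": "Two Monkeys is a 1562 painting by Pieter Bruegel the Elder.",
    "response": "Pieter Bruegel the Elder painted Two Monkeys in 1562.",
}

with open("template.json", "w") as f:
    json.dump(template, f)
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")  # one JSON record per line

# Place both files under the same S3 prefix, which becomes the training channel
session = sagemaker.Session()
train_data_location = f"s3://{session.default_bucket()}/llama3-instruction-tuning/train"
S3Uploader.upload("template.json", train_data_location)
S3Uploader.upload("train.jsonl", train_data_location)
# estimator.fit({"training": train_data_location})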

Supported hyperparameters for training

The fine-tuning process for Meta Llama 3 models allows you to customize various hyperparameters, each of which can influence factors such as memory consumption, training speed, and the performance of the fine-tuned model. At the time of writing this post, the following are the default hyperparameter values. For the most up-to-date information, refer to the SageMaker Studio console, because these values may be subject to change. A usage sketch follows the list.

  • epoch – The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default is 5.
  • learning_rate – The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default is 0.0001.
  • lora_r – Lora R dimension. Must be a positive integer. Default is 8.
  • lora_alpha – Lora Alpha. Must be a positive integer. Default is 32.
  • target_modules – Target modules for LoRA fine-tuning. You can specify a subset of ['q_proj','v_proj','k_proj','o_proj','gate_proj','up_proj','down_proj'] modules as a comma-separated string without any spaces. Default is q_proj,v_proj.
  • lora_dropout – Lora Dropout. Must be a positive float between 0 and 1. Default is 0.05.
  • instruction_tuned – Whether to instruction-train the model or not. At most one of instruction_tuned and chat_dataset can be True. Must be True or False. Default is False.
  • chat_dataset – If True, dataset is assumed to be in chat format. At most one of instruction_tuned and chat_dataset can be True. Default is False.
  • add_input_output_demarcation_key – For an instruction tuned dataset, if this is True, a demarcation key ("### Response:\n") is added between the prompt and completion before training. Default is True.
  • per_device_train_batch_size – The batch size per GPU core/CPU for training. Default is 1.
  • per_device_eval_batch_size – The batch size per GPU core/CPU for evaluation. Default is 1.
  • max_train_samples – For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of the training samples. Must be a positive integer or -1. Default is -1.
  • max_val_samples – For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of the validation samples. Must be a positive integer or -1. Default is -1.
  • seed – Random seed that will be set at the beginning of training. Default is 10.
  • max_input_length – Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default is -1.
  • validation_split_ratio – If the validation channel is None, the ratio of the train-validation split from the train data. Must be between 0 and 1. Default is 0.2.
  • train_data_split_seed – If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default is 0.
  • preprocessing_num_workers – The number of processes to use for preprocessing. If None, the main process is used for preprocessing. Default is None.
  • int8_quantization – If True, the model is loaded with 8-bit precision for training. Default for 8B is False. Default for 70B is True.
  • enable_fsdp – If True, training uses FSDP. Default for 8B is True. Default for 70B is False.
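As an illustration, the following sketch overrides a few of these hyperparameters on the estimator before launching training. The values are examples rather than recommendations, and JumpStart expects hyperparameter values to be passed as strings:

estimator.set_hyperparameters(
    instruction_tuned="True",   # train on the instruction tuning dataset format
    epoch="3",                  # number of passes over the training data
    learning_rate="0.0001",
    lora_r="8",
    lora_alpha="32",
    max_input_length="1024",    # truncate tokenized inputs to this length
    int8_quantization="False",
    enable_fsdp="True",
)
estimator.fit({"training": train_data_location})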

Instance types and compatible hyperparameters

The memory requirement during fine-tuning may vary based on several factors:

  • Model type – The 8B model has the smallest GPU memory requirement and the 70B model has the largest memory requirement
  • Max input length – A higher value of input length leads to processing more tokens at a time and as such requires more CUDA memory
  • Batch size – A larger batch size requires larger CUDA memory and therefore requires larger instance types
  • Int8 quantization – If using Int8 quantization, the model is loaded into low precision mode and therefore requires less CUDA memory

To help you get started, we provide a set of combinations of different instance types, hyperparameters, and model types that can be successfully fine-tuned. You can select a configuration as per your requirements and availability of instance types. We fine-tune both models on a variety of settings with three epochs on a subset of the Dolly dataset with summarization examples.

8B model

Instance Type | Max Input Length | Per Device Batch Size | Int8 Quantization | Enable FSDP | Time Taken (Minutes)
ml.g4dn.12xlarge | 1024 | 2 | TRUE | FALSE | 202
ml.g4dn.12xlarge | 2048 | 2 | TRUE | FALSE | 192
ml.g4dn.12xlarge | 1024 | 2 | FALSE | TRUE | 98
ml.g4dn.12xlarge | 1024 | 4 | TRUE | FALSE | 200
ml.g5.12xlarge | 2048 | 2 | TRUE | FALSE | 73
ml.g5.12xlarge | 1024 | 2 | TRUE | FALSE | 88
ml.g5.12xlarge | 2048 | 2 | FALSE | TRUE | 24
ml.g5.12xlarge | 1024 | 2 | FALSE | TRUE | 35
ml.g5.12xlarge | 2048 | 4 | TRUE | FALSE | 72
ml.g5.12xlarge | 1024 | 4 | TRUE | FALSE | 83
ml.g5.12xlarge | 1024 | 4 | FALSE | TRUE | 25
ml.g5.12xlarge | 1024 | 8 | TRUE | FALSE | 83
ml.g5.24xlarge | 2048 | 2 | TRUE | FALSE | 73
ml.g5.24xlarge | 1024 | 2 | TRUE | FALSE | 86
ml.g5.24xlarge | 2048 | 2 | FALSE | TRUE | 24
ml.g5.24xlarge | 1024 | 2 | FALSE | TRUE | 35
ml.g5.24xlarge | 2048 | 4 | TRUE | FALSE | 72
ml.g5.24xlarge | 1024 | 4 | TRUE | FALSE | 83
ml.g5.24xlarge | 1024 | 4 | FALSE | TRUE | 25
ml.g5.24xlarge | 1024 | 8 | TRUE | FALSE | 82
ml.g5.48xlarge | 2048 | 2 | TRUE | FALSE | 73
ml.g5.48xlarge | 1024 | 2 | TRUE | FALSE | 87
ml.g5.48xlarge | 2048 | 2 | FALSE | TRUE | 27
ml.g5.48xlarge | 1024 | 2 | FALSE | TRUE | 48
ml.g5.48xlarge | 2048 | 4 | TRUE | FALSE | 71
ml.g5.48xlarge | 1024 | 4 | TRUE | FALSE | 82
ml.g5.48xlarge | 1024 | 4 | FALSE | TRUE | 32
ml.g5.48xlarge | 1024 | 8 | TRUE | FALSE | 81
ml.p3dn.24xlarge | 2048 | 2 | TRUE | FALSE | 104
ml.p3dn.24xlarge | 1024 | 2 | TRUE | FALSE | 114

70B model

Instance Type | Max Input Length | Per Device Batch Size | Int8 Quantization | Enable FSDP | Time Taken (Minutes)
ml.g5.48xlarge | 1024 | 1 | TRUE | FALSE | 461
ml.g5.48xlarge | 2048 | 1 | TRUE | FALSE | 418
ml.g5.48xlarge | 1024 | 2 | TRUE | FALSE | 423

Recommendations on instance types and hyperparameters

When it comes to the accuracy of the fine-tuned model, keep in mind the following:

  • Larger models such as 70B provide better performance than 8B
  • Performance without Int8 quantization is better than performance with Int8 quantization

Note the following training time and CUDA memory requirements:

  • Setting int8_quantization=True decreases the memory requirement, but it can slow down training.
  • Decreasing per_device_train_batch_size and max_input_length reduces the memory requirement and therefore can be run on smaller instances. However, setting very low values may increase the training time.
  • If you’re not using Int8 quantization (int8_quantization=False), use FSDP (enable_fsdp=True) for faster and more efficient training.

When choosing the instance type, consider the following:

  • At the time of writing this post, the G5 instances provided the most efficient training among the supported instance types. However, because AWS regularly updates and introduces new instance types, we recommend that you validate the recommended instance type for Meta Llama 3 fine-tuning in the SageMaker documentation or SageMaker console before proceeding.
  • Training time largely depends on the number of GPUs and the CUDA memory available. Therefore, training time on instances with the same number of GPUs (for example, ml.g5.2xlarge and ml.g5.4xlarge) is roughly the same, and you can use the more cost-effective instance for training (ml.g5.2xlarge).

To learn about the cost of training per instance, refer to Amazon EC2 G5 Instances.

If your dataset is in instruction tuning format, where each sample consists of an instruction (input) and the desired model response (completion), and these input+completion sequences are short (for example, 50–100 words), using a high value for max_input_length can lead to poor performance. This is because the model may struggle to focus on the relevant information when dealing with a large number of padding tokens, and it can also lead to inefficient use of computational resources. The default value of -1 corresponds to a max_input_length of 1024 for Llama models. We recommend setting max_input_length to a smaller value (for example, 200–400) when working with datasets containing shorter input+completion sequences to mitigate these issues and potentially improve the model’s performance and efficiency.
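For example, if your instruction tuning samples are short, you might lower the limit before launching the training job; the value here is illustrative, not a recommendation for every dataset:

estimator.set_hyperparameters(instruction_tuned="True", max_input_length="300")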

Lastly, due to the high demand for G5 instances, you may experience unavailability of these instances in your AWS Region with the error “CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.” If you experience this error, retry the training job or try a different Region.

Issues when fine-tuning large models

In this section, we discuss two issues when fine-tuning very large models.

Disable output compression

By default, the output of a training job is a trained model that is compressed in a .tar.gz format before it’s uploaded to Amazon S3. However, for large models like the 70B model, this compression step can be time-consuming, taking more than 4 hours. To mitigate this delay, it’s recommended to use the disable_output_compression feature supported by the SageMaker training environment. When disable_output_compression is set to True, the model is uploaded without any compression, which can significantly reduce the time taken for large model artifacts to be uploaded to Amazon S3. The uncompressed model can then be used directly for deployment or further processing. The following code shows how to pass this parameter into the SageMaker JumpStart estimator:

estimator = JumpStartEstimator(
    model_id=model_id, environment={"accept_eula": "true"}, disable_output_compression=True
)

SageMaker Studio kernel timeout issue

Due to the size of the Meta Llama 3 70B model, the training job may take several hours to complete. The SageMaker Studio kernel is only used to initiate the training job, and its status doesn’t affect the ongoing training process. After the training job starts, the compute resources allocated for the job will continue running the training process, regardless of whether the SageMaker Studio kernel remains active or times out. If the kernel times out during the lengthy training process, you can still deploy the endpoint after training is complete using the training job name with the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator
training_job_name = <<<INSERT_TRAINING_JOB_NAME>>>

attached_estimator = JumpStartEstimator.attach(training_job_name, model_id)
attached_estimator.logs()
predictor = attached_estimator.deploy()

To find the training job name, navigate to the SageMaker console and under Training in the navigation pane, choose Training jobs. Identify the training job name and substitute it in the preceding code.
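Alternatively, you can look up recent training job names programmatically with the AWS SDK for Python (Boto3), as in the following sketch:

import boto3

sm_client = boto3.client("sagemaker")
# List the five most recently created training jobs and their statuses
response = sm_client.list_training_jobs(
    SortBy="CreationTime", SortOrder="Descending", MaxResults=5
)
for job in response["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])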

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model and its endpoint with the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we discussed fine-tuning Meta Llama 3 models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. In addition, we outlined recommendations for optimized training based on various tests we carried out.

The results for fine-tuning the two models over the two datasets are shown in the appendix at the end of this post. As we can see from these results, fine-tuning improves summarization compared to non-fine-tuned models.

As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.


About the Authors

Ben Friebe is a Senior Solutions Architect at Amazon Web Services, based in Brisbane, Australia. He likes computers.

Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on the AWS platform. He is specialized in architecting AI/ML and generative AI services at AWS. Pavan is a published author for the book “Getting Started with V Programming.” In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.

Khush Patel is a Solutions Architect at Amazon Web Services based out of Houston, Texas. He’s passionate about working with customers to deliver business value using technology. He has a multitude of experience with customers working with machine learning and generative AI workloads. In his free time, Khush enjoys watching sports and reading.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.


Appendix

This appendix provides additional information about performance benchmarking and dataset formatting.

Performance benchmarking

In this section, we provide results for fine-tuning the two Meta Llama 3 models (8B and 70B) on two different datasets: Dolly and Dialogsum. For the Dolly dataset, our task is to summarize a paragraph of text, whereas for Dialogsum, we are fine-tuning the model to summarize a discussion between two people. In the following tables, we show the input to the model (prompt and instructions), ground truth (summary), response from the pre-trained Meta Llama 3 model, and response from the fine-tuned Meta Llama 3 model for each of the models. We show inference results for five data points. You can see from the following tables that the summaries improve for both datasets when we fine-tune the models.

Results for fine-tuning the Meta Llama 3 8B text generation model on the Dolly dataset

Inputs | Ground Truth | Response from Non-Fine-Tuned Model | Response from Fine-Tuned Model
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nCan you generate a 2 lines abstract of this text?nn### Input:nIn physics, gravity (from Latin gravitas ‘weight’) is a fundamental interaction which causes mutual attraction between all things with mass or energy. Gravity is, by far, the weakest of the four fundamental interactions, approximately 1038 times weaker than the strong interaction, 1036 times weaker than the electromagnetic force and 1029 times weaker than the weak interaction. As a result, it has no significant influence at the level of subatomic particles. However, gravity is the most significant interaction between objects at the macroscopic scale, and it determines the motion of planets, stars, galaxies, and even light.nnOn Earth, gravity gives weight to physical objects, and the Moon’s gravity is responsible for sublunar tides in the oceans (the corresponding antipodal tide is caused by the inertia of the Earth and Moon orbiting one another). Gravity also has many important biological functions, helping to guide the growth of plants through the process of gravitropism and influencing the circulation of fluids in multicellular organisms. Investigation into the effects of weightlessness has shown that gravity may play a role in immune system function and cell differentiation within the human body.nnThe gravitational attraction between the original gaseous matter in the universe allowed it to coalesce and form stars which eventually condensed into galaxies, so gravity is responsible for many of the large-scale structures in the universe. Gravity has an infinite range, although its effects become weaker as objects get farther away.nnGravity is most accurately described by the general theory of relativity (proposed by Albert Einstein in 1915), which describes gravity not as a force, but as the curvature of spacetime, caused by the uneven distribution of mass, and causing masses to move along geodesic lines. The most extreme example of this curvature of spacetime is a black hole, from which nothing—not even light—can escape once past the black hole’s event horizon However, for most applications, gravity is well approximated by Newton’s law of universal gravitation, which describes gravity as a force causing any two bodies to be attracted toward each other, with magnitude proportional to the product of their masses and inversely proportional to the square of the distance between them.nnnn### Response:n In general physics theory (Newton), Gravity is a mutual attraction between objects based on their respective masses and the square of the distance which separates them. In the relativity theory (Einstein), it’s a curvature of the spacetime generated by the objects themselves. Gravity is a fundamental interaction which causes mutual attraction between all things with mass or energy. It is the weakest of the four fundamental interactions, but it is the most significant interaction between objects at the macroscopic scale, and it determines the motion of planets, stars, galaxies, and even light. Gravity gives weight to physical objects, and the Moon’s gravity is responsible for sublunar tides in the oceans. It also has many important biological functions, helping to guide the growth of plants through the process Gravity is a fundamental interaction which causes mutual attraction between all things with mass or energy. 
Gravity is, by far, the weakest of the four fundamental interactions, approximately 1038 times weaker than the strong interaction, 1036 times weaker than the electromagnetic force and 1029 times weaker than the weak interaction. Gravity is most accurately described by the general theory of relativity (proposed by Albert Einstein in 1915), which describes gravity not as a force, but as the curvature of spacetime
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is ProSavinnn### Input:nProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.nnIt is manufactured by Oxford BioMedica. Results from a Phase I/II clinical trial were published in the Lancet and showed safety, but little efficacy. ProSavin was superseded by AXO-Lenti-PD (OXB-102), an optimized version of the drug.nnnn### Response:n ProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine. ProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.nnIt is manufactured by Oxford BioMedica. Results from a Phase I/II clinical trial were published in the Lancet and showed safety, but little efficacy. ProSavin was superseded by AXO-Lenti-PD (OXB-102), an optimized version of the drug. ProSavin is an experimental drug believed to be of use in the treatment of Parkinson’s disease. It is administered to the striatum in the brain, inducing production of dopamine.nnIt is manufactured by Oxford BioMedica. Results from a Phase I/II clinical trial were published in the Lancet and showed safety, but little efficacy. ProSavin was superseded by AXO-Lenti-PD (OXB-102), an optimized version of the drug.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWho were the Beatles?nn### Input:nThe Beatles were an English rock band, formed in Liverpool in 1960, that comprised John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and popular music’s recognition as an art form. Rooted in skiffle, beat and 1950s rock ‘n’ roll, their sound incorporated elements of classical music and traditional pop in innovative ways; the band also explored music styles ranging from folk and Indian music to psychedelia and hard rock. As pioneers in recording, songwriting and artistic presentation, the Beatles revolutionised many aspects of the music industry and were often publicised as leaders of the era’s youth and sociocultural movements.nnnn### Response:n The Beatles were a rock band comprised of 4 musicians. They were pioneers who had a huge influence on the music industry and wider sociocultural movements. The Beatles were an English rock band, formed in Liverpool in 1960, that comprised John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and popular music’s recognition as an art form. Rooted in skiffle, beat and 1950s rock ‘n’ roll, their sound incorporated elements of classical music and traditional pop in innovative ways; the The Beatles were an English rock band, formed in Liverpool in 1960, that comprised John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and popular music’s recognition as an art form. Rooted in skiffle, beat and 1950s rock ‘n’ roll, their sound incorporated elements of classical music and traditional pop in innovative ways; the
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nSummarize what Alpha Phi is.nn### Input:nAlpha Phi International Women’s Fraternity (ΑΦ, also known as APhi) is an international sorority with 172 active chapters and over 250,000 initiated members.nnFounded at Syracuse University in Syracuse, New York on September 18, 1872, it is the fourth Greek-letter organization founded for women, and the first women’s fraternity founded in the northeast.nnAlpha Phi is a member of the National Panhellenic Conference, the governing council of 26 women’s fraternities. Alpha Phi’s international headquarters are located in Evanston, Illinois.nnnn### Response:n Alpha Phi is one of the first women’s fraternities founded in the Northeast. They have over 172 active chapters at College campuses and over 250,000 initiated members. Alpha Phi was founded at Syracuse University in Syracuse, New York on September 18, 1872. Alpha Phi is an international sorority with 172 active chapters and over 250,000 initiated members. Founded at Syracuse University in Syracuse, New York on September 18, 1872, it is the fourth Greek-letter organization founded for women, and the first women’s fraternity founded in the northeast. Alpha Phi is a member of the National Panhellenic Conference, the governing council of 26 women’s fraternities. Alpha Phi’s international headquarters are located in Evanston, Illinois Alpha Phi is an international sorority with 172 active chapters and over 250,000 initiated members. It was founded at Syracuse University in Syracuse, New York on September 18, 1872, and is the fourth Greek-letter organization founded for women, and the first women’s fraternity founded in the northeast. Alpha Phi is a member of the National Panhellenic Conference, the governing council of 26 women’s fraternities. Alpha Phi’s international headquarters are located in Evanston
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nplease give me a short description about SVB collapsenn### Input:nOn Friday, March 10, 2023, Silicon Valley Bank (SVB) failed after a bank run, marking the second-largest bank failure in United States history and the largest since the 2007–2008 financial crisis. It was one of three March 2023 United States bank failures.nnnn### Response:n It’s due to risk mismanagement of SVB Bank, where the bank bet on long dated treasury bond. When the bank runs happen, SVB don’t have a choice but to sell those long dated bond at a steep discount and incur huge lossses. Silicon Valley Bank (SVB) was a bank that failed in 2023. It was one of the largest banks in the United States, and its failure had a significant impact on the financial markets. The bank was founded in 1983 and was headquartered in Santa Clara, California. It was known for its focus on the technology industry and its support of startups and venture capital firms. However, in 2023, the bank faced a number of challenges, including a decline in the value Silicon Valley Bank (SVB) failed after a bank run, marking the second-largest bank failure in United States history and the largest since the 2007–2008 financial crisis. It was one of three March 2023 United States bank failures.

Results for fine-tuning the Meta Llama 3 8B text generation model on the Dialogsum dataset

Inputs | Ground Truth | Response from Non-Fine-Tuned Model | Response from Fine-Tuned Model
Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Good moring, madam. Can I help you? n#Person2#: Good moring. I am Sunfang. Yesterday Mr. Smith told me to come for a job interview this morning. Is he available now? n#Person1#: Yes, he is waiting for you in Room 123. n#Person2#: Thank you very much. May I come in? n#Person3#: Come in, please. n#Person2#: Good moring, Mr. Smith. I am Sunfang coming for an interview. n#Person3#: Nice to see you. Take a seat, please. n#Person2#: Thank you. n#Person3#: We’ve been looking over your application. I see you’ve got several years experience in secretarial work. What kind of work did you do in your previous job? n#Person2#: I mainly did typing, filing, answering telephone calls and such routine work in the beginning. I started the job as a junior secretary, but within the year I was promoted to senior secretary of Southeast Company. I drafted business correspondence for the general manager. n#Person3#: Then what’s the reason that promoted you to leave Southeast Company? n#Person2#: Last month the company moved to Guangzhou for further development. I was reluctant to relocate because I love Nanjing so much. n#Person3#: Tell me what you know about our company. n#Person2#: Your company is the leading dealer in the import and export of cosmetics. Also, you have a reputation for being a good company to work for and you are in last year’s top 500 Employers list. n#Person3#: Tell me three positive qualities your coworkers would use to describe you and your work style. n#Person2#: First they’d say I am very punctual. They’d also say I am a cheerful person to work with I don’t bring my problems to work. I try to have a positive attitude about work. And thirdly, they’d say I am flexible I accept change easily. n#Person3#: Tell me something about your education. n#Person2#: I graduated from Yangzhou Junior College. Now I am attending evening classes in Nanjing Radio & TV University three times a week. My major is International Trade. I’ve already passed over half of the compulsory courses and five optional courses, and expect to receive a B. A. degree next year. n#Person3#: I see you are an ambitious young lady. I wish you can make greater progress in the future. nnnn### Response:n #Person1# leads Sunfang to Mr. Smith for an interview. Sunfang tells Mr. Smith she was promoted fast in her previous job. She wants to work for Mr. Smith because her old company moved but she didn’t want to relocate, and because She thinks Mr. Smith’s company plays a leading role. Sunfang thinks her colleagues might describe her as being punctual, positive and flexible. Mr. Smith is impressed. #Person1#: Good moring, madam. Can I help you? n#Person2#: Good moring. I am Sunfang. Yesterday Mr. Smith told me to come for a job interview this morning. Is he available now? n#Person1#: Yes, he is waiting for you in Room 123. n#Person2#: Thank you very much. May I come in? n#Person3#: Come in, please. n#Person2#: Good moring, Mr. Smith. I am Sunfang coming for an interview. n#Person3#: Nice to see you. Take a seat, please. n#Person2#: Thank you. n#Person3#: We’ve been looking over your application. I see you’ve got several years experience in secretarial work. What kind of work did you do in your previous job? n#Person2#: I mainly did typing, filing, answering telephone calls and such routine work in the beginning. 
I started the job as a junior secretary, but within the year I was promoted to senior secretary of Southeast Company. I drafted business correspondence for the general manager. n#Person3#: Then what’s the reason that promoted you to leave Southeast Company? n#Person2#: Last month the company moved to Guangzhou for further development. I was reluctant to relocate because I love Nanjing so much. n#Person3#: Tell me what you know about our company. n#Person2#: Your company Sunfang comes for a job interview with Mr. Smith. She tells Mr. Smith about her previous job and her work experience. She also tells Mr. Smith about her company and her education.
Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Well, Jenny, the school year is almost over. We just have two more weeks before exams. What do you plan to do this summer?n#Person2#: I’m going to teach English to some immigrants in the university’s community service program.n#Person1#: That sounds interesting. Don’t you need to speak a foreign language for a job like that?n#Person2#: No, you don’t. You just have to present the language simply and give the students a chance to practice speaking.n#Person1#: Come to think of it, that’s the way I was taught to speak Chinese. But speaking didn’t help me learn to read and write Chinese.n#Person2#: My students won’t want to read and write English, at lease not now. They are more interested in speaking.n#Person1#: You sound very knowledgeable about all this. How do you know so much?n#Person2#: I took a Teaching English as a Second Language course last year when you were in China. I’ve also talked with the experienced teachers quite a lot. I think I would like to be an ESL teacher when I graduate.nnnn### Response:n Jenny plans to teach English to some immigrants in the university’s community service program this summer and she tells #Person1# that students are more interested in speaking. #Person1#: Well, Jenny, the school year is almost over. We just have two more weeks before exams. What do you plan to do this summer?n#Person2#: I’m going to teach English to some immigrants in the university’s community service program.n#Person1#: That sounds interesting. Don’t you need to speak a foreign language for a job like that?n#Person2#: No, you don’t. You just have to present the language simply and give the students a chance to practice speaking.n#Person1#: Come to think of it, that’s the way I was taught to speak Chinese. But speaking didn’t help me learn to read and write Chinese.n#Person2#: My students won’t want to read and write English, at lease not now. They are more interested in speaking.n#Person1#: You sound very knowledgeable about all this. How do you know so much?n#Person2#: I took a Teaching English as a Second Language course last year when you were in China. I’ve also talked with the experienced teachers quite a lot. I think I would like to be an ESL teacher when I graduate.n Jenny tells #Person1# she’s going to teach English to some immigrants in the university’s community service program. Jenny thinks speaking English is the most important part and she’s knowledgeable about it.
Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Can I help you?n#Person2#: Yes, I am looking for a pair of gloves. I think I wear size six.n#Person1#: The gloves are on this counter. Let’s see… How about the blue pair? The color suits you and they are washable, too.n#Person2#: Oh, I’ll try them on. They seem to fit. How much shall I pay you?n#Person1#: They are on sales this week. They have been reduced to twenty-five dollars.n#Person2#: All right. I will take them.n#Person1#: They are very nice.n#Person2#: Thank you.nnnn### Response:n #Person2# buys a pair of blue gloves of size six with #Person1#’s assistance. #Person1#: Can I help you?n#Person2#: Yes, I am looking for a pair of gloves. I think I wear size six.n#Person1#: The gloves are on this counter. Let’s see… How about the blue pair? The color suits you and they are washable, too.n#Person2#: Oh, I’ll try them on. They seem to fit. How much shall I pay you?n#Person1#: They are on sales this week. They have been reduced to twenty-five dollars.n#Person2#: All right. I will take them.n#Person1#: They are very nice.n#Person2#: Thank you.n #Person2# buys a pair of gloves with #Person1#’s help.
Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: How is your business on the Internet, Steven?n#Person2#: Don’t mention it. I lost a bunch of money on it this year. I think it is on the brink of bankruptcy.n#Person1#: I am sorry to hear that. It shows that you should buy companies, not ideas.n#Person2#: I changed my strategy around too. I am investing in good companies now.n#Person1#: Yes, I agree with you. I am going to find a good company and invest in it soon.n#Person2#: It’s a good idea. But please be careful when you invest this time.n#Person1#: Yes, I will. I should learn more from you. After all I am green hand on it.n#Person2#: If you have any question, you can ask me.n#Person1#: OK. My wife sent me a message and asked me to go home just now. I am leaving.n#Person2#: All right. Let’s chat next time. Bye-bye.n#Person1#: Bye.nnnn### Response:n Steven lost much money on Internet business so he’s now investing in good companies. #Person1# wants to invest too but #Person1# is green hand on investment. Steven’s willing to help. #Person1#: How is your business on the Internet, Steven?n#Person2#: Don’t mention it. I lost a bunch of money on it this year. I think it is on the brink of bankruptcy.n#Person1#: I am sorry to hear that. It shows that you should buy companies, not ideas.n#Person2#: I changed my strategy around too. I am investing in good companies now.n#Person1#: Yes, I agree with you. I am going to find a good company and invest in it soon.n#Person2#: It’s a good idea. But please be careful when you invest this time.n#Person1#: Yes, I will. I should learn more from you. After all I am green hand on it.n#Person2#: If you have any question, you can ask me.n#Person1#: OK. My wife sent me a message and asked me to go home just now. I am leaving.n#Person2#: All right. Let’s chat next time. Bye-bye.n#Person1#: Bye.n Steven lost a bunch of money on the Internet and changed his strategy to invest in good companies. #Person1# agrees and will be careful when investing.
Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.nn### Instruction:n#Person1#: Which language do you speak?n#Person2#: Well, I studied French at school, but I’ve forgotten most of it. I learnt some Russian while I was studying there and now I study Spanish in evening classes. Do you speak Japanese at home with your parents.n#Person1#: Yes, I do. I’m learning Chinese now on the internet.n#Person2#: How do you find studying a language on the internet?n#Person1#: It’s not easy, particularly with listening and speaking, because sometimes the connection isn’t too good.n#Person2#: I bet you find the character difficult to write!n#Person1#: Actually, many of the characters are the same or very similar in both Chinese and Japanese, so they’re easy for me.n#Person2#: So, you can read and write much better than you speak and understand other speakers, right?n#Person1#: Yes. How are the Spanish classes going?n#Person2#: In a way, it’s similar to your experience of learning Chinese. Some Spanish words are very similar to English ones. But I need to practise listening to native speakers and saying the words myself.nnnn### Response:n #Person1# is learning Chinese on the internet and finds listening and speaking difficult, but thinks writing is easy. #Person2# thinks #Person2#’s experience of learning Spanish is similar to #Person1#’s. #Person1#: Which language do you speak?n#Person2#: Well, I studied French at school, but I’ve forgotten most of it. I learnt some Russian while I was studying there and now I study Spanish in evening classes. Do you speak Japanese at home with your parents.n#Person1#: Yes, I do. I’m learning Chinese now on the internet.n#Person2#: How do you find studying a language on the internet?n#Person1#: It’s not easy, particularly with listening and speaking, because sometimes the connection isn’t too good.n#Person2#: I bet you find the character difficult to write!n#Person1#: Actually, many of the characters are the same or very similar in both Chinese and Japanese, so they’re easy for me.n#Person2#: So, you can read and write much better than you speak and understand other speakers, right?n#Person1#: Yes. How are the Spanish classes going?n#Person2#: In a way, it’s similar to your experience of learning Chinese. Some Spanish words are very similar to English ones. But I need to practise listening to native speakers and saying the words myself.n #Person1# and #Person2# are talking about the languages they speak. #Person1# is learning Chinese on the internet and #Person2# is learning Spanish. They both find it difficult to learn the language.

Results for fine-tuning the Meta Llama 3 70B text generation model on the Dolly dataset

Inputs | Ground Truth | Response from Non-Fine-Tuned Model | Response from Fine-Tuned Model
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nCan you give me a short summary of the follow paragraph about the history of electronic music?nn### Input:nDuring the 1960s, digital computer music was pioneered, innovation in live electronics took place, and Japanese electronic musical instruments began to influence the music industry. In the early 1970s, Moog synthesizers and drum machines helped popularize synthesized electronic music. The 1970s also saw electronic music begin to have a significant influence on popular music, with the adoption of polyphonic synthesizers, electronic drums, drum machines, and turntables, through the emergence of genres such as disco, krautrock, new wave, synth-pop, hip hop, and EDM. In the early 1980s mass-produced digital synthesizers, such as the Yamaha DX7, became popular, and MIDI (Musical Instrument Digital Interface) was developed. In the same decade, with a greater reliance on synthesizers and the adoption of programmable drum machines, electronic popular music came to the fore. During the 1990s, with the proliferation of increasingly affordable music technology, electronic music production became an established part of popular culture. In Berlin starting in 1989, the Love Parade became the largest street party with over 1 million visitors, inspiring other such popular celebrations of electronic music.nnnn### Response:n With the advent of new electronic instruments in the 60s and 70s, namely synthesizers and drum machines, electronic music quickly gained mainstream popularity through genres like disco and hip hop. As the technology progressed through the 80s and 90s, more affordable instruments led electronic music to be even more widely embraced, cementing its status in pop culture and leading to the rise of large electronic music events around the globe. The history of electronic music can be traced back to the 1960s, when digital computer music was pioneered, innovation in live electronics took place, and Japanese electronic musical instruments began to influence the music industry. In the early 1970s, Moog synthesizers and drum machines helped popularize synthesized electronic music. The 1970s also saw electronic music begin to have a significant influence on popular music, with the adoption of polyphonic synthesizers, electronic drums, drum machines, and Electronic music has been around since the 1960s. In the 1970s, Moog synthesizers and drum machines helped popularize synthesized electronic music. In the 1980s, digital synthesizers became popular and MIDI was developed. In the 1990s, electronic music production became an established part of popular culture.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWho was Nike in Greek Mythology?nn### Input:nIn Greek mythology, Nike (/ˈnaɪki/ (listen); Ancient Greek: Νίκη, lit. ‘victory’, ancient: [nǐː.kɛː], modern: [ˈni.ci]) was a goddess who personified victory in any field including art, music, war, and athletics. She is often portrayed in Greek art as Winged Victory in the motion of flight; however, she can also appear without wings as “Wingless Victory” when she is being portrayed as an attribute of another deity such as Athena. In Greek literature Nike is described as both an attribute and attendant to the gods Zeus and Athena. Nike gained this honored role beside Zeus during the Titanomachy where she was one of the first gods to offer her allegiance to Zeus. At Athens, Nike became a servant to Athena as well as an attribute of her due to the prominent status Athena held in her patron city. The fusion of the two goddesses at Athens has contributed to the ambiguity surrounding Nike’s origins. It is unclear whether she originated from a character trait of the Greek goddess Athena or has always existed as an independent deity. Her origin story in Greek mythology is also slightly ambiguous, with the Theogony claiming Nike to be the daughter of Styx and Pallas while the Homeric Hymns describe Ares as being Nike’s father. Her Roman equivalent was Victoria.nnnn### Response:n Nike is the god of ‘victory’ in Greek mythology, and is often portrayed in art Winged Victory in the motion of flight. Nike was the Greek goddess of victory. She was the daughter of the Titan Pallas and the goddess Styx, and the sister of Kratos (Strength), Bia (Force), and Zelus (Rivalry). She was depicted as a winged goddess, and was often shown holding a wreath of victory or a palm branch. She was also associated with the goddess Athena, and was often depicted as her companion. Nike was worshipped throughout Greece, and her cult was particularly strong Nike was the Greek goddess of victory. She was often portrayed in Greek art as Winged Victory in the motion of flight. Nike gained this honored role beside Zeus during the Titanomachy where she was one of the first gods to offer her allegiance to Zeus. At Athens, Nike became a servant to Athena as well as an attribute of her due to the prominent status Athena held in her patron city.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nGive me some details about the MV Cape Race pleasenn### Input:nThe MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. She underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet.nnnn### Response:n The MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. nnShe underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet. The MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. She underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet. The MV Cape Race (T-AKR-9960) was originally built as the MV Seaspeed America for the Transocean RO-RO Corp. She was sold to various other corporations until she was purchased by the United States Navy in April 1993. She underwent conversion to fit a more active role in and serve a military need better and was placed back in service April 1993. She has served on multiple occasions but is currently in the US Ready Reserve Fleet.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is sweep rowing?nn### Input:nRowing, sometimes called crew in the United States, is the sport of racing boats using oars. It differs from paddling sports in that rowing oars are attached to the boat using oarlocks, while paddles are not connected to the boat. Rowing is divided into two disciplines: sculling and sweep rowing. In sculling, each rower holds two oars—one in each hand, while in sweep rowing each rower holds one oar with both hands. There are several boat classes in which athletes may compete, ranging from single sculls, occupied by one person, to shells with eight rowers and a coxswain, called eights. There are a wide variety of course types and formats of racing, but most elite and championship level racing is conducted on calm water courses 2 kilometres (1.2 mi) long with several lanes marked using buoys.nnnn### Response:n Sweep rowing is a rowing discipline in which rowers hold one oar with both hands. Sweep rowing is a type of rowing where each rower holds one oar with both hands. There are several boat classes in which athletes may compete, ranging from single sculls, occupied by one person, to shells with eight rowers and a coxswain, called eights. There are a wide variety of course types and formats of racing, but most elite and championship level racing is conducted on calm water courses 2 kilometres (1.2 mi) long with several lanes Sweep rowing is a type of rowing where each rower holds one oar with both hands.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is the Pandemis corylanann### Input:nPandemis corylana, the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan.nnnn### Response:n The Pandemis corylana, the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan. Pandemis corylana, the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan. The Pandemis corylana, also known as the chequered fruit-tree tortrix, hazel tortrix moth, filbert tortricid or barred fruit tree moth, is a moth of the family Tortricidae. It is found from northern and central Europe to Siberia, Korea and Japan.

Results for fine-tuning the Meta Llama 3 70B text generation model on the Dialogsum dataset

Inputs | Ground Truth | Response from Non-Fine-Tuned Model | Response from Fine-Tuned Model
Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.

### Instruction:
#Person1#: Would you like some more coffee, James?
#Person2#: Oh, yes, please. Another day’s work. I like the job, but I hate traveling around London.
#Person1#: Oh, come on, James. How do you get to work?
#Person2#: I walk to the station and take the underground. And sometimes I take the taxi if I get up late.
#Person1#: How long does the journey take?
#Person2#: It takes about 45 minutes on a good day. It’s just awful.
#Person1#: Well, how much does it cost?
#Person2#: It’s twenty pounds a week. I think that’s quite a lot.
#Person1#: Oh, James. Stop complaining. And hurry up. You’ll be late.

### Response:
James likes the job but hates the commute because it takes a long time and costs a lot.

#Person1#: Would you like some more coffee, James?
#Person2#: Oh, yes, please. Another day’s work. I like the job, but I hate traveling around London.
#Person1#: Oh, come on, James. How do you get to work?
#Person2#: I walk to the station and take the underground. And sometimes I take the taxi if I get up late.
#Person1#: How long does the journey take?
#Person2#: It takes about 45 minutes on a good day. It’s just awful.
#Person1#: Well, how much does it cost?
#Person2#: It’s twenty pounds a week. I think that’s quite a lot.
#Person1#: Oh, James. Stop complaining. And hurry up. You’ll be late.

#Person1# asked #Person2# if he would like some more coffee. #Person2# said yes and that he liked his job but hated traveling around London. #Person1# asked how he got to work. #Person2# said he walked to the station and took the underground, and sometimes took a taxi if he got up late. #Person1# asked how long the journey took. #Person2# said it took about 45 minutes on a good day and that it was awful. #Person1# asked how much it cost. #Person2# said it was twenty pounds a week and that he thought that was quite a lot.

James complains to #Person1# about the traveling around London. #Person1# asks him to hurry up.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.

### Instruction:
#Person1#: What is so special about tomorrow?
#Person2#: Tomorrow is October 24th, United Nations Day.
#Person1#: Why is October 24th United Nations Day?
#Person2#: That’s the date in 1945 when the United Nations began.
#Person1#: How does your school celebrate United Nations Day?
#Person2#: The UN Day celebration includes the songs and dances of different countries in the United Nations and a traditional dress parade. Our school library displays students’ works of art from different countries and cultures. We also hold a ‘Tastes from Around the World’ food festival.
#Person1#: What’s the purpose of the celebration?
#Person2#: To help us students learn about other cultures and understand and respect other cultures.
#Person1#: It is really worthwhile.

### Response:
#Person1# and #Person2# talk about the United Nations Day. #Person2# tells #Person1# how the school will celebrate the day and the purpose of it.

#Person1#: What is so special about tomorrow?
#Person2#: Tomorrow is October 24th, United Nations Day.
#Person1#: Why is October 24th United Nations Day?
#Person2#: That’s the date in 1945 when the United Nations began.
#Person1#: How does your school celebrate United Nations Day?
#Person2#: The UN Day celebration includes the songs and dances of different countries in the United Nations and a traditional dress parade. Our school library displays students’ works of art from different countries and cultures. We also hold a ‘Tastes from Around the World’ food festival.
#Person1#: What’s the purpose of the celebration?
#Person2#: To help us students learn about other cultures and understand and respect other cultures.
#Person1#: It is really worthwhile.

#Person2# tells #Person1# United Nations Day is October 24th. #Person2#’s school celebrates it to help students understand and respect other cultures.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.

### Instruction:
#Person1#: Have you heard about Alice and David?
#Person2#: No, have they had another fight?
#Person1#: No, they’ve got engaged.
#Person2#: You must be joking. Those two?
#Person1#: Well, my dear, I didn’t believe it either. But I got it straight from the horse’s mouth. David called me this morning.
#Person2#: So when did this all happen?
#Person1#: Last weekend, while they were on their ski trip.
#Person2#: Well, I believe it now, and when are they are getting married?
#Person1#: Next June.
#Person2#: I can hardly believe it.

### Response:
#Person1# and #Person2# can’t believe that Alice and David, who fought with each other, got engaged.

#Person1# told #Person2# that Alice and David had got engaged. #Person2# was surprised and asked when it had happened. #Person1# said that it had happened last weekend, while they were on their ski trip. #Person2# asked when they were getting married. #Person1# said that they were getting married next June.

#Person1# tells #Person2# that Alice and David have got engaged. #Person2# is surprised.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.

### Instruction:
#Person1#: We did a survey on your job performance.
#Person2#: How did I do?
#Person1#: Not so good. Lots of complains.
#Person2#: Really? I thought I was doing well.
#Person1#: Your performance so far is not quite in line with what we expect for our executives. I hope I’Ve made my point.

### Response:
#Person1# tells #Person2# #Person2#’s job performance isn’t satisfying, despite #Person2#’s opposite opinion.

#Person1# told #Person2# that they did a survey on their job performance. #Person2# asked how they did, and #Person1# said that they did not do so good and that there were lots of complains. #Person2# was surprised and said that they thought they were doing well. #Person1# said that their performance so far was not quite in line with what they expect for their executives, and they hoped that they had made their point.

#Person1# tells #Person2# that #Person2#’s job performance is not good.

Below is a Instruction that holds conversation which describes discussion between two people.Write a response that appropriately summarizes the conversation.

### Instruction:
#Person1#: Hi, Francis, morning! How are you doing today?
#Person2#: I had a bad start of the day. I was stuck in a bad traffic jam on my way and I missed my morning meeting.
#Person1#: Indeed, I agree that is not a good start. How about the meeting? You missed anything important?
#Person2#: It is ok. My boss briefs me later on.
#Person1#: Was he unhappy with you not attending the meeting?
#Person2#: Err, kind of. You know this traffic jam is really bad with so many new cars hitting the streets these days.

### Response:
Francis was stuck in a traffic jam and missed the morning meeting. His boss briefs him.

#Person1#: Hi, Francis, morning! How are you doing today?
#Person2#: I had a bad start of the day. I was stuck in a bad traffic jam on my way and I missed my morning meeting.
#Person1#: Indeed, I agree that is not a good start. How about the meeting? You missed anything important?
#Person2#: It is ok. My boss briefs me later on.
#Person1#: Was he unhappy with you not attending the meeting?
#Person2#: Err, kind of. You know this traffic jam is really bad with so many new cars hitting the streets these days.

#Person1# asked #Person2# how he was doing today. #Person2# said he had a bad start of the day because he was stuck in a bad traffic jam on his way and he missed his morning meeting. #Person1# agreed that it was not a good start and asked if #Person2# missed anything important in the meeting. #Person2# said it was ok because his boss would brief him later on. #Person1# asked if his boss was unhappy with him not attending the meeting. #Person2# said he was kind of unhappy because the traffic jam was really bad with so many new cars hitting the streets these days.

Francis tells #Person1# he missed the meeting because of the traffic jam.

Dataset formatting

We currently offer two types of fine-tuning: instruction fine-tuning and domain adaptation fine-tuning. You can select either training method by specifying the parameter instruction_tuned as True or False.
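For example, with the SageMaker Python SDK, this choice is passed as a hyperparameter on a JumpStart estimator. The following is a minimal sketch; the model ID, epoch count, and S3 path are placeholders to be replaced with your own values:

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Choose instruction fine-tuning ("True") or domain adaptation fine-tuning ("False")
estimator = JumpStartEstimator(
    model_id="<jumpstart-text-generation-model-id>",  # placeholder model ID
    hyperparameters={"instruction_tuned": "True", "epoch": "3"},
)

# The training channel points to the formatted dataset in Amazon S3
estimator.fit({"training": "s3://<your-bucket>/<path-to-training-data>/"})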

Domain adaptation format

The text generation model can be fine-tuned on any domain-specific dataset to incorporate domain-specific knowledge and language patterns. After fine-tuning on the domain-specific dataset, the model is expected to generate more relevant and accurate text within that domain. Although few-shot prompting can also guide the model towards domain-specific generation, the fine-tuning process plays a crucial role in adapting the model’s understanding and generation capabilities to the target domain. The combination of fine-tuning on domain data and effective prompting techniques can enable the model to perform various NLP tasks within that specific domain more effectively.

For input to the model, use a training directory and an optional validation directory. Each directory contains a CSV, JSON, or TXT file. For CSV and JSON files, the train or validation data is taken from the column called text, or from the first column if no column called text is found. The train directory and the validation directory (if provided) should each contain exactly one file.

The output is a trained model that can be deployed for inference.

The following is an example of a TXT file for fine-tuning the text generation model. The TXT file contains SEC filings of Amazon from 2021–2022:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.

GENERAL

Embracing Our Future ...

Instruction fine-tuning

The text generation model can be instruction-tuned on any text data provided that the data is in the expected format. The instruction-tuned model can be further deployed for inference.

For input, use a training and optional validation directory. The train and validation directories should contain one or multiple JSON lines (.jsonl) formatted files. In particular, the train directory can also contain an optional *.json file describing the input and output formats.

The best model is selected according to the validation loss, calculated at the end of each epoch. If a validation set is not given, an (adjustable) percentage of the training data is automatically split and used for validation.

The training data must be formatted in a JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder; however, it can be saved in multiple .jsonl files. The .jsonl file extension is mandatory. The training folder can also contain a template.json file describing the input and output formats. If no template file is given, the following template will be used:

{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:n{instruction}nn### Input:n{context}nn",
    "completion": "{response}"
}

In this case, the data in the JSON lines entries must include prompt and completion fields. If a custom template is provided, it must also use prompt and completion keys to define the input and output templates. The following is a sample custom template:

{
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}"
}

Here, the data in the JSON lines entries must include the question, context, and answer fields.
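For example, a single training record matching this custom template could look like the following JSON line (the values are purely illustrative):

{"question": "When was the company founded?", "context": "The company was founded in 1994 in Seattle, Washington.", "answer": "1994"}

During training, the template is filled with the question and context fields to form the prompt, and the answer field becomes the completion target.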

The output is a trained model that can be deployed for inference.

We provide a subset of Amazon’s SEC filings data, downloaded from the publicly available EDGAR database. For instructions on accessing the data, refer to Accessing EDGAR Data.

License: Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)

Read More

Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval

Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval

Generative artificial intelligence (AI) applications powered by large language models (LLMs) are rapidly gaining traction for question answering use cases. From internal knowledge bases for customer support to external conversational AI assistants, these applications use LLMs to provide human-like responses to natural language queries. However, building and deploying such assistants with responsible AI best practices requires a robust ground truth and evaluation framework to make sure they meet quality standards and user experience expectations, as well as clear evaluation interpretation guidelines to make the quality and responsibility of these systems intelligible to business decision-makers.

This post focuses on evaluating and interpreting metrics using FMEval for question answering in a generative AI application. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, providing standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, refer to Evaluate large language models for quality and responsibility.

In this post, we discuss best practices for working with FMEval in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality. Ground truth data in AI refers to data that is known to be true, representing the expected outcome for the system being modeled. By providing a true expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Ground truth curation and metric interpretation are tightly coupled, and the implementation of the evaluation metric must inform ground truth curation to achieve best results. By following these guidelines, data scientists can quantify the user experience delivered by their generative AI pipelines and communicate meaning to business stakeholders, facilitating ready comparisons across different architectures, such as Retrieval Augmented Generation (RAG) pipelines, off-the-shelf or fine-tuned LLMs, or agentic solutions.

Solution overview

We use an example ground truth dataset (referred to as the golden dataset, shown in the following table) of 10 question-answer-fact triplets. Each triplet contains a fact, plus a question-answer pair that encapsulates the fact and emulates an ideal response, all derived from a knowledge source document. We used Amazon’s Q2 2023 10Q report from the SEC’s public EDGAR dataset as the source document to create the 10 question-answer-fact triplets. The 10Q report contains details on company financials and operations over the Q2 2023 business quarter. The golden dataset applies the ground truth curation best practices discussed in this post for most questions, but not all, to demonstrate the downstream impact of ground truth curation on metric results.

Question Answer Fact
Who is Andrew R. Jassy? Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc. Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon
What were Amazon’s total net sales for the second quarter of 2023? Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. 134.4 billion<OR>134,383 million<OR>134183 million<OR>134.383 billion
Where is Amazon’s principal office located? Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. 410 Terry Avenue North
What was Amazon’s operating income for the six months ended June 30, 2023? Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion. 12.5 billion<OR>12,455 million<OR>12.455 billion
When did Amazon acquire One Medical? Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired. Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023
What was a key challenge faced by Amazon’s business in the second quarter of 2023? Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023. foreign exchange rates
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023? Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion. 50.1 billion<OR>50,067 million<OR>50.067 billion
What were Amazon’s AWS sales for the second quarter of 2023? Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion. 22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold? As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock. 158 million
How many shares of common stock were outstanding as of July 21, 2023? There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. 10317750796<OR>10,317,750,796

We generated responses from three generative AI RAG pipelines (anonymized as Pipeline1, Pipeline2, Pipeline3, as shown in the following figure) and calculated factual knowledge and QA accuracy metrics, evaluating them against the golden dataset. The fact key of the triplet is used for the Factual Knowledge metric ground truth, and the answer key is used for the QA Accuracy metric ground truth. With this, factual knowledge is measured against the fact key, and the ideal user experience in terms of style and conciseness is measured against the question-answer pairs.
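The evaluation runs themselves can be scripted with FMEval by pointing a DataConfig at a JSON Lines file containing the golden questions, ground truth, and precomputed pipeline responses. The following is a minimal sketch, assuming the open source fmeval package; the file name and field names are placeholders chosen for illustration:

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

# Each JSON line holds the golden question, the factual ground truth,
# and the pipeline's precomputed response
data_config = DataConfig(
    dataset_name="golden_dataset_pipeline1",
    dataset_uri="golden_dataset_pipeline1.jsonl",  # placeholder file name
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="fact",
    model_output_location="response",
)

# Treat any ground truth variant separated by <OR> as an acceptable match
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

# Because responses are precomputed, no model runner is needed
eval_output = eval_algo.evaluate(dataset_config=data_config, save=True)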

Diagram outlining three generative AI pipelines run against the golden dataset evaluated using FMEval

Evaluation for question answering in a generative AI application

A generative AI pipeline can have many subcomponents, such as a RAG pipeline. RAG is a methodology to improve the accuracy of LLM responses answering a user query by retrieving and inserting relevant domain knowledge into the language model prompt. RAG quality depends on the configurations of the retriever (chunking, indexing) and generator (LLM selection and hyperparameters, prompt), as illustrated in the following figure. Tuning chunking and indexing in the retriever makes sure the correct content is available in the LLM prompt for generation. The chunk size and chunk splitting method, as well as the means of embedding and ranking relevant document chunks as vectors in the knowledge store, impacts whether the actual answer to the query is ultimately inserted in the prompt. In the generator, selecting an appropriate LLM to run the prompt, and tuning its hyperparameters and prompt template, all control how the retrieved information is interpreted for the response. With this, when a final response from a RAG pipeline is evaluated, the preceding components may be adjusted to improve response quality.

A retrieval augmented generation pipeline shown in components, including chunking, indexing, LLM, and prompt, resulting in a final output

Alternatively, question answering can be powered by a fine-tuned LLM, or through an agentic approach. Although we demonstrate the evaluation of final responses from RAG pipelines, the final responses from a generative AI pipeline for question answering can be similarly evaluated because the prerequisites are a golden dataset and the generative answers. With this approach, changes in the generative output due to different generative AI pipeline architectures can be evaluated to inform the best design choices (comparing RAG and knowledge retrieval agents, comparing LLMs used for generation, retrievers, chunking, prompts, and so on).

Although evaluating each sub-component of a generative AI pipeline is important in development and troubleshooting, business decisions rely on having an end-to-end, side-by-side data view, quantifying how a given generative AI pipeline will perform in terms of user experience. With this, business stakeholders can understand expected quality changes in terms of end-user experience by switching LLMs, and adhere to legal and compliance requirements, such as ISO42001 AI Ethics. There are further financial benefits to realize; for example, quantifying expected quality changes on internal datasets when switching a development LLM to a cheaper, lightweight LLM in production. The overall evaluation process for the benefit of decision-makers is outlined in the following figure. In this post, we focus our discussion on ground truth curation, evaluation, and interpreting evaluation scores for entire question answering generative AI pipelines using FMEval to enable data-driven decision-making on quality.

The business process flow of evaluation, including golden dataset curation, querying the generative pipeline, evaluating responses, interpreting scores, and making data driven business decisions

A useful mental model for ground truth curation and improvement of a golden dataset is a flywheel, as shown in the following figure. The ground truth experimentation process involves querying your generative AI pipeline with the initial golden dataset questions and evaluating the responses against initial golden answers using FMEval. Then, the quality of the golden dataset must be reviewed by a judge. The judge review of the golden dataset quality accelerates the flywheel towards an ever-improving golden dataset. The judge role in the workflow can be assumed by another LLM to enable scaling against established, domain-specific criteria for high-quality ground truth. Maintaining a human-in-the-loop component to the judge function remains essential to sample and verify results, as well as to increase the quality bar with increasing task complexity. Improvement to the golden dataset fosters improvement to the quality of the evaluation metrics, until sufficient measurement accuracy in the flywheel is met by the judge, using the established criteria for quality. To learn more about AWS offerings on human review of generations and data labeling, such as Amazon Augmented AI (Amazon A2I) and Amazon SageMaker Ground Truth Plus, refer to Using Amazon Augmented AI for Human Review and High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus. When using LLMs as a judge, make sure to apply prompt safety best practices.

A flywheel for ground truth experimentation including: 1 - query LLM pipeline, 2- evaluate against ground truth, 3 - Activate the flywheel by judging ground truth quality, 4 - improving the golden dataset

However, to conduct reviews of golden dataset quality as part of the ground truth experiment flywheel, human reviewers must understand the evaluation metric implementation and its coupling to ground truth curation.

FMEval metrics for question answering in a generative AI application

The Factual Knowledge and QA Accuracy metrics from FMEval provide a way to evaluate custom question answering datasets against ground truth. For a full list of metrics implemented with FMEval, refer to Using prompt datasets and available evaluation dimensions in model evaluation jobs.

Factual Knowledge

The Factual Knowledge metric evaluates whether the generated response contains factual information present in the ground truth answer. It is a binary (0 or 1) score based on a string match. Factual knowledge also reports a quasi-exact string match which performs matching after normalization. For simplicity, we focus on the exact match Factual Knowledge score in this post.

For each golden question:

  • 0 indicates the lowercased factual ground truth is not present in the model response
  • 1 indicates the lowercased factual ground truth is present in the response

QA Accuracy

The QA Accuracy metric measures a model’s question answering accuracy by comparing its generated answers against ground truth answers. The sub-metrics are computed by counting true positive, false positive, and false negative word matches between the QA ground truth answers and the generated answers.

It includes several sub-metrics:

  • Recall Over Words – Scores from 0 (worst) to 1 (best), measuring how much of the QA ground truth is contained in the model output
  • Precision Over Words – Scores from 0 (worst) to 1 (best), measuring how many words in the model output match the QA ground truth
  • F1 Over Words – The harmonic mean of precision and recall, providing a balanced score from 0 to 1
  • Exact Match – Binary 0 or 1, indicating if the model output exactly matches the QA ground truth
  • Quasi Exact Match – Similar to Exact Match, but with normalization (lowercasing and removing articles)

Because QA Accuracy metrics are calculated on an exact match basis (for more details, see Accuracy), they may be less reliable for questions where the answer can be rephrased without modifying its meaning. To mitigate this, we propose applying Factual Knowledge as the assessment of factual correctness, motivating the use of a dedicated factual ground truth with minimal word expression, together with QA Accuracy as a measure of idealized user experience in terms of response verbosity and style. We elaborate on these concepts later in this post. The BERTScore is also computed as part of QA Accuracy, which provides a measure of semantic match quality against the ground truth.
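For a single record, these sub-metrics can be computed with FMEval’s QAAccuracy algorithm. The following is a minimal sketch, assuming the open source fmeval package; the example strings reuse the shares-outstanding question from the golden dataset:

from fmeval.eval_algorithms.qa_accuracy import QAAccuracy

eval_algo = QAAccuracy()

# Score one generated answer against the curated QA ground truth answer
scores = eval_algo.evaluate_sample(
    target_output="There were 10,317,750,796 shares of Amazon's common stock "
                  "outstanding as of July 21, 2023.",
    model_output="As of July 21, 2023, there were 10,317,750,796 shares of "
                 "common stock outstanding.",
)

# Each returned EvalScore corresponds to one sub-metric (recall, precision, F1, and so on)
for score in scores:
    print(score.name, score.value)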

Proposed ground truth curation best practices for question answering with FMEval

In this section, we share best practices for curating your ground truth for question answering with FMEval.

Understanding the Factual Knowledge metric calculation

A factual knowledge score is a binary measure of whether a real-world fact was correctly retrieved by the generative AI pipeline. 0 indicates the lowercased expected answer is not part of the model response, whereas 1 indicates it is. When there is more than one acceptable answer and any one of them is considered correct, apply a logical OR operator. A logical AND can also be applied for cases where the factual material encompasses multiple distinct entities. In the present examples, we demonstrate a logical OR, using the <OR> delimiter. See Use SageMaker Clarify to evaluate large language models for information about logical operators. An example curation of a golden question and golden fact is shown in the following table.

Golden Question “How many shares of common stock were outstanding as of July 21, 2023?”
Golden Fact 10,317,750,796<OR>10317750796

Fact detection is useful for assessing hallucination in a generative AI pipeline. The two sample responses in the following table illustrate fact detection. The first example correctly states the fact in the example response, and receives a 1.0 score. The second example hallucinates a number instead of stating the fact, and receives a 0 score.

Metric Example Response Score Calculation Approach
Factual Knowledge “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.” 1.0 String match to golden fact
“Based on the documents provided, Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.” 0.0
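The string match behind these scores can be illustrated in a few lines of plain Python. This is a simplified illustration of exact matching with the <OR> delimiter, not the FMEval implementation:

def factual_knowledge_exact(target_output: str, model_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any acceptable fact variant appears in the lowercased response, else 0."""
    response = model_output.lower()
    return int(any(fact.strip().lower() in response for fact in target_output.split(delimiter)))

print(factual_knowledge_exact(
    "10,317,750,796<OR>10317750796",
    "Based on the documents provided, Amazon had 10,317,750,796 shares of "
    "common stock outstanding as of July 21, 2023.",
))  # prints 1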

In the following example, we highlight the importance of units in ground truth for Factual Knowledge string matching. The golden question and golden fact represent Amazon’s total net sales for the second quarter of 2023.

Golden Question “What were Amazon’s total net sales for the second quarter of 2023?”
Golden Fact 134.4 billion<OR>134,383 million

The first response hallucinates the fact, using units of billions, and correctly receives a score of 0.0. The second response correctly represents the fact, in units of millions. Both units should be represented in the golden fact. The third response was unable to answer the question, flagging a potential issue with the information retrieval step.

Metric Example Response Score Calculation Approach
Factual Knowledge Amazon’s total net sales for the second quarter of 2023 were $170.0 billion. 0.0 String match to golden fact
The total consolidated net sales for Q2 2023 were $134,383 million according to this report. 1.0
Sorry, the provided context does not include any information about Amazon’s total net sales for the second quarter of 2023. Would you like to ask another question? 0.0

Interpreting Factual Knowledge scores

Factual knowledge scores are a useful flag for challenges in the generative AI pipeline, such as hallucination or information retrieval problems. The scores can be compiled into a Factual Knowledge Report for human review, as shown in the following table, to visualize pipeline quality in terms of fact detection side by side.

User Question QA Ground Truth Factual Ground Truth Pipeline 1 Pipeline 2 Pipeline 3
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold? As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock. 158 million 1 1 1
How many shares of common stock were outstanding as of July 21, 2023? There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. 10317750796<OR>10,317,750,796 1 1 1
What was Amazon’s operating income for the six months ended June 30, 2023? Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion. 12.5 billion<OR>12,455 million<OR>12.455 billion 1 1 1
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023? Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion. 50.1 billion<OR>50,067 million<OR>50.067 billion 1 0 0
What was a key challenge faced by Amazon’s business in the second quarter of 2023? Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023. foreign exchange rates 0 0 0
What were Amazon’s AWS sales for the second quarter of 2023? Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion. 22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million 1 0 0
What were Amazon’s total net sales for the second quarter of 2023? Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. 134.4 billion<OR>134,383 million<OR>134183 million<OR>134.383 billion 1 0 0
When did Amazon acquire One Medical? Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired. Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023 1 0 1
Where is Amazon’s principal office located? Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. 410 Terry Avenue North 0 0 0
Who is Andrew R. Jassy? Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc. Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon 1 1 1

Curating Factual Knowledge ground truth

Consider the impact of string matching between your ground truth and LLM responses when curating ground truth for Factual Knowledge. Best practices for curation in consideration of string matching are the following:

  • Use a minimal version of the QA Accuracy ground truth for a factual ground truth containing the most important facts – Because the Factual Knowledge metric uses exact string matching, curating minimal ground truth facts distinct from the QA Accuracy ground truth is imperative. Using QA Accuracy ground truth will not yield a string match unless the response is identical to the ground truth. Apply logical operators as is best suited to represent your facts.
  • Zero factual knowledge scores across the benchmark can indicate a poorly formed golden question-answer-fact triplet – If a golden question doesn’t contain an obvious singular answer, or can be equivalently interpreted multiple ways, reframe the golden question or answer to be specific. In the Factual Knowledge table, a question such as “What was a key challenge faced by Amazon’s business in the second quarter of 2023?” can be subjective, and interpreted with multiple possible acceptable answers. Factual Knowledge scores were 0.0 for all entries because each LLM interpreted a unique answer. A better question would be: “How much did foreign exchange rates reduce Amazon’s International segment net sales?” Similarly, “Where is Amazon’s principal office located?” renders multiple acceptable answers, such as “Seattle,” “Seattle, Washington,” or the street address. The question could be reframed as “What is the street address of Amazon’s principal office?” if this is the desired response.
  • Generate many variations of fact representation in terms of units and punctuation – Different LLMs will use different language to present facts (date formats, engineering units, financial units, and so on). The factual ground truth should accommodate such expected units for the LLMs being evaluated as part of the pipeline. Experimenting with LLMs to automate fact generation from the QA ground truth can help.
  • Avoid false positive matches – Avoid curating ground truth facts that are overly simple. Short, unpunctuated number sequences, for example, can be matched with years, dates, or phone numbers and can generate false positives.

Understanding QA Accuracy metric calculation

We use the following question answer pair to demonstrate how FMEval metrics are calculated, and how this informs best practices in QA ground truth curation.

Golden Question “How many shares of common stock were outstanding as of July 21, 2023?”
Golden Answer “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

In calculating QA Accuracy metrics, the responses and ground truth are first normalized (lowercased, with punctuation, articles, and excess whitespace removed). Then, true positive, false positive, and false negative word matches are computed between the LLM response and the ground truth. The QA Accuracy metrics returned by FMEval include recall, precision, and F1. Exact string matching additionally yields the Exact Match and Quasi-Exact Match metrics. A detailed walkthrough of the calculation and scores is shown in the following tables.

The first table illustrates the accuracy metric calculation mechanism.

Metric Definition Example Score
True Positive (TP) The number of words in the model output that are also contained in the ground truth.

Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”

11
False Positive (FP) The number of words in the model output that are not contained in the ground truth.

Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”

7
False Negative (FN) The number of words that are missing from the model output, but are included in the ground truth.

Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”

3

The following table lists the accuracy scores.

Metric Score Calculation Approach
Recall Over Words 0.786 recall = TP / (TP + FN) = 11 / (11 + 3)
Precision Over Words 0.611 precision = TP / (TP + FP) = 11 / (11 + 7)
F1 0.688 F1 = 2 × (precision × recall) / (precision + recall)
Exact Match 0.0 (Non-normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
Quasi-Exact Match 0.0 (Normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
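These scores can be reproduced by hand from the true positive, false positive, and false negative counts in the preceding table. The following is a simplified illustration of the word-overlap arithmetic, not the FMEval implementation:

tp, fp, fn = 11, 7, 3  # counts from the walkthrough above

recall = tp / (tp + fn)                              # 11 / 14 ≈ 0.786
precision = tp / (tp + fp)                           # 11 / 18 ≈ 0.611
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.688

print(round(recall, 3), round(precision, 3), round(f1, 3))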

Interpreting QA Accuracy scores

The following are best practices for interpreting QA accuracy scores:

  • Interpret recall as closeness to ground truth – The recall metric in FMEval measures the fraction of ground truth words that are in the model response. With this, we can interpret recall as closeness to ground truth.
    • The higher the recall score, the more of the ground truth is included in the model response. If the entire ground truth is included in the model response, recall will be perfect (1.0); if no ground truth is included in the model response, recall will be zero (0.0).
    • Low recall in response to a golden question can indicate a problem with information retrieval, as shown in the example in the following table. A high recall score, however, doesn’t by itself indicate a correct response. A hallucinated fact can present as a single deviated word between the model response and the ground truth while still yielding a high true positive rate in word matching. For such cases, you can complement QA Accuracy scores with Factual Knowledge assessments of golden questions in FMEval (we provide examples later in this post).
 Interpretation Question Curated Ground Truth High Closeness to Ground Truth Low Closeness to Ground Truth
Interpreting Closeness to Ground Truth Scores “How many shares of common stock were outstanding as of July 21, 2023?” “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” 0.923 “Sorry, I do not have access to documents containing common stock information about Amazon.” 0.111
  • Interpret precision as conciseness to ground truth – The higher the score, the closer the LLM response is to the ground truth in terms of conveying ground truth information in the fewest number of words. By this definition, we recommend interpreting precision scores as a measure of conciseness to the ground truth. The following table demonstrates LLM responses that show high conciseness to the ground truth and low conciseness. Both answers are factually correct, but the reduction in precision is derived from the higher verbosity of the LLM response relative to the ground truth.
 Interpretation Question Curated Ground Truth High Conciseness to Ground Truth Low Conciseness to Ground Truth
Interpreting Conciseness to Ground Truth “How many shares of common stock were outstanding as of July 21, 2023?” “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding. 1.0

“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.

Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states:

‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’

Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”

0.238
  • Interpret F1 score as combined closeness and conciseness to ground truth – F1 score is the harmonic mean of precision and recall, and so represents a joint measure that equally weights closeness and conciseness for a holistic score. The highest-scoring responses will contain all the words and remain similarly concise as the curated ground truth. The lowest-scoring responses will differ in verbosity relative to the ground truth and contain a large number of words that are not present in the ground truth. Due to the intermixing of these four qualities, F1 score interpretation is subjective. Reviewing recall and precision independently will clearly indicate the qualities of the generative responses in terms of closeness and conciseness. Some examples of high and low F1 scores are provided in the following table.
 Interpretation Question Curated Ground Truth High Combined Closeness x Conciseness Low Combined Closeness x Conciseness
Interpreting Closeness and Conciseness to Ground Truth “How many shares of common stock were outstanding as of July 21, 2023?” “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” 0.96

“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.

Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states:

‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’

Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”

0.364
  • Combine factual knowledge with recall for detection of hallucinated facts and false fact matches – Factual Knowledge scores can be interpreted in combination with recall metrics to distinguish likely hallucinations and false positive facts. For example, the following cases can be caught, with examples in the following table:
    • High recall with zero factual knowledge suggests a hallucinated fact.
    • Zero recall with positive factual knowledge suggests an accidental match between the factual ground truth and an unrelated entity such as a document ID, phone number, or date.
    • Low recall and zero factual knowledge may also suggest a correct answer that has been expressed with alternative language to the QA ground truth. Improved ground truth curation (increased question specificity, more ground truth fact variants) can remediate this problem. The BERTScore can also provide semantic context on match quality.
Interpretation QA Ground Truth Factual Ground Truth Factual Knowledge Recall Score LLM response
Hallucination detection Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. 134.4 billion<OR>134,383 million 0 0.92 Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.
Detect false positive facts There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.

10317750796<OR>

10,317,750,796

1.0 0.0 Document ID: 10317750796
Correct answer, expressed in different words to ground truth question-answer-fact Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. 410 Terry Avenue North 0 0.54 Amazon’s principal office is located in Seattle, Washington.

Curating QA Accuracy ground truth

Consider the impact of true positive, false positive, and false negative matches between your golden answer and LLM responses when curating your ground truth for QA Accuracy. Best practices for curation in consideration of string matching are as follows:

  • Use LLMs to generate initial golden questions and answers – This is beneficial in terms of speed and level of effort; however, outputs must be reviewed and further curated if necessary before acceptance (see Step 3 of the ground truth experimentation flywheel earlier in this post). Furthermore, applying an LLM to generate your ground truth may bias correct answers towards that LLM, for example, due to string matching of filler words that the LLM commonly uses in its language expression that other LLMs may not. Keeping ground truth expressed in an LLM-agnostic manner is a gold standard.
  • Have a human review golden answers for proximity to the desired output – Your golden answers should reflect your standard for the user-facing assistant in terms of factual content and verbiage. Consider the desired level of verbosity and choice of words you expect as outputs based on your production RAG prompt template. Overly verbose ground truths, and ground truths that adopt language unlikely to appear in the model output, will unnecessarily increase false negative scores. Before accepting LLM-generated golden answers, human curation should confirm the desired verbosity and word choice in addition to the accuracy of the information, so that evaluation metrics are computed relative to a true golden standard. Apply guardrails on the verbosity of the ground truth, such as controlling word count, as part of the generation process.
  • Compare LLM accuracy using recall – Closeness to ground truth is the best indicator of word agreement between the model response and the ground truth. When golden answers are curated properly, a low recall suggests strong deviation between the ground truth and the model response, whereas a high recall suggests strong agreement.
  • Compare verbosity using precision – When golden answers are curated properly, verbose LLM responses decrease precision scores due to false positives present, and concise LLM responses are rewarded by high precision scores. If the golden answer is highly verbose, however, concise model responses will incur false negatives.
  • Experiment to determine recall acceptability thresholds for generative AI pipelines – A recall threshold for the golden dataset can be set to determine cutoffs for pipeline quality acceptability.
  • Interpret QA accuracy metrics in conjunction with other metrics to pass judgement on accuracy – Metrics such as Factual Knowledge can be combined with QA Accuracy scores to judge factual knowledge in addition to ground truth word matching.

Key takeaways

Curating appropriate ground truth and interpreting evaluation metrics in a feedback loop is crucial for effective business decision-making when deploying generative AI pipelines for question answering.

There were several key takeaways from this experiment:

  • Ground truth curation and metric interpretation are a cyclical process – Understanding how the metrics are calculated should inform the ground truth curation approach to achieve the desired comparison.
  • Low-scoring evaluations can indicate problems with ground truth curation in addition to generative AI pipeline quality – Using golden datasets that don’t reflect true answer quality (misleading questions, incorrect answers, or ground truth answers that don’t reflect the expected response style) can be the root cause of poor evaluation results for a successful pipeline. When golden dataset curation is in place, low-scoring evaluations will correctly flag pipeline problems.
  • Balance recall, precision, and F1 scores – Find the balance between acceptable recall (closeness to ground truth), precision (conciseness to ground truth), and F1 scores (combined) through iterative experimentation and data curation. Pay close attention to what scores quantify your ideal closeness to ground truth and conciseness to the ground truth based on your data and business objectives.
  • Design ground truth verbosity to the level desired in your user experience – For QA Accuracy evaluation, curate ground truth answers that reflect the desired level of conciseness and word choice expected from the production assistant. Overly verbose or unnaturally worded ground truths can unnecessarily decrease precision scores.
  • Use recall and factual knowledge for setting accuracy thresholds – Interpret recall in conjunction with factual knowledge to assess overall accuracy, and establish thresholds by experimentation on your own datasets. Factual knowledge scores can complement recall to detect hallucinations (high recall, false factual knowledge) and accidental fact matches (zero recall, true factual knowledge).
  • Curate distinct QA and factual ground truths – For a Factual Knowledge evaluation, curate minimal ground truth facts distinct from the QA Accuracy ground truth. Generate comprehensive variations of fact representations in terms of units, punctuation, and formats.
  • Golden questions should be unambiguous – Zero factual knowledge scores across the benchmark can indicate poorly formed golden question-answer-fact triplets. Reframe subjective or ambiguous questions to have a specific, singular acceptable answer.
  • Automate, but verify, with LLMs – Use LLMs to generate initial ground truth answers and facts, with a human review and curation to align with the desired assistant output standards. Recognize that applying an LLM to generate your ground truth may bias correct answers towards that LLM during evaluation due to matching filler words, and strive to keep ground truth language LLM-agnostic.

Conclusion

In this post, we outlined best practices for ground truth curation and metric interpretation when evaluating generative AI question answering using FMEval. We demonstrated how to curate ground truth question-answer-fact triplets in consideration of the Factual Knowledge and QA Accuracy metrics calculated by FMEval. To validate our approach, we curated a golden dataset of 10 question-answer-fact triplets from Amazon’s Q2 2023 10Q report. We generated responses from three anonymized generative AI pipelines and calculated QA Accuracy and Factual Knowledge metrics.

Our primary findings emphasize that ground truth curation and metric interpretation are tightly coupled. Ground truth should be curated with the measurement approach in mind, and metrics can update the ground truth during golden dataset development. We further recommend curating separate ground truths for QA accuracy and factual knowledge, particularly emphasizing setting a desired level of verbosity according to user experience goals, and setting golden questions with unambiguous interpretations. Closeness and conciseness to ground truth are valid interpretations of FMEval recall and precision metrics, and factual knowledge scores can be used to detect hallucinations. Ultimately, the quantification of the expected user experience in the form of a golden dataset for pipeline evaluation with FMEval supports business decision-making, such as choosing between pipeline options, projecting quality changes from development to production, and adhering to legal and compliance requirements.

Whether you are building an internal application, a customer-facing virtual assistant, or exploring the potential of generative AI for your business, this post can help you use FMEval to make sure your projects meet the highest standards of quality and responsibility. We encourage you to adopt these best practices and start evaluating your generative AI question answering pipelines with the FMEval toolkit today.


About the Authors

Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.

Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Andrei Ivanovic is a Data Scientist with AWS Professional Services, with experience delivering internal and external solutions in generative AI, AI/ML, time series forecasting, and geospatial data science. Andrei has a Master’s in CS from the University of Toronto, where he was a researcher at the intersection of deep learning, robotics, and autonomous driving. Outside of work, he enjoys literature, film, strength training, and spending time with loved ones.

Read More

Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock

Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock

This post was co-written with Jerry Liu from LlamaIndex.

Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). By combining the vast knowledge stored in external data sources with the generative power of LLMs, RAG enables you to tackle complex tasks that require both knowledge and creativity. Today, RAG techniques are used in every enterprise, small and large, where generative artificial intelligence (AI) is used as an enabler for solving document-based question answering and other types of analysis.

Although building a simple RAG system is straightforward, building production RAG systems using advanced patterns is challenging. A production RAG pipeline typically operates over larger data volumes and greater data complexity, and must meet a higher quality bar compared to a proof of concept. A broad challenge that developers commonly face is low response quality; the RAG pipeline is not able to sufficiently answer a large number of questions. This can be due to a variety of reasons; the following are some of the most common:

  • Bad retrievals – The relevant context needed to answer the question is missing.
  • Incomplete responses – The relevant context is partially there but not completely. The generated output doesn’t fully answer the input question.
  • Hallucinations – The relevant context is there but the model is not able to extract the relevant information in order to answer the question.

This necessitates more advanced RAG techniques on the query understanding, retrieval, and generation components in order to handle these failure modes.

This is where LlamaIndex comes in. LlamaIndex is an open source library with both simple and advanced techniques that enables developers to build production RAG pipelines. It provides a flexible and modular framework for building and querying document indexes, integrating with various LLMs, and implementing advanced RAG patterns.

Amazon Bedrock is a managed service providing access to high-performing foundation models (FMs) from leading AI providers through a unified API. It offers a wide range of large models to choose from, along with capabilities to securely build and customize generative AI applications. Key advanced features include model customization with fine-tuning and continued pre-training using your own data, as well as RAG to augment model outputs by retrieving context from configured knowledge bases containing your private data sources. You can also create intelligent agents that orchestrate FMs with enterprise systems and data. Other enterprise capabilities include provisioned throughput for guaranteed low-latency inference at scale, model evaluation to compare performance, and AI guardrails to implement safeguards. Amazon Bedrock abstracts away infrastructure management through a fully managed, serverless experience.

In this post, we explore how to use LlamaIndex to build advanced RAG pipelines with Amazon Bedrock. We discuss how to set up the following:

  • Simple RAG pipeline – Set up a RAG pipeline in LlamaIndex with Amazon Bedrock models and top-k vector search
  • Router query – Add an automated router that can dynamically do semantic search (top-k) or summarization over data
  • Sub-question query – Add a query decomposition layer that can decompose complex queries into multiple simpler ones, and run them with the relevant tools
  • Agentic RAG – Build a stateful agent that can use the preceding components (tool use, query decomposition) while also maintaining state, such as conversation history and reasoning, over time

Simple RAG pipeline

At its core, RAG involves retrieving relevant information from external data sources and using it to augment the prompts fed to an LLM. This allows the LLM to generate responses that are grounded in factual knowledge and tailored to the specific query.

For RAG workflows in Amazon Bedrock, documents from configured knowledge bases go through preprocessing, where they are split into chunks, embedded into vectors, and indexed in a vector database. This allows efficient retrieval of relevant information at runtime. When a user query comes in, the same embedding model is used to convert the query text into a vector representation. This query vector is compared against the indexed document vectors to identify the most semantically similar chunks from the knowledge base. The retrieved chunks provide additional context related to the user’s query. This contextual information is appended to the original user prompt before being passed to the FM to generate a response. By augmenting the prompt with relevant data pulled from the knowledge base, the model’s output is able to use and be informed by an organization’s proprietary information sources. This RAG process can also be orchestrated by agents, which use the FM to determine when to query the knowledge base and how to incorporate the retrieved context into the workflow.

The following diagram illustrates this workflow.
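With a knowledge base already configured, this managed retrieve-and-generate flow can be invoked with a single Amazon Bedrock API call. The following is a minimal boto3 sketch; the knowledge base ID and model ARN are placeholders:

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<your-knowledge-base-id>",  # placeholder
            "modelArn": "<foundation-model-arn>",            # placeholder
        },
    },
)

# The generated answer, grounded in the retrieved chunks
print(response["output"]["text"])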

The following is a simplified example of a RAG pipeline using LlamaIndex:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from the local data/ directory
documents = SimpleDirectoryReader("data/").load_data()

# Create a vector store index over the documents
index = VectorStoreIndex.from_documents(documents)

# Query the index through a query engine
query_engine = index.as_query_engine()
response = query_engine.query("What is the capital of France?")

# Print the response
print(response)

The pipeline includes the following steps:

  1. Use the SimpleDirectoryReader to load documents from the “data/” directory.
  2. Create a VectorStoreIndex from the loaded documents. This type of index converts documents into numerical representations (vectors) that capture their semantic meaning.
  3. Create a query engine from the index and query it with the question “What is the capital of France?” The index uses similarity measures to identify the documents most relevant to the query.
  4. The retrieved documents are then used to augment the prompt for the LLM, which generates a response based on the combined information.
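To run this pipeline against Amazon Bedrock models rather than LlamaIndex’s default providers, you can point the library’s global Settings at a Bedrock model before building the index. The following is a minimal sketch; it assumes the llama-index-llms-bedrock integration package is installed, AWS credentials are configured, and the model ID shown is an example to be replaced with a model you have access to:

from llama_index.core import Settings
from llama_index.llms.bedrock import Bedrock

# Route all LlamaIndex LLM calls through Amazon Bedrock (example model ID)
Settings.llm = Bedrock(model="anthropic.claude-3-sonnet-20240229-v1:0")

Embeddings can be switched to a Bedrock Titan embedding model in the same way through the llama-index-embeddings-bedrock package.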

LlamaIndex goes beyond simple RAG and enables the implementation of more sophisticated patterns, which we discuss in the following sections.

Router query

RouterQueryEngine allows you to route queries to different indexes or query engines based on the nature of the query. For example, you could route summarization questions to a summary index and factual questions to a vector store index.

The following is a code snippet from the example notebooks demonstrating RouterQueryEngine:

from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Create summary and vector indices
summary_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

# Define query engines
summary_query_engine = summary_index.as_query_engine()
vector_query_engine = vector_index.as_query_engine()

# Wrap each query engine as a tool with a description the router can use
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions about the document",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific facts from the document",
)

# Create router query engine that selects a tool per query
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
)

# Query the engine
response = query_engine.query("What is the main idea of the document?")

Sub-question query

SubQuestionQueryEngine breaks down complex queries into simpler sub-queries and then combines the answers from each sub-query to generate a comprehensive response. This is particularly useful for queries that span multiple documents. It first breaks down the complex query into sub-questions for each relevant data source, then gathers the intermediate responses and synthesizes a final response that integrates the relevant information from each sub-query. For example, if the original query was “What is the population of the capital city of the country with the highest GDP in Europe,” the engine would first break it down into sub-queries like “What is the highest GDP country in Europe,” “What is the capital city of that country,” and “What is the population of that capital city,” and then combine the answers to those sub-queries into a final comprehensive response.

The following is an example of using SubQuestionQueryEngine:

from llama_index.core.query_engine import SubQuestionQueryEngine

# Create sub-question query engine from a list of QueryEngineTool objects
# (see the sketch that follows for one way to define per-document tools)
sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)

# Query the engine
response = sub_question_query_engine.query(
    "Compare the revenue growth of Uber and Lyft from 2020 to 2021"
)
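
The query_engine_tools list used in the preceding snippet must be defined first. The following is a minimal sketch, assuming two hypothetical per-document indexes (uber_index and lyft_index) built from the respective 10-K filings:

from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Hypothetical per-document query engines built from separate 10-K indexes
uber_engine = uber_index.as_query_engine()
lyft_engine = lyft_index.as_query_engine()

# Describe each tool so the engine knows which data source answers which sub-question
query_engine_tools = [
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber_10k",
            description="Provides information about Uber financials for 2020 and 2021",
        ),
    ),
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description="Provides information about Lyft financials for 2020 and 2021",
        ),
    ),
]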

Agentic RAG

An agentic approach to RAG uses an LLM to reason about the query and determine which tools (such as indexes or query engines) to use and in what sequence. This allows for a more dynamic and adaptive RAG pipeline. The following architecture diagram shows how agentic RAG works on Amazon Bedrock.

Agentic RAG in Amazon Bedrock combines the capabilities of agents and knowledge bases to enable RAG workflows. Agents act as intelligent orchestrators that can query knowledge bases during their workflow to retrieve relevant information and context to augment the responses generated by the FM.

After the initial preprocessing of the user input, the agent enters an orchestration loop. In this loop, the agent invokes the FM, which generates a rationale outlining the next step the agent should take. One potential step is to query an attached knowledge base to retrieve supplemental context from the indexed documents and data sources.

If a knowledge base query is deemed beneficial, the agent makes an InvokeModel call specifically for knowledge base response generation. This fetches relevant document chunks from the knowledge base based on semantic similarity to the current context. These retrieved chunks provide additional information that is included in the prompt sent back to the FM. The model then generates an observation response that is parsed and can trigger further orchestration steps, such as invoking external APIs (through action group AWS Lambda functions) or providing a final response to the user. This agentic orchestration, augmented by knowledge base retrieval, continues until the request is fully handled.
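
To invoke such an agent from application code, you can call the Amazon Bedrock agent runtime with the AWS SDK for Python (Boto3). The following is a minimal sketch; the agent ID, alias ID, and input text are placeholders:

import uuid

import boto3

client = boto3.client("bedrock-agent-runtime")

# Invoke the agent; it orchestrates FM calls, knowledge base queries, and action groups
response = client.invoke_agent(
    agentId="YOUR_AGENT_ID",          # placeholder
    agentAliasId="YOUR_AGENT_ALIAS",  # placeholder
    sessionId=str(uuid.uuid4()),      # reuse the same ID across calls to keep conversation state
    inputText="What was Lyft's revenue growth in 2021?",
)

# The completion is returned as an event stream of chunks
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        completion += chunk["bytes"].decode("utf-8")

print(completion)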

One example of an agent orchestration loop is the ReAct agent, which was initially introduced by Yao et al. ReAct interleaves chain-of-thought and tool use. At every stage, the agent takes in the input task along with the previous conversation history and decides whether to invoke a tool (such as querying a knowledge base) with the appropriate input or not.

The following is an example of using the ReAct agent with the LlamaIndex SDK:

from llama_index.core.agent import ReActAgent

# Create a ReAct agent from the defined tools
# (query_engine_tools is a list of QueryEngineTool objects; llm is the Bedrock LLM)
agent = ReActAgent.from_tools(
    query_engine_tools,
    llm=llm,
)

# Chat with the agent
response = agent.chat("What was Lyft's revenue growth in 2021?")

The ReAct agent will analyze the query and decide whether to use the Lyft 10K tool or another tool to answer the question. To try out agentic RAG, refer to the GitHub repo.

LlamaCloud and LlamaParse

LlamaCloud represents a significant advancement in the LlamaIndex landscape, offering a comprehensive suite of managed services tailored for enterprise-grade context augmentation within LLM and RAG applications. This service empowers AI engineers to concentrate on developing core business logic by streamlining the intricate process of data wrangling.

One key component is LlamaParse, a proprietary parsing engine adept at handling complex, semi-structured documents replete with embedded objects like tables and figures, seamlessly integrating with LlamaIndex’s ingestion and retrieval pipelines. Another key component is the Managed Ingestion and Retrieval API, which facilitates effortless loading, processing, and storage of data from diverse sources, including LlamaParse outputs and LlamaHub’s centralized data repository, while accommodating various data storage integrations.

Collectively, these features let you process large volumes of production data, improving response quality and enabling context-aware question answering for RAG applications. To learn more about these features, refer to Introducing LlamaCloud and LlamaParse.

For this post, we use LlamaParse to showcase the integration with Amazon Bedrock. LlamaParse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks. What is unique about LlamaParse is that it is the world’s first generative AI native document parsing service, which allows users to submit documents along with parsing instructions. The key insight behind parsing instructions is that you know what kind of documents you have, so you already know what kind of output you want. The following figure shows a comparison of parsing a complex PDF with LlamaParse vs. two popular open source PDF parsers.

A green highlight in a cell means that the RAG pipeline correctly returned the cell value as the answer to a question over that cell. A red highlight means that the question was answered incorrectly.
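
To illustrate the parsing instructions concept, the following is a minimal sketch; the instruction text and file name are hypothetical, and a LLAMA_CLOUD_API_KEY environment variable is assumed to be set:

from llama_parse import LlamaParse

# Steer the parser with a natural language parsing instruction
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=(
        "This is an earnings presentation. Preserve every table and "
        "report numeric values exactly as they appear on each slide."
    ),
)

# Hypothetical file name, used here for illustration only
documents = parser.load_data("q3_2023_earnings_deck.pdf")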

Integrate Amazon Bedrock and LlamaIndex to build an advanced RAG pipeline

In this section, we show you how to build an advanced RAG stack that combines LlamaParse and LlamaIndex with Amazon Bedrock services: LLMs, embedding models, and Amazon Bedrock Knowledge Bases.

To use LlamaParse with Amazon Bedrock, you can follow these high-level steps:

  1. Download your source documents.
  2. Send the documents to LlamaParse using the Python SDK:
    import os

    from llama_parse import LlamaParse
    from llama_index.core import SimpleDirectoryReader
    
    parser = LlamaParse(
        api_key=os.environ.get('LLAMA_CLOUD_API_KEY'),  # set via api_key param or in your env as LLAMA_CLOUD_API_KEY
        result_type="markdown",  # "markdown" and "text" are available
        num_workers=4,  # if multiple files passed, split in `num_workers` API calls
        verbose=True,
        language="en",  # optionally define a language; default is "en"
    )
    
    file_extractor = {".pdf": parser}
    reader = SimpleDirectoryReader(
        input_dir='data/10k/',
        file_extractor=file_extractor
    )
    documents = reader.load_data()  # parse the PDFs with LlamaParse

  3. Wait for the parsing job to finish and upload the resulting Markdown documents to Amazon Simple Storage Service (Amazon S3).
  4. Create an Amazon Bedrock knowledge base using the source documents.
  5. Choose your preferred embedding and generation model from Amazon Bedrock using the LlamaIndex SDK:
    from llama_index.llms.bedrock import Bedrock
    from llama_index.embeddings.bedrock import BedrockEmbedding
    
    llm = Bedrock(model="anthropic.claude-v2")
    embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")

  6. Implement an advanced RAG pattern using LlamaIndex. In the following example, we use SubQuestionQueryEngine and a retriever created specifically for Amazon Bedrock knowledge bases (one way to wire these pieces together is sketched after this list):
    from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

  7. Finally, query the engine with your question:
    response = await query_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')
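
The following is a minimal sketch of one way to wire these pieces together; the knowledge base IDs and tool names are placeholders, and llm refers to the Bedrock model chosen in step 5:

from llama_index.core.query_engine import RetrieverQueryEngine, SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

# Retrievers backed by two hypothetical Amazon Bedrock knowledge bases (one per 10-K)
uber_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="UBER_KB_ID",  # placeholder
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)
lyft_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="LYFT_KB_ID",  # placeholder
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

# Wrap each retriever in a query engine that uses the Bedrock LLM from step 5
uber_engine = RetrieverQueryEngine.from_args(uber_retriever, llm=llm)
lyft_engine = RetrieverQueryEngine.from_args(lyft_retriever, llm=llm)

# Expose the engines as tools so sub-questions are routed to the right knowledge base
query_engine_tools = [
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name="uber_10k", description="Uber financials for 2020 and 2021"),
    ),
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name="lyft_10k", description="Lyft financials for 2020 and 2021"),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    llm=llm,
)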

We tested LlamaParse on a challenging, real-world example: asking questions about a document containing Bank of America Q3 2023 financial results. An example slide from the full slide deck (48 complex slides!) is shown below.

Using the procedure outlined above, we asked “What is the trend in digital households/relationships from 3Q20 to 3Q23?” Take a look at the answer generated using LlamaIndex tools vs. the reference answer from human annotation.

LlamaIndex + LlamaParse answer: The trend in digital households/relationships shows a steady increase from 3Q20 to 3Q23. In 3Q20, the number of digital households/relationships was 550K, which increased to 645K in 3Q21, then to 672K in 3Q22, and further to 716K in 3Q23. This indicates consistent growth in the adoption of digital services among households and relationships over the reported quarters.

Reference answer: The trend shows a steady increase in digital households/relationships from 645,000 in 3Q20 to 716,000 in 3Q23. The digital adoption percentage also increased from 76% to 83% over the same period.

The following are example notebooks you can use to try out these steps with your own examples. Note the prerequisite steps, and clean up resources after testing them.

Conclusion

In this post, we explored various advanced RAG patterns with LlamaIndex and Amazon Bedrock. To delve deeper into the capabilities of LlamaIndex and its integration with Amazon Bedrock, check out the following resources:

By combining the power of LlamaIndex and Amazon Bedrock, you can build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks.


About the Author

Shreyas Subramanian is a Principal Data Scientist who helps customers solve their business challenges using machine learning on the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning to accelerate optimization tasks.

Jerry Liu is the co-founder/CEO of LlamaIndex, a data framework for building LLM applications. Before this, he spent his career at the intersection of ML, research, and startups. He led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora.
