AWS achieves ISO/IEC 42001:2023 Artificial Intelligence Management System accredited certification

Amazon Web Services (AWS) is excited to be the first major cloud service provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. ISO/IEC 42001 is an international management system standard that outlines requirements and controls for organizations to promote the responsible development and use of AI systems.

Responsible AI is a long-standing commitment at AWS. From the outset, AWS has prioritized responsible AI innovation and developed rigorous methodologies to build and operate our AI services with consideration for fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency.

AWS is an active stakeholder working with global standard-setting organizations to develop guidelines that play an important role in our industry by improving clarity, definitions and scope, establishing benchmarks for responsible AI practices, and focusing industry efforts on effective options for addressing risk. Our goal is to contribute to and improve AI standards across several critical areas, including risk management, data quality, unwanted bias mitigation, and explainability.

Technical standards, such as ISO/IEC 42001, are significant because they provide a common framework for responsible AI development and deployment, fostering trust and interoperability in an increasingly global and AI-driven technological landscape. Achieving ISO/IEC 42001 certification means that an independent third party has validated that AWS is taking proactive steps to manage risks and opportunities associated with AI development, deployment, and operation. This independent validation gives our customers further assurance of AWS’s commitment to responsible AI and of their own ability to build and operate AI applications responsibly using AWS services.

“At Snowflake, delivering AI capabilities to our customers is a top priority. Our product teams need to build and deploy AI responsibly, and have to depend upon our suppliers to do the same, despite the technical complexity. This is why ISO 42001 is important to us. Having ISO 42001 certification means a company has implemented a thoughtful AI management system. Knowing that AWS has ISO 42001 certified services gives us confidence in their commitment to the responsible development and deployment of their services, and builds trust with our own customers’ confidence in our products,” said Tim Tutt, VP US Public Sector, Snowflake.

An accredited certification, like ISO/IEC 42001, is issued by a certification body that has been recognized by a national or international accreditation authority. This demonstrates that the certification is credible, trustworthy, and based on independent verification. In this case, Schellman Compliance, LLC, an ISO certification body accredited by the ANSI National Accreditation Board (ANAB), granted the certification.

How 123RF saved over 90% of their translation costs by switching to Amazon Bedrock

In the rapidly evolving digital content industry, multilingual accessibility is crucial for global reach and user engagement. 123RF, a leading provider of royalty-free digital content, is an online resource for creative assets, including AI-generated images from text. In 2023, they used Amazon OpenSearch Service to improve discovery of images by using vector-based semantic search. Building on this success, they have now implemented Amazon Bedrock and Anthropic’s Claude 3 Haiku to improve their content moderation a hundredfold and speed up content translation, further enhancing their global reach and efficiency.

Although the company achieved significant success among English-speaking users with its generative AI-based semantic search tool, it faced content discovery challenges in 15 other languages because of English-only titles and keywords. The cost of using Google Translate for continuous translations was prohibitive, and other models such as Anthropic’s Claude Sonnet and OpenAI GPT-4o weren’t cost-effective. Although OpenAI GPT-3.5 met cost criteria, it struggled with consistent output quality. This prompted 123RF to search for a more reliable and affordable solution to enhance multilingual content discovery.

This post explores how 123RF used Amazon Bedrock, Anthropic’s Claude 3 Haiku, and a vector store to efficiently translate content metadata, significantly reduce costs, and improve their global content discovery capabilities.

The challenge: Balancing quality and cost in mass translation

After implementing generative AI-based semantic search and text-to-image generation, 123RF saw significant traction among English-speaking users. This success, however, cast a harsh light on a critical gap in their global strategy: their vast library of digital assets—comprising millions of images, audio files, and motion graphics—needed a similar overhaul for non-English speaking users.

The crux of the problem lay in the nature of their content. User-generated titles, keywords, and descriptions—the lifeblood of searchability in the digital asset world—were predominantly in English. To truly serve a global audience and unlock the full potential of their library, 123RF needed to translate this metadata into 15 different languages. But as they quickly discovered, the path to multilingual content was filled with financial and technical challenges.

The translation conundrum: Beyond word-for-word

Idioms don’t always translate well

As 123RF dove deeper into the challenge, they uncovered layers of complexity that went beyond simple word-for-word translation. The preceding figure shows one particularly difficult example: idioms. A phrase like “The early bird catches the worm,” translated literally, would not convey the meaning as well as an equivalent Spanish idiom such as “A quien madruga, Dios le ayuda.” Another significant hurdle was named entity resolution (NER)—a critical aspect for a service dealing with diverse visual and audio content.

NER involves correctly identifying and handling proper nouns, brand names, specific terminology, and culturally significant references across languages. For instance, a stock photo of the Eiffel Tower should retain its name in all languages, rather than being literally translated. Similarly, brand names like Coca-Cola or Nike should remain unchanged, regardless of the target language.

This challenge is particularly acute in the realm of creative content. Consider a hypothetical stock image titled Young woman using MacBook in a Starbucks. An ideal translation system would need to do the following:

  • Recognize MacBook and Starbucks as brand names that should not be translated
  • Correctly translate Young woman while preserving the original meaning and connotations
  • Handle the preposition in appropriately, which might change based on the grammatical rules of the target language

Moreover, the system needed to handle industry-specific jargon, artistic terms, and culturally specific concepts that might not have direct equivalents in other languages. For instance, how would one translate bokeh effect into languages where this photographic term isn’t commonly used?

These nuances highlighted the inadequacy of simple machine translation tools and underscored the need for a more sophisticated, context-aware solution.

Turning to language models: Large models compared to small models

In their quest for a solution, 123RF explored a spectrum of options, each with its own set of trade-offs:

  • Google Translate – The incumbent solution offered reliability and ease of use. However, it came with a staggering price tag. The company had to clear their backlog of 45 million translations. Adding to this, there was an ongoing monthly financial burden for new content that their customers generated. Though effective, this option threatened to cut into 123RF’s profitability, making it unsustainable in the long run.
  • Large language models – Next, 123RF turned to cutting-edge large language models (LLMs) such as OpenAI GPT-4 and Anthropic’s Claude Sonnet. These models showcased impressive capabilities in understanding context and producing high-quality translations. However, the cost of running these sophisticated models at 123RF’s scale proved prohibitive. Although they excelled in quality, they fell short in cost-effectiveness for a business dealing with millions of short text snippets.
  • Smaller models – In an attempt to find a middle ground, 123RF experimented with less capable models such as OpenAI GPT-3.5. These offered a more palatable price point, aligning better with 123RF’s budget constraints. However, this cost savings came at a price: inconsistency in output quality. The translations, although sometimes acceptable, lacked the reliability and nuance required for professional-grade content description.
  • Fine-tuning – 123RF briefly considered fine-tuning a smaller language model to further reduce cost. However, they understood there would be a number of hurdles: they would have to regularly fine-tune models as new model updates occur, hire subject matter experts to train the models and manage their upkeep and deployment, and potentially manage a model for each of the output languages.

This exploration laid bare a fundamental challenge in the AI translation space: the seemingly unavoidable trade-off between cost and quality. High-quality translations from top-tier models were financially unfeasible, whereas more affordable options couldn’t meet the standard of accuracy and consistency that 123RF’s business demanded.

Solution: Amazon Bedrock, Anthropic’s Claude 3 Haiku, prompt engineering, and a vector store

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Throughout this transformative journey, Amazon Bedrock proved to be the cornerstone of 123RF’s success. Several factors contributed to making it the provider of choice:

  • Model variety – Amazon Bedrock offers access to a range of state-of-the-art language models, allowing 123RF to choose the one best suited for their specific needs, like Anthropic’s Claude 3 Haiku.
  • Scalability – The ability of Amazon Bedrock to handle massive workloads efficiently was crucial for processing millions of translations.
  • Cost-effectiveness – The pricing model of Amazon Bedrock, combined with its efficient resource utilization, played a key role in achieving the dramatic cost reduction.
  • Integration capabilities – The ease of integrating Amazon Bedrock with other AWS services facilitated the implementation of advanced features such as a vector database for dynamic prompting.
  • Security and compliance – 123RF works with user-generated content, and the robust security features of Amazon Bedrock provided peace of mind in handling potentially sensitive information.
  • Flexibility for custom solutions – The openness of Amazon Bedrock to custom implementations, such as the dynamic prompting technique, allowed 123RF to tailor the solution precisely to their needs.

Cracking the code: Prompt engineering techniques

The first breakthrough in 123RF’s translation journey came through a collaborative effort with the AWS team, using the power of Amazon Bedrock and Anthropic’s Claude 3 Haiku. The key to their success lay in the innovative application of prompt engineering techniques—a set of strategies designed to coax the best performance out of LLMs, especially important for cost effective models.

Prompt engineering is crucial when working with LLMs because these models, while powerful, can produce non-deterministic outputs—meaning their responses can vary even for the same input. By carefully crafting prompts, we can provide context and structure that helps mitigate this variability. Moreover, well-designed prompts serve to steer the model towards the specific task at hand, ensuring that the LLM focuses on the most relevant information and produces outputs aligned with the desired outcome. In 123RF’s case, this meant guiding the model to produce accurate, context-aware translations that preserved the nuances of the original content.

Let’s dive into the specific techniques employed.

Assigning a role to the model

The team began by assigning the AI model a specific role—that of an AI language translation assistant. This seemingly simple step was crucial in setting the context for the model’s task. By defining its role, the model was primed to approach the task with the mindset of a professional translator, considering nuances and complexities that a generic language model might overlook.

For example:

You are an AI language translation assistant. 
Your task is to accurately translate a passage of text from English into another specified language.

Separation of data and prompt templates

A clear delineation between the text to be translated and the instructions for translation was implemented. This separation served two purposes:

  • Provided clarity in the model’s input, reducing the chance of confusion or misinterpretation
  • Allowed for simpler automation and scaling of the translation process, because the same prompt template could be used with different input texts

For example:

Here is the text to translate:
<text> {{TEXT}} </text>
Please translate the above text into this language: {{TARGET_LANGUAGE}}

Chain of thought

One of the most innovative aspects of the solution was the implementation of a scratchpad section. This allowed the model to externalize its thinking process, mimicking the way a human translator might work through a challenging passage.

The scratchpad prompted the model to consider the following:

  • The overall meaning and intent of the passage
  • Idioms and expressions that might not translate literally
  • Tone, formality, and style of the writing
  • Proper nouns such as names and places that should not be translated
  • Grammatical differences between English and the target language

This step-by-step thought process significantly improved the quality and accuracy of translations, especially for complex or nuanced content.

K-shot examples

The team incorporated multiple examples of high-quality translations directly into the prompt. This technique, known as K-shot learning, provided the model with a number (K) of concrete examples of the desired output quality and style.

By carefully selecting diverse examples that showcased different translation challenges (such as idiomatic expressions, technical terms, and cultural references), the team effectively trained the model to handle a wide range of content types.

For example:

Examples:
<text>The early bird catches the worm.</text>
<translated_text>El que madruga, Dios le ayuda.</translated_text>

The magic formula: Putting it all together

The culmination of these techniques resulted in a prompt template that encapsulated the elements needed for high-quality, context-aware translation. The following is an example prompt with the preceding steps. The actual prompt used is not shown here.

You are an AI language translation assistant. Your task is to accurately translate a passage of text from English into another specified language. Here is the text to translate:
<text> {{TEXT}} </text>
Please translate the above text into this language: {{TARGET_LANGUAGE}}
Think carefully, in the <scratchpad> section below, think through how you will translate the text while preserving its full meaning and nuance. Consider:
- The overall meaning and intent of the passage
- Idioms and expressions that may not translate literally
- Tone, formality, and style of the writing
- Proper nouns like names and places that should not be translated
- Grammatical differences between English and {{TARGET_LANGUAGE}}
Examples:
<text>The software update is scheduled for next Tuesday.</text>
<translated_text>La actualización del software está programada para el próximo martes.</translated_text>
<text>Breaking news: Elon Musk acquires Twitter for $44 billion.</text>
<translated_text>Última hora: Elon Musk adquiere Twitter por 44 mil millones de dólares.</translated_text>
... [8 more diverse examples] ...
Now provide your final translated version of the text inside <translated_text> tags. Ensure the translation is as accurate and natural-sounding as possible in {{TARGET_LANGUAGE}}. Do not translate any names, places or other proper nouns.
<translated_text>

This template provided a framework for consistent, high-quality translations across a wide range of content types and target languages.
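
To show how a template like this could be used in practice, the following is a minimal Python sketch that sends it to Anthropic’s Claude 3 Haiku through the Amazon Bedrock Runtime API with boto3. The abbreviated template, the translate helper, the temperature setting, and the Region are illustrative assumptions for the sketch rather than 123RF’s production code.

import json
import re
import boto3

# Bedrock Runtime client; assumes AWS credentials and Region are already configured
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Abbreviated version of the prompt template shown above (illustrative)
PROMPT_TEMPLATE = """You are an AI language translation assistant. Your task is to accurately translate a passage of text from English into another specified language.
Here is the text to translate:
<text> {text} </text>
Please translate the above text into this language: {target_language}
Think through the translation in a <scratchpad> section, then provide your final translation inside <translated_text> tags. Do not translate any names, places, or other proper nouns."""


def translate(text: str, target_language: str) -> str:
    """Translate a metadata snippet with Anthropic's Claude 3 Haiku on Amazon Bedrock."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0,  # favor consistent, repeatable translations
        "messages": [
            {"role": "user", "content": PROMPT_TEMPLATE.format(text=text, target_language=target_language)}
        ],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    completion = json.loads(response["body"].read())["content"][0]["text"]
    # Keep only the content of the <translated_text> tags; fall back to the raw completion
    match = re.search(r"<translated_text>(.*?)</translated_text>", completion, re.S)
    return (match.group(1) if match else completion).strip()


print(translate("Young woman using MacBook in a Starbucks", "Spanish"))

Setting the temperature to 0 and extracting only the tagged section are simple ways to keep outputs consistent across millions of short metadata snippets.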

Further refinement: Dynamic prompting for grounding models

Although the initial implementation yielded impressive results, the AWS team suggested further enhancements through dynamic prompting techniques. This advanced approach aimed to make the model even more adaptive and context aware. They adopted the Retrieval Augmented Generation (RAG) technique for creating a dynamic prompt template with K-shot examples relevant to each phrase rather than generic examples for each language. This also allowed 123RF to take advantage of their current catalog of high quality translations to further align the model.

Vector database of high-quality translations

The team proposed creating a vector database for each target language, populated with previous high-quality translations. This database would serve as a rich repository of translation examples, capturing nuances and domain-specific terminologies.

The implementation included the following components:

  1. Embedding generation:
    • Use embedding models such as Amazon Titan or Cohere’s offerings on Amazon Bedrock to convert both source texts and their translations into high-dimensional vectors.
  2. Chunking strategy:
    • To maintain context and ensure meaningful translations, the team implemented a careful chunking strategy:
      1. Each source text (in English) was paired with its corresponding translation in the target language.
      2. These pairs were stored as complete sentences or logical phrases, rather than individual words or arbitrary character lengths.
      3. For longer content, such as paragraphs or descriptions, the text was split into semantically meaningful chunks, ensuring that each chunk contained a complete thought or idea.
      4. Each chunk pair (source and translation) was assigned a unique identifier to maintain the association.
  3. Vector storage:
    • The vector representations of both the source text and its translation were stored together in the database.
    • The storage structure included:
      1. The original source text chunk.
      2. The corresponding translation chunk.
      3. The vector embedding of the source text.
      4. The vector embedding of the translation.
      5. Metadata such as the content type, domain, and any relevant tags.
  4. Database organization:
    • The database was organized by target language, with separate indices or collections for each language pair (for example, English-Spanish and English-French).
    • Within each language pair, the vector pairs were indexed to allow for efficient similarity searches.
  5. Similarity search:
    • For each new translation task, the system would perform a hybrid search to find the most semantically similar sentences from the vector database:
      1. The new text to be translated was converted into a vector using the same embedding model.
      2. A similarity search was performed in the vector space to find the closest matches in the source language.
      3. The corresponding translations of these matches were retrieved, providing relevant examples for the translation task.

This structured approach to storing and retrieving text-translation pairs allowed for efficient, context-aware lookups that significantly improved the quality and relevance of the translations produced by the LLM.
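
To illustrate steps 1–3 above, the following Python sketch generates embeddings with Amazon Titan Text Embeddings V2 on Amazon Bedrock and stores each source/translation pair together with both embeddings and metadata. The in-memory list stands in for the actual vector database (for example, a per-language-pair index in Amazon OpenSearch Service); the field names, model choice, and Region are assumptions made for the sketch.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text: str) -> list[float]:
    """Return the embedding vector for a text using Amazon Titan Text Embeddings V2."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


# In-memory stand-in for a per-language-pair vector index (for example, English-Spanish)
translation_store: list[dict] = []


def add_pair(source: str, translation: str, content_type: str = "title") -> None:
    """Store a source/translation chunk pair together with both embeddings and metadata."""
    translation_store.append(
        {
            "source_text": source,
            "translated_text": translation,
            "source_embedding": embed(source),
            "translation_embedding": embed(translation),
            "metadata": {"content_type": content_type, "language_pair": "en-es"},
        }
    )


add_pair("The early bird catches the worm.", "A quien madruga, Dios le ayuda.", content_type="idiom")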

Putting it all together

The top matching examples from the vector database would be dynamically inserted into the prompt, providing the model with highly relevant context for the specific translation task at hand.

This offered the following benefits:

  • Improved handling of domain-specific terminology and phraseology
  • Better preservation of style and tone appropriate to the content type
  • Enhanced ability to resolve named entities and technical terms correctly

The following is an example of a dynamically generated prompt:

[Standard prompt preamble]
...
Examples:
<text>{{Dynamically inserted similar source text 1}}</text>
<translated_text>{{Corresponding high-quality translation 1}}</translated_text>
<text>{{Dynamically inserted similar source text 2}}</text>
<translated_text>{{Corresponding high-quality translation 2}}</translated_text>
...
[Rest of the standard prompt]

This dynamic approach allowed the model to continuously improve and adapt, using the growing database of high-quality translations to inform future tasks.
The following diagram illustrates the process workflow.

How to ground translations with a vector store

The process includes the following steps (a minimal code sketch follows the list):

  1. Convert the new text to be translated into a vector using the same embeddings model.
  2. Compare text and embeddings against a database of high-quality existing translations.
  3. Combine similar translations with an existing prompt template of generic translation examples for target language.
  4. Send the new augmented prompt with initial text to be translated to Amazon Bedrock.
  5. Store the output of the translation in an existing database or to be saved for human-in-the-loop evaluation.
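
The following Python sketch illustrates steps 1 through 4 under the same assumptions as the previous sketches: Titan embeddings on Amazon Bedrock, an in-memory list of stored pairs in place of a real vector database, and plain cosine similarity standing in for the hybrid search described above.

import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text: str) -> list[float]:
    # Same Titan Text Embeddings V2 call as in the previous sketch
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def top_k_examples(store: list[dict], new_text: str, k: int = 5) -> list[dict]:
    # Steps 1-2: embed the new text and rank stored source texts by similarity
    query = embed(new_text)
    ranked = sorted(store, key=lambda pair: cosine(query, pair["source_embedding"]), reverse=True)
    return ranked[:k]


def build_dynamic_prompt(store: list[dict], new_text: str, target_language: str) -> str:
    # Step 3: splice the retrieved pairs into the examples section of the prompt
    examples = "\n".join(
        f"<text>{p['source_text']}</text>\n<translated_text>{p['translated_text']}</translated_text>"
        for p in top_k_examples(store, new_text)
    )
    return (
        "You are an AI language translation assistant.\n"
        f"Here is the text to translate:\n<text> {new_text} </text>\n"
        f"Please translate the above text into this language: {target_language}\n"
        "Examples:\n" + examples + "\n"
        "Now provide your final translated version inside <translated_text> tags."
    )

# Step 4: send the augmented prompt to Amazon Bedrock, for example with the translate()
# helper shown earlier adapted to accept a prebuilt prompt. Step 5: store the result for
# indexing or human-in-the-loop review.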

The results: A 95% cost reduction and beyond

The impact of implementing these advanced techniques on Amazon Bedrock with Anthropic’s Claude 3 Haiku, together with the engineering effort with AWS account teams, was nothing short of transformative for 123RF. By working with AWS, 123RF was able to achieve a staggering 95% reduction in translation costs. But the benefits extended far beyond cost savings:

  • Scalability – The new solution with Anthropic’s Claude 3 Haiku allowed 123RF to rapidly expand their multilingual offerings. They quickly rolled out translations for 9 languages, with plans to cover all 15 target languages in the near future.
  • Quality improvement – Despite the massive cost reduction, the quality of translations saw a marked improvement. The context-aware nature of the LLM, combined with careful prompt engineering, resulted in more natural and accurate translations.
  • Handling of edge cases – The system showed remarkable prowess in handling complex cases such as idiomatic expressions and technical jargon, which had been pain points with previous solutions.
  • Faster time-to-market – The efficiency of the new system significantly reduced the time required to make new content available in multiple languages, giving 123RF a competitive edge in rapidly updating their global offerings.
  • Resource reallocation – The cost savings allowed 123RF to reallocate resources to other critical areas of their business, fostering innovation and growth.

Looking ahead: Continuous improvement and expansion

The success of this project has opened new horizons for 123RF and set the stage for further advancements:

  • Expanding language coverage – With the cost barrier significantly lowered, 123RF is now planning to expand their language offerings beyond the initial 15 target languages, potentially tapping into new markets and user bases.
  • Anthropic’s Claude 3.5 Haiku – The recent release of Anthropic’s Claude 3.5 Haiku has sparked excitement at 123RF. This model promises even greater intelligence and efficiency, potentially allowing for further refinements in translation quality and cost-effectiveness.
  • Broader AI integration – Encouraged by the success in translation, 123RF is exploring additional use cases for generative AI within their operations. Potential areas include the following:
    • Enhanced image tagging and categorization.
    • Content moderation of user-generated images.
    • Personalized content recommendations for users.
  • Continuous learning loop – The team is working on implementing a feedback mechanism where successful translations are automatically added to the vector database, creating a virtuous cycle of continuous improvement.
  • Cross-lingual search enhancement – Using the improved translations, 123RF is developing more sophisticated cross-lingual search capabilities, allowing users to find relevant content regardless of the language they search in.
  • Prompt catalog – They can explore the newly launched Amazon Bedrock Prompt Management as a way to manage prompt templates and iterate on them effectively.

Conclusion

123RF’s success story with Amazon Bedrock and Anthropic’s Claude is more than just a tale of cost reduction—it’s a blueprint for how businesses can use cutting-edge AI to break down language barriers and truly globalize their digital content. This case study demonstrates the transformative power of innovative thinking, advanced prompt engineering, and the right technological partnership.

123RF’s journey offers the following key takeaways:

  • The power of prompt engineering in extracting optimal performance from LLMs
  • The importance of context and domain-specific knowledge in AI translations
  • The potential of dynamic, adaptive AI solutions in solving complex business challenges
  • The critical role of choosing the right technology partner and platform

As we look to the future, it’s clear that the combination of cloud computing, generative AI, and innovative prompt engineering will continue to reshape the landscape of multilingual content management. The barriers of language are crumbling, opening up new possibilities for global communication and content discovery.

For businesses facing similar challenges in global content discovery, 123RF’s journey offers valuable insights and a roadmap to success. It demonstrates that with the right technology partner and a willingness to innovate, even the most daunting language challenges can be transformed into opportunities for growth and global expansion. If you have a similar use case and want help implementing this technique, reach out to your AWS account teams, or sharpen your prompt engineering skills through our prompt engineering workshop available on GitHub.


About the Author

Fahim Surani is a Solutions Architect at Amazon Web Services who helps customers innovate in the cloud. With a focus on machine learning and generative AI, he works with global digital native companies and financial services to architect scalable, secure, and cost-effective products and services on AWS. Prior to joining AWS, he was an architect, an AI engineer, a mobile games developer, and a software engineer. In his free time he likes to run and read science fiction.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, AWS’ flagship generative AI offering for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.

Connect SharePoint Online to Amazon Q Business using OAuth 2.0 ROPC flow authentication

Enterprises face significant challenges accessing and utilizing the vast amounts of information scattered across organization’s various systems. What if you could simply ask a question and get instant, accurate answers from your company’s entire knowledge base, while accounting for an individual user’s data access levels?

Amazon Q Business is a game changing AI assistant that’s revolutionizing how enterprises interact with their data. With Amazon Q Business, you can access relevant information through natural language conversations, drawing insights from diverse data sources within your organization, adhering to the permissions granted to your user account.

At its core, Amazon Q Business works by first indexing the content from a variety of data sources using built-in data source connectors. These connectors function as an integration layer, unifying content from diverse systems such as Salesforce, Microsoft Exchange, and SharePoint into a centralized index. This consolidated index powers the natural language processing and response generation capabilities of Amazon Q. When a user asks a question using the built-in web experience, Amazon Q Business retrieves relevant content from the index, taking into account user profiles and permissions. It then uses large language models (LLMs) to provide accurate, personalized, and well-written responses based on the consolidated data.

For a full list of Amazon Q supported data source connectors, refer to Supported connectors.

The OAuth 2.0 Resource Owner Password Credentials (ROPC) flow authentication approach described in this post is useful when you need Amazon Q Business to crawl through OneNote, when certificate-based authentication is not preferred, or when your organization has a strict policy that requires regular password rotation. For a complete list of authentication mechanisms, refer to SharePoint (Online) connector overview.

We provide a step-by-step guide for the Azure AD configuration and demonstrate how to set up the Amazon Q connector to establish this secure integration.

Solution overview

SharePoint is a web-based solution developed by Microsoft that enables organizations to collaborate, manage documents, and share information efficiently. It offers a wide range of features, including using document libraries, viewing lists, publishing pages, sharing events and links, and allowing users to make comments, making it a great tool for team collaboration and content management.

After integrating SharePoint Online with Amazon Q Business, you can ask questions using natural language about the content stored in the SharePoint sites. For example, if your organization’s human resources team manages an internal SharePoint site and maintains a list of holidays for geographical regions, you can ask, “What are the company holidays for this year?” Amazon Q Business will then list region-specific holidays based on your location (country).

The following diagram illustrates the solution architecture. In the upcoming sections, we show you how to implement this architecture. After you integrate Amazon Q Business using the SharePoint connector, Amazon Q Business will crawl through the SharePoint content and update the index whenever content changes. Each published event, page, link, file, comment, OneNote, and attachment on the SharePoint site is treated as a document. In addition to the documents, it also crawls through access control lists (ACLs) for each document (user and group information) and stores them in the index. This allows end-users to see chat responses generated only from the documents they have access to.

You can configure Azure AD using either of the following methods:

  • Use the Azure AD console GUI – This is a manual process
  • Use the provided PowerShell script – This is an automated process that takes in the inputs and configures the required permissions

We demonstrate both methods in the following sections.

Prerequisites

To follow along, you need the following prerequisites:

  • The user performing these steps should be a global administrator on Azure AD/Entra ID.
  • You need to configure Microsoft Entra ID and AWS IAM Identity Center. For details, see Configure SAML and SCIM with Microsoft Entra ID and IAM Identity Center.
  • You need a Microsoft Windows instance to run PowerShell scripts and commands with PowerShell 7.4.1+. Details of the required PowerShell modules are described in the following steps.
  • The user should have administrator permissions on the Windows instance.
  • The user running these PowerShell commands should have the right M365 license (for example, M365 E3).

Configure Azure AD using the Azure AD console

To configure Azure AD using the GUI, complete the steps in this section.

Register an Azure AD application

Complete the following steps to register an Azure AD application in the Azure AD tenant that is linked to the SharePoint Online/O365 tenant:

  1. Open the Office 365 Admin Center using the account of a user who is a member of the Tenant Global Admins group.
  2. Navigate to Microsoft Azure Portal.
  3. Search for and choose App registrations.

  1. Choose New registration.

  1. Enter a name for your application, select who can use this application, then choose Register.

An application will be created. You will see a page like the following screenshot.

  1. Note the values for Display name, Application (client) ID, and Directory (tenant) ID. These IDs will be different than what is shown in the screenshot.

Now you can configure the newly registered application with Microsoft Graph and SharePoint API permissions.

When configuring permissions, you have two different options:

  • Option 1 – Allow access to a specific set of SharePoint sites by granting the Sites.Selected permission
  • Option 2 – Allow access to all SharePoint sites by granting the Sites.FullControl.All permission

For option 1, install the MS Graph PowerShell SDK as a prerequisite.

Option 1: Manually allow access to specific SharePoint sites

If you choose option 1, to grant access to specific sites instead of all sites, you need to complete additional prerequisites.

Make sure you have access to another application in Microsoft Entra ID with Sites.FullControl.All application-level permissions, along with its client ID and client secret. This application won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions only to the application you just registered. If you don’t have access to an application with Sites.FullControl permissions, you can follow the previous steps to register a new application and grant Sites.FullControl as described in option 2. We refer to this application as SitesFullControlApp.

To configure your permissions using option 1, complete the following steps:

  1. In the navigation pane, choose API permissions.
  2. Choose the options menu (three dots) next to the permissions that were granted by default when you registered the application, and remove those permissions.

  1. Choose Add a permission and then choose Microsoft Graph.

  1. Choose Delegated permissions.

  1. Select Sites.Selected and choose Add permissions.

  1. Add the following Microsoft Graph API delegated permissions:
    1. GroupMember.Read.All
    2. User.Read.All

You will see the permissions listed as shown in the following screenshot.

  1. Some of these permissions require admin consent in a tenant before they can be used. To grant admin consent, choose Grant admin consent for <organization name> and choose Yes to confirm.

After granting admin consent, your permissions should look like the following screenshot.

  1. To grant permissions to a specific SharePoint site, you’ll need to obtain the Site ID for that site.
    1. Visit the URL of the SharePoint site in your web browser. The URL will be in the format https://yourcompany.sharepoint.com/sites/{SiteName}.
    2. Log in to the site using valid credentials.
    3. Edit the URL in your browser’s address bar by appending /_api/site/id to the end of {SiteName}. For example, if the original URL was https://yourcompany.sharepoint.com/sites/HumanResources, modify it to https://yourcompany.sharepoint.com/sites/HumanResources/_api/site/id.
    4. Press Enter, and the browser will display the Site ID for that particular SharePoint site collection.

  1. Run the script after gathering the following values:

  • AppName – Display name that you captured earlier.
  • AppID – Application (client) ID that you captured earlier.
  • SitesFullControlAppID – Application (client) ID of the application that was granted Sites.FullControl.All. Having access to such an application is a prerequisite; it won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions to the application you registered.
  • SitesFullControlAppClientSecret – Client secret of the SitesFullControlApp.
  • SiteID – SharePoint Site ID.
  • TenantId – Directory (tenant) ID that you captured earlier.
param(
  [Parameter(Mandatory = $true,
    HelpMessage = "The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was registered")]
  [String]
  $AppID,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was granted with Sites.FullControl.All")]
  [String]
  $SitesFullControlAppID,

  [Parameter(Mandatory = $true,
    HelpMessage = "Client Secret of the APP ID that was granted with Sites.FullControl.All")]
  [string]
  $SitesFullControlAppClientSecret,

  [Parameter(Mandatory = $true,
    HelpMessage = "SharePoint Site ID")]
  [String]
  $SiteId,

  [Parameter(Mandatory = $true,
    HelpMessage = "Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,

  [Parameter(Mandatory = $false)]
  [Switch]
  $StayConnected = $false
)

# Get an access token by authenticating as the application that has Sites.FullControl.All permissions, then use that token to grant the Sites.Selected permission to the newly registered application.

$Scope = "https://graph.microsoft.com/.default"
$TokenEndpoint = "https://login.microsoftonline.com/$TenantId/oauth2/v2.0/token"

# Body of the request

$body = @{
  grant_type    = "client_credentials"
  client_id     = $SitesFullControlAppID
  client_secret = $SitesFullControlAppClientSecret
  scope         = $Scope
}

# Get access token
$response = Invoke-RestMethod -Uri $TokenEndpoint -Method POST -Body $body 

# URL to grant permission to site
$url = "https://graph.microsoft.com/v1.0/sites/$SiteId/permissions"

# Define the body content as JSON string

$Body = @"
{
  "roles": ["fullcontrol"],
  "grantedToIdentities": [{
    "application": {
      "id": "$AppID",
      "displayName": "$AppName"
    }
  }]
}
"@

# Headers
$headers = @{
  "Authorization" = "Bearer $($response.access_token)"
  "Content-Type"  = "application/json"
}
$response = Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $Body

$response

The output from the PowerShell script will look like the following screenshot.

This completes the steps to configure permissions for a specific set of SharePoint site collections.

Option 2: Manually allow access to all SharePoint sites

Complete the following steps to allow full control permissions to all the SharePoint site collections:

  1. In the navigation pane, choose API permissions.
  2. Remove the permissions that were granted by default when you registered the application.
  3. Choose Add a permission and then choose Microsoft Graph.

  1. Choose Delegated permissions.

  1. Select Sites.FullControl.All and choose Add permissions.

Next, you configure Microsoft Graph application permissions.

  1. In the navigation pane, choose API permissions.
  2. Choose Add a permission and then choose Microsoft Graph.
  3. Choose Application permissions.

  1. Add the following application permissions:
    •  GroupMember.Read.All
    •  User.Read.All
    •  Notes.Read.All
    •  Sites.Read.All

Next, you configure SharePoint delegated permissions.

  1. In the navigation pane, choose API permissions.
  2. Choose Add a permission and then choose SharePoint.

  1. Choose Delegated permissions.
  2. Expand AllSites, select AllSites.Read, and choose Add permission.

You will find the permissions listed as shown in the following screenshot.

  1. Some of these permissions require admin consent in a tenant before they can be used. To grant admin consent, choose Grant admin consent for <organization name> and choose Yes to confirm.

After granting admin consent, your permissions will look like the following screenshot.

This completes the steps to configure permissions to allow full control permissions to all the SharePoint site collections.

Create a client secret

Complete the following steps to create a client secret:

  1. In the navigation pane, choose Certificates & secrets.
  2. On the Client secrets tab, choose New client secret.
  3. Enter a value for Description and choose an expiration length for Expires.
  4. Choose Add.

  1. Save the client secret value.

This value is needed while configuring Amazon Q. Client secret values can’t be viewed except for immediately after creation. Be sure to save the secret.

Deactivate multi-factor authentication

To deactivate multi-factor authentication (MFA), sign in to the Microsoft Entra Admin Center as a security administrator or global administrator and disable the security defaults.

  1. In the navigation pane, choose Identity and Overview.
  2. On the Properties tab, choose Manage security defaults.
  3. Choose Disabled.
  4. For Reason for disabling, select Other and enter your reason.
  5. Choose Save.

Configure Azure AD using the provided PowerShell script

When you configure Azure AD with the provided PowerShell scripts, you have the same two options for permissions:

  • Option 1: Grant access to specific SharePoint sites. This approach involves granting the Sites.Selected permission, which allows access to a specific set of SharePoint sites.
  • Option 2: Grant access to all SharePoint sites. This approach involves granting the Sites.FullControl.All permission, which allows access to all SharePoint sites in your organization.

When configuring permissions, consider your organization’s SharePoint access requirements. Many SharePoint admins prefer to grant Amazon Q Business access only to specific sites that need to be crawled, in which case Option 1 with the Sites.Selected permission would be suitable.

For either option, the user running the PowerShell script should be an Azure AD tenant admin or have tenant admin permissions.

Run one of the provided PowerShell scripts, then follow the additional instructions. The scripts will perform the following tasks:

  • Register a new application in Azure AD/Entra ID
  • Configure the required permissions
  • Provide admin consent for the configured permissions

Option 1: Use a script to allow access to specific SharePoint sites

There is one additional prerequisite for option 1 (granting the Sites.Selected permission): you need access to another application in Microsoft Entra ID that has the Sites.FullControl.All application-level permission. This is required to grant the Sites.Selected permission to the new application you will register. If you don’t have access to an application with the Sites.FullControl.All permission, you can follow the steps described earlier to register a new application and grant it the Sites.FullControl.All permission. This application will be referred to as SitesFullControlApp.

 Use the following script to grant permissions to a specific SharePoint site. You need the following information before running the script.

  • AppName – Name of the application that you plan to register.
  • AppID – Application (client) ID of the application that was registered.
  • SitesFullControlAppID – Application (client) ID of the application that was granted Sites.FullControl.All. Having access to such an application is a prerequisite; it won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions to the application you plan to register.
  • SitesFullControlAppClientSecret – Client secret of the SitesFullControlApp.
  • SiteID – SharePoint Site ID.
  • TenantId – Your Azure Active Directory tenant ID.
param(
  [Parameter(Mandatory = $true,
    HelpMessage = "The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was registered")]
  [String]
  $AppID,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was granted with Sites.FullControl.All")]
  [String]
  $SitesFullControlAppID,

  [Parameter(Mandatory = $true,
    HelpMessage = "Client Secret of the APP ID that was granted with Sites.FullControl.All")]
  [string]
  $SitesFullControlAppClientSecret,

  [Parameter(Mandatory = $true,
    HelpMessage = "SharePoint Site ID")]
  [String]
  $SiteId,

  [Parameter(Mandatory = $true,
    HelpMessage = "Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,

  [Parameter(Mandatory = $false)]
  [Switch]
  $StayConnected = $false
)

# Get an access token by authenticating as the application that has Sites.FullControl.All permissions, then use that token to grant the Sites.Selected permission to the newly registered application.

$Scope = "https://graph.microsoft.com/.default"
$TokenEndpoint = "https://login.microsoftonline.com/$TenantId/oauth2/v2.0/token"

# Body of the request

$body = @{
  grant_type    = "client_credentials"
  client_id     = $SitesFullControlAppID
  client_secret = $SitesFullControlAppClientSecret
  scope         = $Scope
}

# Get access token
$response = Invoke-RestMethod -Uri $TokenEndpoint -Method POST -Body $body 

# URL to grant permission to site
$url = "https://graph.microsoft.com/v1.0/sites/$SiteId/permissions"

# Define the body content as JSON string

$Body = @"
{
  "roles": ["fullcontrol"],
  "grantedToIdentities": [{
    "application": {
      "id": "$AppID",
      "displayName": "$AppName"
    }
  }]
}
"@

# Headers
$headers = @{
  "Authorization" = "Bearer $($response.access_token)"
  "Content-Type"  = "application/json"
}
$response = Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $Body

$response

The output from the PowerShell script will look like the following screenshot.

Note down the secret value shown in the output and then close the window for security. You will not be able to retrieve this value again.

Option 2: Use a script to grant access to all SharePoint sites

The following script grants full control permissions to all the SharePoint site collections. You need the following information before running the script.

  • AppName – The name of the application that you plan to register.
param(
  [Parameter(Mandatory=$true,
  HelpMessage="The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory=$false,
  HelpMessage="Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,
  [Parameter(Mandatory=$false)]
  [Switch]
  $StayConnected = $false
)

# Requires an admin
if ($TenantId)
{
  Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All  DelegatedPermissionGrant.ReadWrite.All" -TenantId $TenantId
}
else
{
  Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All  DelegatedPermissionGrant.ReadWrite.All"
}
$SitePermissionAllSitesRead = "4e0d77b0-96ba-4398-af14-3baa780278f4"
$GraphPermissionsGroupMemberReadAll  = "98830695-27a2-44f7-8c18-0c3ebc9698f6"
$GraphPermissionsNotesReadAll = "3aeca27b-ee3a-4c2b-8ded-80376e2134a4"
$GraphPermissionsSitesReadAll = "332a536c-c7ef-4017-ab91-336970924f0d"
$GraphPermissionsUserReadAll = "df021288-bdef-4463-88db-98f22de89214"
$GraphPermissionsSitesFullControlAll = "5a54b8b3-347c-476d-8f8e-42d5c7424d29"

# Sharepoint permissions 
$sharePointResourceId = "00000003-0000-0ff1-ce00-000000000000"
$SitePermission = @(
  @{
  Id= $SitePermissionAllSitesRead  #AllSites.Read (Delegated) – Read items in all site collections
  Type="Scope"
}
)

# Graph permissions 
$graphResourceId =  "00000003-0000-0000-c000-000000000000"
$graphPermissions = @(
    @{
        Id =  $GraphPermissionsGroupMemberReadAll  # GroupMember.Read.All (Application)
        Type = "Role"
    },
    @{
        Id = $GraphPermissionsNotesReadAll # Notes.Read.All (Application)
        Type = "Role"
    },
    @{
        Id = $GraphPermissionsSitesReadAll # Sites.Read.All (Application)
        Type = "Role"
    },
    @{
        Id =  $GraphPermissionsUserReadAll # User.Read.All (Application)
        Type = "Role"
    },
     @{
        Id = $GraphPermissionsSitesFullControlAll # Sites.FullControl.All (Delegated)
        Type = "Scope"
    }
)


$requiredResourceAccess = @()

$graphResourceAccess   = @{
ResourceAppId=$graphResourceId
ResourceAccess= $graphPermissions
}

$spResourceAccess = @{
    ResourceAppId = $sharePointResourceId
    ResourceAccess = $SitePermission
  }

$requiredResourceAccess += $spResourceAccess
$requiredResourceAccess += $graphResourceAccess


# Get context for access to tenant ID
$context = Get-MgContext

# Create app registration
$appRegistration = New-MgApplication -DisplayName $AppName -SignInAudience "AzureADMyOrg" `
 -Web @{ RedirectUris="http://localhost"; } `
 -RequiredResourceAccess   $requiredResourceAccess `
 -AdditionalProperties @{}
Write-Host -ForegroundColor Cyan "App registration created with app ID" $appRegistration.AppId

# Add client secret
#$clientSecret = [System.Net.WebUtility]::UrlEncode(([System.Text.Encoding]::UTF8.GetBytes((New-Guid).ToString() + "abcdefghijklmnopqrstuvwxyz0123456789")))
$clientSecretCredential = Add-MgApplicationPassword -ApplicationId $appRegistration.Id -PasswordCredential @{ displayName  = "Client Secret"; EndDateTime = (Get-Date).AddYears(2) } 
Write-Host -ForegroundColor Cyan "Client secret created "

$secretValue = $clientSecretCredential.SecretText
Write-Host  -ForegroundColor  Red "Secret Text is [$secretValue]"
Write-Host  -ForegroundColor  Red  "Please Clear the screen after noting down the Secret value."
#$clientSecretCredential |  Format-List

# Create corresponding service principal
New-MgServicePrincipal -AppId $appRegistration.AppId -AdditionalProperties @{} | Out-Null
Write-Host -ForegroundColor Cyan "Service principal created"
Write-Host
Write-Host -ForegroundColor Green "Success"
Write-Host

#Admin consent
$scp = Get-MgServicePrincipal -Filter "DisplayName eq '$($AppName)'" 
$app = Get-MgServicePrincipal -Filter "AppId eq '$graphResourceId'" 

New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId  $GraphPermissionsGroupMemberReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsNotesReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsSitesReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsUserReadAll
New-MgOAuth2PermissionGrant -ClientId $scp.id  -consentType "AllPrincipals"  -resourceId $app.Id  -Scope "Sites.FullControl.All"


if ($StayConnected -eq $false)
{
  Disconnect-MgGraph
  Write-Host "Disconnected from Microsoft Graph"
}
else
{
  Write-Host
  Write-Host -ForegroundColor Yellow "The connection to Microsoft Graph is still active. To disconnect, use Disconnect-MgGraph"
}

The output from the PowerShell script will look like the following screenshot.

Note down the secret value shown in the output and then close the window for security. You will not be able to retrieve this value again.

Configure Amazon Q

Make sure you have set up Amazon Q Business with Entra ID as your identity provider as mentioned in the prerequisites. Also, make sure the email ID is in lowercase letters while creating the user in Entra ID.

Follow the instructions in Connecting Amazon Q Business to SharePoint (Online) using the console.

For Step 9 (Authentication), we choose OAuth 2.0 and configure it as follows:

  1. For Tenant ID, enter the tenant ID of your SharePoint account.

This is the directory (tenant) ID in your registered Azure application, in the Azure Portal, as shown in the following screenshot (the IDs will be different for your setup).

  1. For the AWS Secrets Manager secret, create a secret on the Secrets Manager console to store your SharePoint authentication credentials (a programmatic alternative is sketched after these steps):
  2. For Secret name, enter a name for your secret.
  3. For Username, enter the user name for your SharePoint account.
  4. For Password, enter the password for your SharePoint account.
  5. For Client ID, enter the Azure AD client ID generated when you registered SharePoint in Azure AD. This is the application (client) ID created in the Azure Portal when registering the SharePoint application in Azure, as described earlier.
  6. For Client secret, enter the secret generated earlier.
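
If you prefer to create this secret programmatically instead of through the console, the following boto3 sketch shows one way to do it. The secret name and user name are hypothetical, and the key names mirror the fields listed above (user name, password, client ID, client secret); confirm the exact schema the connector expects against a secret created through the Amazon Q Business console.

import json
import boto3

secretsmanager = boto3.client("secretsmanager", region_name="us-east-1")

# Key names are assumptions based on the fields described in this post; confirm the exact
# schema against a secret created through the Amazon Q Business console.
secret_value = {
    "userName": "sharepoint-crawl-user@yourcompany.onmicrosoft.com",  # hypothetical account
    "password": "<sharepoint-password>",
    "clientId": "<application-client-id-from-azure-ad>",
    "clientSecret": "<client-secret-created-earlier>",
}

response = secretsmanager.create_secret(
    Name="QBusiness-SharePoint-ROPC-secret",  # hypothetical secret name
    SecretString=json.dumps(secret_value),
)
print(response["ARN"])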

Frequently asked questions

In this section, we discuss some frequently asked questions.

Amazon Q Business isn’t answering questions

There are a few possible scenarios for this issue. If no users are getting a response from a specific document, verify that you have synced your data source with Amazon Q. Choose View report in the Sync run history section. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.

If a specific user is unable to get responses, verify that their email address in SharePoint matches the email address of the corresponding identity in IAM Identity Center and that it is entered in lowercase in IAM Identity Center. For more information, refer to Known limitations for the SharePoint (Online) connector.

For further troubleshooting guidance, refer to the Troubleshooting section later in this post.

Amazon Q Business isn’t answering any questions that are in the event attachment or comments in the SharePoint site

The connector crawls event attachments only when Events is also chosen as an entity to be crawled. Make sure that you chose the corresponding entities in the sync scope.

Error message that authentication failed

In some cases, you might get the error message “Sharepoint Connector Error code: SPE-5001 Error message: Authentication failed:” when trying to sync.

To address this, validate that the user name, password, clientId, clientSecret, and authType values are correct in the secret that you created for this connector. Verify that MFA is deactivated.

Amazon Q Business is showing old data after an update

After the content has been updated on SharePoint, you must re-sync the contents for the updated data to be picked up by Amazon Q. Go to the data sources, select the SharePoint data source, and choose Sync now. After the sync is complete, verify that the updated data is reflected by running queries on Amazon Q.
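
You can also trigger a sync programmatically with the Amazon Q Business API. The following is a minimal boto3 sketch; the application, index, and data source IDs are placeholders that you would look up on the Amazon Q Business console or with the corresponding list calls.

import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

# Placeholder IDs; retrieve the real values from the Amazon Q Business console
# or with list_applications, list_indices, and list_data_sources.
response = qbusiness.start_data_source_sync_job(
    applicationId="your-application-id",
    indexId="your-index-id",
    dataSourceId="your-sharepoint-data-source-id",
)
print("Sync job started:", response["executionId"])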

Unable to sign in as a new user through the web experience URL

If you experience issues when signing in, clear your browser cookies and sign in as a new user.

Error message that the application is not set up correctly

Verify that the user or group has subscriptions to Amazon Q Business. Check the corresponding user group and then choose Manage access and subscriptions and select the corresponding subscription.

Error message when uploading a file

In some cases, users might get the following message when they upload a file through their user experience: “Your Amazon Q Business subscription doesn’t include file uploads. Please contact your administrator for assistance.”

Troubleshooting

For troubleshooting guidance, refer to Troubleshooting your SharePoint (Online) connector.

Clean up

Complete the following steps to clean up your resources:

  1. Open the Office 365 Admin Center using the account of a user who is a member of the Tenant Global Admins group.
  2. Navigate to the Microsoft Azure Portal.
  3. Search for and choose App registrations.
  4. Select the app you created earlier, then choose Delete.
  5. On the Amazon Q Business console, choose Applications in the navigation pane.
  6. Select the application you created, and on the Actions menu, choose Delete.

Additional capabilities of Amazon Q Business

Amazon Q Business offers much more than just a powerful AI assistant. Explore its other capabilities that allow you to customize the user experience, empower your workforce, and increase productivity:

  • Admin controls and guardrails – Customize your application environment to your organizational needs. Amazon Q Business offers application environment guardrails or chat controls that you can configure to control the end-user chat experience. For example, admins can define specific topics that should be blocked or controlled in the application. You can customize how Amazon Q Business responds when these topics are mentioned by end-users.
  • Amazon Q Apps – Empower your teams to build lightweight, purpose-built applications that streamline tasks and workflows without coding experience. For example, you could build an app that drafts personalized sales emails to customers informing them of new product launches or generates social media content for specified social media networks based on your data.
  • Plugins for Amazon Q Business – Seamlessly integrate with supported third-party services that allow you to perform specific tasks like creating an incident ticket in ServiceNow or raising an issue in Jira—all without leaving the Amazon Q interface.

Conclusion

In this post, we explored how to integrate Amazon Q Business with SharePoint Online using the OAuth 2.0 ROPC flow authentication method. We provided both manual and automated approaches using PowerShell scripts for configuring the required Azure AD settings. Additionally, we demonstrated how to enter those details along with your SharePoint authentication credentials into the Amazon Q console to finalize the secure connection.

The ROPC flow offers an alternative to certificate-based authentication for connecting Amazon Q Business to SharePoint Online. It can be useful when you want Amazon Q Business to crawl OneNote, when you prefer not to manage certificates, or in scenarios that require regular password rotation.

By following this post, enterprises can take advantage of the powerful knowledge mining capabilities of Amazon Q to unlock insights from their SharePoint data repositories and knowledge bases.


About the Author

Ramesh Eega is a Global Accounts Solutions Architect based out of Atlanta, GA. He is passionate about helping customers throughout their cloud journey.

Read More

John Snow Labs Medical LLMs are now available in Amazon SageMaker JumpStart

John Snow Labs Medical LLMs are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that John Snow Labs’ Medical LLM – Small and Medical LLM – Medium large language models (LLMs) are now available on Amazon SageMaker JumpStart. Medical LLM is optimized for the following medical language understanding tasks:

  • Summarizing clinical encounters – Summarizing discharge notes, progress notes, radiology reports, pathology reports, and various other medical reports
  • Question answering on clinical notes or biomedical research – Answering questions about a clinical encounter’s principal diagnosis, test ordered, or a research abstract’s study design or main outcomes

For medical doctors, this tool provides a rapid understanding of a patient’s medical journey, aiding in timely and informed decision-making from extensive documentation. This summarization capability not only boosts efficiency but also makes sure that no critical details are overlooked, thereby supporting optimal patient care and enhancing healthcare outcomes.

In a blind evaluation performed by the John Snow Labs research team, Medical LLM – Small outperformed GPT-4o in medical text summarization, being preferred by doctors 88% more often for factuality, 92% more for clinical relevance, and 68% more for conciseness. The model also excelled in clinical notes question answering, preferred 46% more for factuality, 50% more for relevance, and 44% more for conciseness. In biomedical research question answering, the model was preferred even more dramatically: 175% for factuality, 300% for relevance, and 356% for conciseness. Notably, despite being smaller than competitive models by more than an order of magnitude, the small model performed comparably in open-ended medical question answering tasks.

Medical LLM in SageMaker JumpStart is available in two sizes: Medical LLM – Small and Medical LLM – Medium. The models are deployable on commodity hardware, while still delivering state-of-the-art accuracy. This is significant for medical professionals who need to process millions to billions of patient notes without straining computing budgets.

Both models support a context window of 32,000 tokens, which is roughly 50 pages of text. You can try out the models with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy Medical LLM – Small using SageMaker JumpStart.

About John Snow Labs

John Snow Labs, the AI for healthcare company, provides state-of-the-art software, models, and data to help healthcare and life science organizations put AI to good use. John Snow Labs is the developer behind Spark NLP, Healthcare NLP, and Medical LLMs. Its award-winning medical AI software powers the world’s leading pharmaceuticals, academic medical centers, and health technology companies. John Snow Labs’ medical language models library is by far the most widely used natural language processing (NLP) library among practitioners in the healthcare space (Gradient Flow, The NLP Industry Survey 2022 and the Generative AI in Healthcare Survey 2024).

John Snow Labs’ state-of-the-art AI models for clinical and biomedical language understanding include:

  • Medical language models, consisting of over 2,400 pre-trained models for analyzing clinical and biomedical text
  • Visual language models, focused on understanding visual documents and forms
  • Peer-reviewed, state-of-the-art accuracy on a variety of common medical language understanding tasks
  • Tested for robustness, fairness, and bias

What is SageMaker JumpStart

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models (FMs). ML practitioners can deploy FMs to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy a Medical LLM – Small model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and machine learning operations (MLOps) controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping provide data security. The Medical LLM – Small model is available today for deployment and inference in SageMaker Studio.

Discover the Medical LLM – Small model in SageMaker JumpStart

You can access the FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can discover various models by browsing through different hubs, which are named after model providers. You can find the Medical LLM – Small model in the John Snow Labs hub (see the following screenshot). If you don’t see the Medical LLM – Small model, update your SageMaker Studio version by shutting down and restarting. For more information, refer to Shut down and Update Studio Classic Apps.

You can also find the Medical LLM – Small model by searching for “John Snow Labs” in the search field.

You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it. You will also find two options, Deploy and Preview notebooks. Choosing Deploy will deploy the model and create an endpoint.

Subscribe to the Medical LLM – Small model in AWS Marketplace

This model requires an AWS Marketplace subscription. When you choose Deploy in SageMaker Studio, you will be prompted to subscribe to the AWS Marketplace listing if you don’t already have it. If you are already subscribed, choose Deploy.

If you don’t have an active AWS Marketplace subscription, choose Subscribe. You will be redirected to the listing on AWS Marketplace. Review the terms and conditions and choose Accept offer.

After you’ve successfully subscribed to the model on AWS Marketplace, you can now deploy the model in SageMaker JumpStart.

Deploy the Medical LLM – Small model in SageMaker JumpStart

When you choose Deploy in SageMaker Studio, deployment will start.

You can monitor the progress of the deployment on the endpoint details page that you’re redirected to.

On the same endpoint details page, on the Test inference tab, you can send a test inference request to the deployed model. This is useful if you want to verify that your endpoint responds to requests as expected. The following prompt asks the Medical LLM – Small model a question, with the context followed by the question, and checks the resulting response. Performance metrics, such as execution time, are also included.

You can also test out the medical text summarization response.

Deploy the model and run inference through a notebook

Alternatively, you can choose Open in JupyterLab to deploy the model through the example notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources. You can configure additional parameters as needed, but SageMaker JumpStart enables you to deploy and run inference out of the box with the included code.

The notebook already has the necessary code to deploy the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, refer to the API documentation.

After you deploy the model, you can run real-time or batch inference against the deployed endpoint. The notebook includes example code and instructions for both.
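For reference, the following is a minimal sketch of what programmatic deployment and a real-time inference request could look like with the SageMaker Python SDK. The model ID is a placeholder (copy the exact ID from the model card), and the request payload schema may differ from what the example notebook uses.

from sagemaker.jumpstart.model import JumpStartModel

# Placeholder model ID; copy the exact ID from the Medical LLM – Small model card in SageMaker JumpStart
model = JumpStartModel(model_id="<john-snow-labs-medical-llm-small-model-id>")

# Deploy to a real-time endpoint; this requires an active AWS Marketplace subscription for the model
predictor = model.deploy()

# Example request; check the model's example notebook for the exact payload schema
response = predictor.predict({
    "inputs": "Summarize the following discharge note: ...",
    "parameters": {"max_new_tokens": 256},
})
print(response)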

Clean up

After you’re done running the notebook, delete all resources that you created in the process.
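Assuming you still have the predictor object from the deployment step, a minimal cleanup could look like the following; the example notebook includes equivalent cells.

# Delete the model and the endpoint (including its endpoint configuration) created by the notebook
predictor.delete_model()
predictor.delete_endpoint()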

When deploying the endpoint from the SageMaker Studio console, you can delete it by choosing Delete on the endpoint details page.

If you want to unsubscribe from the model package completely, you need to unsubscribe from the product in AWS Marketplace:

  1. Navigate to the Machine Learning tab on your software subscriptions page.
  2. Locate the listing that you want to cancel the subscription for, then choose Cancel Subscription to cancel the subscription.

Complete these cleanup steps to avoid continued billing for the model.

Conclusion

In this post, we showed you how to get started with the first healthcare-specific model available now in SageMaker JumpStart. Check out SageMaker JumpStart in SageMaker Studio now to get started. To learn more, refer to the following resources:


About the Authors

Art Tuazon is a Solutions Architect on the CSC team at AWS. She supports both AWS Partners and customers on technical best practices. In her free time, she enjoys running and cooking.

Beau Tse is a Partner Solutions Architect at AWS. He focuses on supporting AWS Partners through their partner journey and is passionate about enabling them on AWS. In his free time, he enjoys traveling and dancing.

David Talby is the Chief Technology Officer at John Snow Labs, helping companies apply artificial intelligence to solve real-world problems in healthcare and life science. He was named USA CTO of the Year by the Global 100 Awards and Game Changers Awards in 2022.

Read More

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

Companies across various scales and industries are using large language models (LLMs) to develop generative AI applications that provide innovative experiences for customers and employees. However, building or fine-tuning these pre-trained LLMs on extensive datasets demands substantial computational resources and engineering effort. With the increase in sizes of these pre-trained LLMs, the model customization process becomes complex, time-consuming, and often prohibitively expensive for most organizations that lack the necessary infrastructure and skilled talent.

In this post, we demonstrate how you can address these challenges by using a fully managed environment with Amazon SageMaker Training jobs to fine-tune the Mixtral 8x7B model using PyTorch Fully Sharded Data Parallel (FSDP) and Quantized Low Rank Adaptation (QLoRA).

We guide you through a step-by-step implementation of model fine-tuning on the GEM/viggo dataset, employing the QLoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 Nvidia A100 40GB GPUs).

Business challenge

Today’s businesses are looking to adopt a variety of LLMs to enhance business applications. Primarily, they’re looking for foundation models (FMs) that are open source (that is, models whose weights are openly available) and can offer computational efficiency and versatility. Mistral’s Mixtral 8x7B model, released with open weights under the Apache 2.0 license, is one of the models that has gained popularity with large enterprises due to the high performance that it offers across various tasks. Mixtral employs a sparse mixture of experts (SMoE) architecture, selectively activating only a subset of its parameters for each input during model training. This architecture allows the model to use only about 13B (roughly 28%) of its 46.7B total parameters during inference, making it high performing and efficient.

These FMs work well for many use cases but lack domain-specific information that limits their performance at certain tasks. This requires businesses to use fine-tuning strategies to adapt these large FMs to specific domains, thus improving performance on targeted applications. Due to the growing number of model parameters and the increasing context lengths of these modern LLMs, this process is memory intensive and requires advanced AI expertise to align and optimize them effectively. The cost of provisioning and managing the infrastructure increases the overall cost of ownership of the end-to-end solution.

In the upcoming section, we discuss how you can cost-effectively build such a solution with advanced memory optimization techniques using Amazon SageMaker.

Solution overview

To address the memory challenges of fine-tuning LLMs such as Mixtral, we will adopt the QLoRA method. As shown in the following diagram, QLoRA freezes the original model’s weights and adds low-rank trainable parameters to the transformer layers. QLoRA further uses quantization to represent the actual model’s weights in a compact, optimized format such as 4-bit NormalFloat (NF4), effectively compressing the model and reducing its memory footprint. This enables training and fine-tuning these LLMs even on systems with limited memory while maintaining performance comparable to half-precision fine-tuning. QLoRA’s support for double quantization and paged optimizers reduces the memory footprint further by quantizing the quantization constants and effectively handling any sudden memory demands.

During the forward pass computation of this architecture, the 4-bit weights get dequantized to bfloat16 (BF16) precision. On the other hand, the LoRA adapters continue to operate on BF16 precision data. Both (original weights and adapter output vectors) are then added together element-wise to produce the final result, denoted as h.

During the backward pass of the model, the gradients are computed with respect to only the LoRA parameters, not the original base model weights. Although the dequantized original weights are used in calculations, the original 4-bit quantized weights of the base model remain unchanged.

To adopt the following architecture, we will use the Hugging Face Parameter-Efficient Fine-Tuning (PEFT) library, which integrates directly with bitsandbytes. This way, the QLoRA fine-tuning technique can be adopted with just a few lines of code.

QLoRA operates on a large FM. In the figure below, X denotes the input tokens of the training data, W is the existing model weights (quantized), and Wa, Wb are the segments of the adapters added by QLoRA. The original model’s weights (W) are frozen, and QLoRA adds adapters (Wa, Wb), which are low-rank trainable parameters, onto the existing transformer layer.

QLoRA explanation showing adapters added onto the existing transformer layer

Figure 1: This figure shows how QLoRA operates. The original model’s weights (W) are frozen, and QLoRA adds in adapters (Wa, Wb) onto the existing transformer layer.

Although QLoRA helps optimize memory during fine-tuning, we will use Amazon SageMaker Training to spin up a resilient training cluster, manage orchestration, and monitor the cluster for failures. By offloading the management and maintenance of the training cluster to SageMaker, we reduce both training time and our total cost of ownership (TCO). Using this approach, you can focus on developing and refining the model while using the fully managed training infrastructure provided by SageMaker Training.

Implementation details

We spin up the cluster by calling the SageMaker control plane through APIs or the AWS Command Line Interface (AWS CLI) or using the SageMaker AWS SDK. In response, SageMaker spins up training jobs with the requested number and type of compute instances. In our example, we use one ml.p4d.24xlarge compute instance.

To take complete advantage of this multi-GPU cluster, we use the recent support of QLoRA and PyTorch FSDP. Although QLoRA reduces computational requirements and memory footprint, FSDP, a data/model parallelism technique, will help shard the model across all eight GPUs (one ml.p4d.24xlarge), enabling training the model even more efficiently. Hugging Face PEFT is where the integration happens, and you can read more about it in the PEFT documentation.

QLoRA adapters are added to the linear layers in the model. Together, these layers (for example, transformer layers, gate networks, and feed-forward networks) form the entire model, as shown in the following diagram, and the model is sharded by FSDP across our cluster (shown as small shards in blue).

The following architecture diagram shows how you can use SageMaker Training to have the SageMaker Control Plane spin up a resilient training job cluster. SageMaker downloads the training image from Amazon Elastic Container Registry (Amazon ECR) and will use Amazon Simple Storage Service (Amazon S3) as an input training data source and to store training artifacts.

Architecture Diagram

Figure 3: Architecture Diagram showing how you can utilize SageMaker Training Jobs to spin up a resilient training cluster. Amazon ECR contains the training image, and Amazon S3 contains the training artifacts.

To put this solution into practice, work through the steps in the following sections.

Prerequisites

To perform the solution, you need to have the following prerequisites in place:

  1. Create a Hugging Face User Access Token and get access to the gated repo mistralai/Mixtral-8x7B-v0.1 on Hugging Face.
  2. (Optional) Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring. This is recommended if you’d like to visualize model training specific metrics.
  3. Request a service quota at Service Quotas for 1x ml.p4d.24xlarge on Amazon SageMaker. To request a service quota increase, on the AWS Service Quotas console, navigate to AWS services, Amazon SageMaker, and choose ml.p4d.24xlarge for training job usage.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess and AmazonEC2FullAccess to give required access to SageMaker to run the examples.

This role is for demonstration purposes only. For production, adjust it to your specific security requirements and adhere to the principle of least privilege when defining IAM policies.

  5. (Optional) Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup too.)
  6. Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd 15_mixtral_finetune_qlora

The 15_mixtral_finetune_qlora directory contains the training scripts that you might need to deploy this sample.

Next, we will run the finetune-mixtral.ipynb notebook to fine-tune the Mixtral 8x7B model using QLoRA on SageMaker. Check out the notebook for more details on each step. In the next section, we walk through the key components of the fine-tuning execution.

Solution walkthrough

To perform the solution, follow the steps in the next sections.

Step 1: Set up required libraries

Install the relevant HuggingFace and SageMaker libraries:

!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "py7zr" "peft==0.12.0" --upgrade --quiet

Step 2: Load dataset

In this example, we use the GEM/viggo dataset from Hugging Face. This is a data-to-text generation dataset in the video game domain. The dataset is clean and organized with about 5,000 data points, and the responses are more conversational than information seeking. This type of dataset is ideal for extracting meaningful information from customer reviews. For example, an ecommerce application such as Amazon.com could use a similarly formatted dataset for fine-tuning a model for natural language processing (NLP) analysis to gauge interest in products sold. The results can be used for recommendation engines. Thus, this dataset is a good candidate for fine-tuning LLMs. To learn more about the viggo dataset, check out this research paper.

Load the dataset and convert it to the required prompt structure. The prompt is constructed with the following elements:

  • Target sentence – Think of this as the final review. In the dataset, this is target.
  • Meaning representation – Think of this as a deconstructed review, broken down by attributes such as inform, request, or give_opinion. In the dataset, this is meaning_representation.

Running the following cell gives us the train_set and test_set (training split and testing split, respectively) with structured prompts. We use the Python map function to structure the dataset splits according to our prompt.

def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""
      Given a target sentence, construct the underlying 
      meaning representation ...
      ['inform', 'request', 'give_opinion', 'confirm', 
      'verify_attribute', 'suggest', 'request_explanation', 
      'recommend', 'request_attribute']

      The attributes must be one of the following:
      ['name', 'exp_release_date', 'release_year', 
      'developer', 'esrb', 'rating', 'genres', 
      'player_perspective', 'has_multiplayer', 'platforms', 
      'available_on_steam', 'has_linux_release', 
      'has_mac_release', 'specifier']

      ### Target sentence:
      {data_point["target"]}

      ### Meaning representation:
      {data_point["meaning_representation"]}
    """
    return {"prompt": full_prompt.strip()}

# Load dataset splits from the Hugging Face Hub
from datasets import load_dataset

dataset_name = "GEM/viggo"
train_set = load_dataset(dataset_name, split="train")
test_set = load_dataset(dataset_name, split="test")

# Remove the original columns after mapping, keeping only the structured prompt
columns_to_remove = list(train_set.features)

train_dataset = train_set.map(
  generate_and_tokenize_prompt,
  remove_columns=columns_to_remove,
  batched=False
)

test_dataset = test_set.map(
  generate_and_tokenize_prompt,
  remove_columns=columns_to_remove,
  batched=False
)

Upload the dataset to Amazon S3. This step is crucial because the dataset stored in Amazon S3 will serve as the input data channel for the SageMaker training cluster. SageMaker will efficiently manage the process of distributing this data across the training cluster, allowing each node to access the necessary information for model training.

input_path = f's3://{sess.default_bucket()}/datasets/mixtral'

# Save datasets to s3
train_dataset.to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
test_dataset.to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

We analyze the distribution of prompt tokens to determine the maximum sequence length required for training our model in the upcoming steps.

The following graph shows the prompt tokens plotted. The x-axis is the length of the prompts, and the y-axis is the number of times that length occurs in the training dataset (frequency). We use this to determine the maximum sequence length and pad the rest of the data points accordingly. The maximum number of words in our example is 173.

Input Tokens Distribution

Figure 4: Graph showing the distribution of input token lengths prompted. The x-axis shows the lengths and the y-axis shows the frequency with which those input token lengths occur in the train and test dataset splits.
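A minimal sketch of how such a distribution can be computed is shown below; it assumes the train_dataset created earlier and uses the Mixtral tokenizer (any tokenizer gives a reasonable length estimate). The maximum length observed here informs the max_seq_length value set in the next step.

from transformers import AutoTokenizer

# Requires access to the gated mistralai/Mixtral-8x7B-v0.1 repo (pass your Hugging Face token if needed)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Tokenize every structured prompt and record its length
prompt_lengths = [len(tokenizer.encode(example["prompt"])) for example in train_dataset]

print("Max prompt length:", max(prompt_lengths))
print("Mean prompt length:", sum(prompt_lengths) / len(prompt_lengths))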

Step 3: Configure the parameters for SFTTrainer for the fine-tuning task

We use TrlParser to parse hyperparameters in a YAML file that is required to configure SFTTrainer API for fine-tuning the model. This approach offers flexibility because we can also overwrite the arguments specified in the config file by explicitly passing them through the command line interface.

cat > ./args.yaml <<EOF
model_id: "mistralai/Mixtral-8x7B-v0.1" # Hugging Face model id
max_seq_length: 2048 # based on the prompt length distribution graph
train_dataset_path: "/opt/ml/input/data/train/" # path to where SageMaker saves train dataset
test_dataset_path: "/opt/ml/input/data/test/" # path to where SageMaker saves test dataset
output_dir: "/opt/ml/model/mixtral/adapter" # path to where SageMaker will upload the model
...

num_train_epochs: 1 # number of training epochs
per_device_train_batch_size: 10 # batch size per device during training
gradient_accumulation_steps: 1 # number of steps before performing a backward/update pass
optim: adamw_torch # use torch adamw optimizer
...

bf16: true # use bfloat16 precision
tf32: true # use tf32 precision
gradient_checkpointing: true # use gradient checkpointing to save memory

# offload FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap" # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
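Inside the training script, this YAML file is consumed through TrlParser. The following is a minimal sketch; the dataclass fields shown are illustrative, and the actual script defines its own argument classes.

from dataclasses import dataclass, field
from transformers import TrainingArguments
from trl import TrlParser

@dataclass
class ScriptArguments:
    # Illustrative subset of the fields defined in args.yaml
    model_id: str = field(default="mistralai/Mixtral-8x7B-v0.1")
    max_seq_length: int = field(default=2048)
    train_dataset_path: str = field(default="/opt/ml/input/data/train/")
    test_dataset_path: str = field(default="/opt/ml/input/data/test/")

# TrlParser reads the YAML passed with --config and applies any explicit CLI overrides on top
parser = TrlParser((ScriptArguments, TrainingArguments))
script_args, training_args = parser.parse_args_and_config()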

Step 4: Review the launch script

You are now prepared to fine-tune the model using a combination of PyTorch FSDP and QLoRA. We’ve prepared a script called launch_fsdp_qlora.py that will perform the tasks mentioned in the following steps. The following is a quick review of the key points in this script before launching the training job.

  1. Load the dataset from a JSON file located at the specified path, using the load_dataset function to prepare it for model training.
# Load datasets
train_dataset = load_dataset(
  "json",
  data_files=os.path.join(script_args.train_dataset_path, 
  "dataset.json"),
  split="train",
)
  2. Prepare the tokenizer and the model.

We employ the BitsAndBytes library to configure 4-bit quantization settings for our model, enabling memory-efficient loading and computation.

By setting parameters such as load_in_4bit and bnb_4bit_use_double_quant to True, we enable a dramatic reduction in model size without significant loss in performance. The nf4 quantization type, coupled with bfloat16 compute and storage data types, allows for nuanced control over the quantization process, striking an optimal balance between model compression and accuracy preservation. This configuration enables the deployment of massive models on resource-constrained hardware, making advanced AI more accessible and practical for a wide range of applications.

# Configure model quantization
torch_dtype = torch.bfloat16
quant_storage_dtype = torch.bfloat16

# Configures 4-bit quantization settings for the model
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch_dtype,
  bnb_4bit_quant_storage=quant_storage_dtype,
)

model_loading_params = {
  "quantization_config": quantization_config,
  "torch_dtype": quant_storage_dtype,
  "use_cache": False if 
  training_args.gradient_checkpointing else True
}

# Loads a pre-trained model from the specified model ID
model = AutoModelForCausalLM.from_pretrained(
  script_args.model_id,
  cache_dir="/opt/ml/sagemaker/warmpoolcache",
  **model_loading_params
)
  3. Initiate the training process using SFTTrainer from the Transformer Reinforcement Learning (TRL) library to fine-tune the model. The SFTTrainer simplifies the process of supervised fine-tuning for LLMs. This approach makes it efficient to adapt pre-trained models to specific tasks or domains.

We use the LoraConfig class from the Hugging Face’s PEFT library to configure and add LoRA parameters (also called “adapters”) to the model.

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
  lora_alpha=8,
  lora_dropout=0.05,
  r=16,
  ...
)

################
# Training
################
trainer = SFTTrainer(
  model=model,
  args=training_args,
  train_dataset=train_dataset,
  eval_dataset=test_dataset,
  peft_config=peft_config,
  max_seq_length=script_args.max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  ...
)

trainer.train(resume_from_checkpoint=checkpoint)

Step 5: Fine-tune your model

To fine-tune your model, follow the steps in the next sections.

Launch the training job

You are now ready to launch the training. We use the SageMaker Training estimator, which uses torchrun to initiate distributed training.

The SageMaker estimator simplifies the training process by automating several key tasks in this example:

  1. The SageMaker estimator spins up a training cluster of one ml.p4d.24xlarge instance. SageMaker handles the setup and management of these compute instances, which reduces your TCO.
  2. This estimator also uses one of the pre-built containers managed by SageMaker, PyTorch, which includes an optimized compiled version of the PyTorch framework and its required dependencies and GPU-specific libraries for accelerated computations.
pytorch_estimator = PyTorch(
  entry_point= 'launch_fsdp_qlora.py',
  source_dir="./scripts",
  ...
  framework_version="2.2.0",
  py_version="py310",
  instance_count=1,
  instance_type="ml.p4d.24xlarge",
  sagemaker_session=sess,
  disable_output_compression=True,
  keep_alive_period_in_seconds=1800,
  distribution={"torch_distributed": {"enabled": True}},
  hyperparameters={
    # path to the TRL config (args.yaml) that was uploaded to S3
    "config": "/opt/ml/input/data/config/args.yaml"
  }
)
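The estimator is then launched with the S3 locations created earlier mapped to the train, test, and config input channels. The following is a sketch; train_config_s3_path stands for the S3 location where you uploaded args.yaml.

# Map the uploaded datasets and config file to SageMaker input channels
data = {
  "train": train_dataset_s3_path,
  "test": test_dataset_s3_path,
  "config": train_config_s3_path,  # assumed S3 path of the uploaded args.yaml
}

# Start the training job on the ml.p4d.24xlarge instance defined above
pytorch_estimator.fit(data, wait=True)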

The training process generates trained adapters that will be saved in a default S3 bucket named sagemaker-<region name>-<account_id> for this job.

Monitor your training run

You can monitor training metrics, such as loss and learning rate, for your training run through the Weights & Biases dashboard. The following figures show the results of the training run, where we track GPU utilization and GPU memory utilization.

The example is optimized to use GPU memory to its maximum capacity. Note that increasing the batch size any further will lead to CUDA Out of Memory errors.

The following graph shows the GPU memory utilization (for all eight GPUs) during the training process. You can also observe the GPU memory utilization for any given point in time.

GPU Memory Utilization

Figure 5: This graph shows the GPU Memory utilization plotted for all 8 GPUs in the training job.

The following graph shows the GPU compute utilization (for all eight GPUs) during the training process. You can also observe the GPU compute utilization for any given point in time.

GPU Compute Utilization

Figure 6: This graph shows the GPU Compute utilization plotted for all 8 GPUs in the training job.

Step 6: Merge the trained adapter with the base model for inference

Merge the trained LoRA adapter with the base model. After the merge is complete, run inference to evaluate the results. Specifically, look at how the new fine-tuned and merged model performs compared to the original unmodified Mixtral-8x7B model. The example performs both the adapter merge and the inference in the same launch script, merge_model_adapter.py.

Before launching the training job, review the key components of the merge script:

Use the Hugging Face Transformers library. Specifically, use AutoModelForCausalLM to load the base model from the specified Hugging Face model ID (mistralai/Mixtral-8x7B-v0.1) and PeftModel to load the trained adapter on top of it. We have configured the model loading to use low CPU memory utilization (low_cpu_mem_usage=True) to reduce the CPU-to-GPU communication overhead, and we’ve also used automatic device mapping (device_map="auto") while offloading the model to a designated folder to manage resource constraints.

# Load a Peft model
base_model = AutoModelForCausalLM.from_pretrained(
  model_id,
  low_cpu_mem_usage=True,
  #torch_dtype=torch.float16,
  device_map="auto",
  offload_folder="/opt/ml/model/"
)

# Load the adapter
peft_model = PeftModel.from_pretrained(
  base_model,
  adapter_dir,
  #torch_dtype=torch.float16,  # Set dtype to float16
  offload_folder="/opt/ml/model/"
)

# Merge the base model with the trained adapter
model = peft_model.merge_and_unload()
print("Merge done")

After the model is merged, send inference requests to generate responses.

def generate_text(model, prompt, max_length=500, num_return_sequences=1):
    ...

    input_ids = tokenizer.encode(prompt_input, 
    return_tensors="pt").to(device)

    # Generate text
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            no_repeat_ngram_size=2,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

    # Decode and return the generated text
    generated_texts = [tokenizer.decode(seq, 
    skip_special_tokens=True) for seq in output]

    return generated_texts

print(f"nnn*** Generating Inference on Base Model: {generate_text(base_model,prompt)}nnn")

print(f"***nnn Generating Inference on Trained Model: {generate_text(model,prompt)}nnn")

Step 7: Launch the SageMaker training job to merge the adapter

Run the following script as part of the SageMaker training job.

First, explore the adapters that were saved as part of the training run.

adapter_dir_path=f"{model_artifacts}/mixtral/adapter/"

print(f'\nAdapter S3 Dir path: {adapter_dir_path}\n')

!aws s3 ls {adapter_dir_path}

# Reference Output
Adapter S3 Dir path:s3://sagemaker-<Region>-<Account-ID>/mixtral-8-7b-finetune-2024-09-08-22-27-42-099/output/model/mixtral/adapter/

PRE checkpoint-64/
PRE runs/
2024-09-08 23:08:07       5101 README.md
2024-09-08 23:07:58        722 adapter_config.json
2024-09-08 23:08:06  969174880 adapter_model.safetensors
2024-09-08 23:08:08        437 special_tokens_map.json
2024-09-08 23:08:04    1795596 tokenizer.json
2024-09-08 23:08:04        997 tokenizer_config.json
2024-09-08 23:08:04       5688 training_args.bin

Create and run the PyTorch estimator to configure the training job.

pytorch_estimator_adapter = PyTorch(
  entry_point= 'merge_model_adapter.py',
  source_dir="./scripts",
  job_name=job_name,
  base_job_name=job_name,
  max_run=5800,
  role=role,
  framework_version="2.2.0",
  py_version="py310",
  instance_count=1,
  instance_type="ml.p4d.24xlarge",
  sagemaker_session=sess,
  disable_output_compression=True,
  keep_alive_period_in_seconds=1800,
  hyperparameters={
    "model_id": "mistralai/Mixtral-8x7B-v0.1",  # Hugging Face model id
    "hf_token": "<hf-token>",
    "dataset_name":dataset_name
  }
)

# starting the train job with our uploaded datasets as input
pytorch_estimator_adapter.fit(data, wait=True)

Here’s the target sentence (key prompt) to generate model inference results:

Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. 
Is your opinion true for all games which don't have multiplayer?

Ground truth inference (data label):

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation]) 

Original model inference (that is, meaning representation):

inform(name(Little Big Adventure), has_multiplayer(Little Big Adventure))

Fine-tuned model inference result (that is, meaning representation):

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])

The preceding results compare the inference results of the fine-tuned model against both the ground truth and the inference results of the original unmodified Mixtral 8x7B model. You can observe that the fine-tuned model provides more details and better representation of the meaning than the base model. Run systematic evaluation to quantify the fine-tuned model’s improvements for your production workloads.
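As a starting point for such an evaluation, the following sketch computes an exact-match score over a subset of the test split; it assumes the test_dataset split and the generate_text helper from the earlier steps are available in the same script.

# Exact-match evaluation of generated meaning representations against the ground truth
samples = test_dataset.select(range(100))  # small subset to keep runtime manageable
matches = 0

for example in samples:
    # Split the structured prompt into the model input and the expected meaning representation
    input_part, expected = example["prompt"].split("### Meaning representation:")
    generation = generate_text(model, input_part + "### Meaning representation:")[0]
    predicted = generation.split("### Meaning representation:")[-1].strip()
    if predicted == expected.strip():
        matches += 1

print(f"Exact-match accuracy on {len(samples)} samples: {matches / len(samples):.2%}")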

Clean up

To clean up your resources to avoid incurring any more charges, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. (Optional) Delete the SageMaker Studio domain.
  3. Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
Clean Up

Figure 7: Screenshot showing that there are no training jobs running anymore. This is what your console should look like once you follow the clean-up steps provided

To learn more about cleaning up your provisioned resources, check out Clean up.

Conclusion

In this post, we provided a step-by-step guide to fine-tuning the Mixtral 8x7B MoE model with QLoRA. We used SageMaker Training jobs and the Hugging Face PEFT package for QLoRA, together with bitsandbytes for quantization, to perform the fine-tuning task. The fine-tuning was conducted using the quantized model loaded on a single compute instance, which eliminates the need for a larger cluster. As observed, the model performance improved after just one training epoch.

To learn more about Mistral on AWS and to find more examples, check out the mistral-on-aws GitHub repository. To get started, check out the notebook on the mixtral_finetune_qlora GitHub repository. To learn more about generative AI on AWS, check out Generative AI on AWS, Amazon Bedrock, and Amazon SageMaker.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Nishant Karve is a Sr. Solutions Architect aligned with the healthcare and life sciences (HCLS) domain. He collaborates with large HCLS customers for their generative AI initiatives and guides them from ideation to production.

Read More

Amazon SageMaker Inference now supports G6e instances

Amazon SageMaker Inference now supports G6e instances

As the demand for generative AI continues to grow, developers and enterprises seek more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we are thrilled to announce the availability of G6e instances powered by NVIDIA’s L40S Tensor Core GPUs on Amazon SageMaker. You will have the option to provision nodes with 1, 4, or 8 L40S GPUs, with each GPU providing 48 GB of memory. This launch provides organizations with the capability to use a single-GPU instance—G6e.xlarge—to host powerful open-source foundation models such as Llama 3.2 11B Vision, Llama 2 13B, and Qwen 2.5 14B, offering organizations a cost-effective and high-performing option. This makes it a perfect choice for those looking to optimize costs while maintaining high performance for inference workloads.

The key highlights for G6e instances include:

  • Twice the GPU memory compared to G5 and G6 instances, enabling deployment of large language models in FP16 up to:
    • 14B parameter model on a single GPU node (G6e.xlarge)
    • 72B parameter model on a 4 GPU node (G6e.12xlarge)
    • 90B parameter model on an 8 GPU node (G6e.48xlarge)
  • Up to 400 Gbps of networking throughput
  • Up to 384 GB GPU Memory

Use cases

G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e provides higher performance and is more cost-effective compared to G5 instances, making them an ideal fit for low-latency, real-time use cases such as:

  • Chatbots and conversational AI
  • Text generation and summarization
  • Image generation and vision models

We have also observed that G6e performs well for inference at high concurrency and with longer context lengths. We have provided complete benchmarks in the following section.

Performance

In the following two figures, we see that for longer context lengths of 512 and 1024 tokens, G6e.2xlarge provides up to 37% better latency and 60% better throughput compared to G5.2xlarge for a Llama 3.1 8B model.

In the following two figures, we see that G5.2xlarge throws a CUDA out of memory (OOM) error when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge deploys and serves the model successfully.

In the following two figures, we compare G5.48xlarge (8 GPU node) with the G6e.12xlarge (4 GPU) node, which costs 35% less and is more performant. For higher concurrency, we see that G6e.12xlarge provides 60% lower latency and 2.5 times higher throughput.

In the following figure, we compare the cost per 1,000 tokens when deploying Llama 3.1 70B, which further highlights the cost/performance benefits of using G6e instances compared to G5.

Deployment walkthrough

Prerequisites

To try out this solution using SageMaker, you’ll need the following prerequisites:

Deployment

You can clone the repository and use the notebook provided here.
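If you want a feel for what the deployment looks like, the following is a minimal sketch that hosts a Llama 3.1 8B model on a single-GPU g6e.2xlarge instance using the Hugging Face LLM (TGI) container. The model ID, token, and environment values are assumptions; the notebook in the repository is the reference implementation.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

llm_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface"),
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # gated model; requires an approved access request
        "HUGGING_FACE_HUB_TOKEN": "<hf-token>",
        "SM_NUM_GPUS": "1",            # one L40S GPU on g6e.2xlarge
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    container_startup_health_check_timeout=900,
)

print(predictor.predict({"inputs": "What are G6e instances best suited for?"}))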

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

predictor.delete_predictor()

Conclusion

G6e instances on SageMaker unlock the ability to deploy a wide variety of open source models cost-effectively. With superior memory capacity, enhanced performance, and cost-effectiveness, these instances represent a compelling solution for organizations looking to deploy and scale their AI applications. The ability to handle larger models, support longer context lengths, and maintain high throughput makes G6e instances particularly valuable for modern AI applications. Try the code to deploy with G6e.


About the Authors

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Pavan Kumar Madduri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in Generative AI and is passionate about helping customers harness the power of the cloud. He earned his MS in Information Technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Read More

Orchestrate generative AI workflows with Amazon Bedrock and AWS Step Functions

Orchestrate generative AI workflows with Amazon Bedrock and AWS Step Functions

Companies across all industries are harnessing the power of generative AI to address various use cases. Cloud providers have recognized the need to offer model inference through an API call, significantly streamlining the implementation of AI within applications. Although a single API call can address simple use cases, more complex ones may necessitate the use of multiple calls and integrations with other services.

This post discusses how to use AWS Step Functions to efficiently coordinate multi-step generative AI workflows, such as parallelizing API calls to Amazon Bedrock to quickly gather answers to lists of submitted questions. We also touch on the usage of Retrieval Augmented Generation (RAG) to optimize outputs and provide an extra layer of precision, as well as other possible integrations through Step Functions.

Introduction to Amazon Bedrock and Step Functions

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

AWS Step Functions is a fully managed service that makes it easier to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function helps you scale more easily and change applications more quickly. Step Functions is a reliable way to coordinate components and step through the functions of your application. Step Functions provides a graphical console to arrange and visualize the components of your application as a series of steps. This makes it easier to build and run multi-step applications. Step Functions automatically triggers and tracks each step and retries when there are errors, so your application executes in order and as expected. Step Functions logs the state of each step, so when things do go wrong, you can diagnose and debug problems more quickly. You can change and add steps without even writing code, so you can more easily evolve your application and innovate faster.

Orchestrating parallel tasks using the map functionality

Arrays are fundamental data structures in programming, consisting of ordered collections of elements. In the context of Step Functions, arrays play a crucial role in enabling parallel processing and efficient task orchestration. The map functionality in Step Functions uses arrays to execute multiple tasks concurrently, significantly improving performance and scalability for workflows that involve repetitive operations. Step Functions provides two different mapping strategies for iterating through arrays: inline mapping and distributed mapping, each with its own advantages and use cases.

Inline mapping

The inline map functionality allows you to perform parallel processing of array elements within a single Step Functions state machine execution. This approach is suitable when you have a relatively small number of items to process and when the processing of each item is independent of the others.
Here’s how it works:

  1. You define a Map state in your Step Functions state machine.
  2. Step Functions iterates over the array and runs the specified tasks for each element concurrently.
  3. The results of each iteration are collected and made available for subsequent steps in the state machine.

Inline mapping is efficient for lightweight tasks and helps avoid launching multiple Step Functions executions, which can be more costly and resource intensive. But there are limitations. When using inline mapping, only JSON payloads can be accepted as input, your workflow’s execution history can’t exceed 25,000 entries, and you can’t run more than 40 concurrent map iterations.

Distributed mapping

The distributed map functionality is designed for scenarios where many items need to be processed or when the processing of each item is resource intensive or time-consuming. Instead of handling all items within a single execution, Step Functions launches a separate execution for each item in the array, letting you concurrently process large-scale data sources stored in Amazon Simple Storage Service (Amazon S3), such as a single JSON or CSV file containing large amounts of data, or even a large set of Amazon S3 objects. This approach offers the following advantages:

  • Scalability – By distributing the processing across multiple executions, you can scale more efficiently and take advantage of the built-in parallelism in Step Functions
  • Fault isolation – If one execution fails, it doesn’t affect the others, providing better fault tolerance and reliability
  • Resource management – Each execution can be allocated its own resources, helping prevent resource contention and providing consistent performance

However, distributed mapping can incur additional costs due to the overhead of launching multiple Step Functions executions.

Choosing a mapping approach

In summary, inline mapping is suitable for lightweight tasks with a relatively small number of items, whereas distributed mapping is better suited for resource-intensive tasks or large datasets that require better scalability and fault isolation. The choice between the two mapping strategies depends on the specific requirements of your application, such as the number of items, the complexity of processing, and the desired level of parallelism and fault tolerance.

Another important consideration when building generative AI applications using Amazon Bedrock and Step Functions Map states together would be the Amazon Bedrock runtime quotas. Generally, these model quotas allow for hundreds or even thousands of requests per minute. However, you may run into issues trying to run a large map on models with low requests processed per minute quotas, such as image generation models. In that scenario, you can include a retrier in the error handling of your Map state.

Solution overview

In the following sections, we get hands-on to see how this solution works. Amazon Bedrock has a variety of model choices to address specific needs of individual use cases. For the purposes of this exercise, we use Amazon Bedrock to run inference on Anthropic’s Claude 3.5 Haiku model to receive answers to an array of questions because it’s a performant, fast, and cost-effective option.

Our goal is to create an express state machine in Step Functions using the inline Map state to parse through the JSON array of questions sent by an API call from an application. For each question, Step Functions will scale out horizontally, creating a concurrent call to Amazon Bedrock. After all the answers come back, Step Functions will concatenate them into a single response, which our original calling application can then use for further processing or for displaying to end users.

The payload we send consists of an array of nine Request for Proposal (RFP) questions, as well as a company description:

{
  "questions": [
    "Can you describe your technical capabilities and infrastructure?",
    "What security measures do you have in place to protect data and privacy?",
    "Can you provide case studies or examples of similar projects you have handled?",
    "How do you handle project management, and what tools do you use?",
    "What are your support and maintenance services like?",
    "What is your pricing model?",
    "Can you provide references from other clients?",
    "How do you ensure the scalability of your solution?",
    "What is your approach to data backup and recovery?"
  ],
  "description": "Our company, AnyCompany Tech, boasts a robust technical infrastructure that allows us to handle complex projects with ease. Our strength lies in our dynamic team of experts and our cutting-edge technology, which, when combined, can deliver solutions of any scale. We've worked with clients across the globe, for instance, our project with Example Corp involved a sophisticated upgrade of their system. In terms of security, we prioritize data privacy and have put in place stringent measures to ensure that all data is stored securely. We're quite proud of our project with AnyCompany Networks, where we overhauled their security systems to bolster their data protection capabilities. We use a range of project management tools, including Product-1 and Product-2, which allows us to customize our approach to each client's needs. Our pricing model varies depending on the project, but we always aim to provide cost-effective solutions. We've had numerous positive feedback from our clients, with Example Corp and AnyCompany Networks among those who have expressed satisfaction with our services. We're more than happy to provide further references upon request. Software updates and upgrades are a critical part of our service. We have a dedicated team that ensures all systems are up-to-date and running smoothly. Furthermore, our solutions are designed to be scalable, ensuring that they can grow alongside your business. Lastly, in terms of data backup and recovery, we have a comprehensive plan in place, which includes regular data backups and a robust recovery strategy. We understand the importance of data in today's world and we're committed to ensuring its safety and accessibility at all times."
}
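Once the express state machine described in the following sections exists, a calling application could submit this payload synchronously and receive the concatenated answers in the response. The following is a minimal Boto3 sketch; the state machine ARN is a placeholder.

import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN; use the express state machine you create in the following steps
STATE_MACHINE_ARN = "arn:aws:states:us-west-2:<account-id>:stateMachine:<state-machine-name>"

payload = {
    "questions": ["Can you describe your technical capabilities and infrastructure?"],
    "description": "Our company, AnyCompany Tech, ...",
}

# Express workflows support synchronous execution, so the caller gets the result back in the same call
result = sfn.start_sync_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps(payload),
)
print(json.loads(result["output"]))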

You can use the step-by-step guide in this post or use the prebuilt AWS CloudFormation template in the us-west-2 Region to provision the necessary AWS resources. AWS CloudFormation gives developers and businesses a straightforward way to create a collection of related AWS and third-party resources, and provision and manage them in an orderly and predictable fashion.

Prerequisites

You need the following prerequisites to follow along with this solution implementation:

Create a State Machine and add a Map state

In the AWS console in the us-west-2 Region, launch into Step Functions, and select Get started and Create your own to open a blank canvas in Step Functions Workflow Studio.

Edit the state machine by adding an inline Map state with items sourced from a JSON payload.

Next, tell the Map state where the array of questions is located by selecting Provide a path to items array and pointing it to the questions array using JSONPath syntax. Selecting Modify items with ItemSelector allows you to structure the payload, which is then sent to each of the child workflow executions. Here, we map the description through with no change and use $$.Map.Item.Value to map the question from the array at the index of the map iteration.

Invoke an Amazon Bedrock model

Next, add a Bedrock: InvokeModel action task as the next state within the Map state.

Now you can structure your Amazon Bedrock API calls through Workflow Studio. Because we’re using Anthropic’s Claude 3.5 Haiku model on Amazon Bedrock, we select the corresponding model ID for Bedrock model identifier and edit the provided sample with instructions to incorporate the incoming payload. Depending on which model you select, the payload may have a different structure and prompt syntax.

Build the payload

The prompt you build uses the Amazon States Language intrinsic function States.Format in order to do string interpolation, substituting {} for the variables declared after the string. We must also include .$ after our text key to reference a node in this state’s JSON input.

When building out this prompt, you should be very prescriptive in asking the model to do the following:

  • Answer the questions thoroughly using the following description
  • Not repeat the question
  • Only respond with the answer to the question

We set the max_tokens to 800 to allow for longer responses from Amazon Bedrock. Additionally, you can include other inference parameters such as temperature, top_p, top_k, and stop_sequences. Tuning these parameters can help limit the length or influence the randomness or diversity of the model’s response. For the sake of this example, we keep all other optional parameters as default.

{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 800,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text.$": "States.Format('Answer following question thoroughly, using the following description. Do not repeat the question. Only respond with the answer to the question. Question: {} Description: {}', $.questions.question, $.description)"
        }
      ]
    }
  ]
}

Form the response

To provide a cleaner response back to our calling application, we want to use some options to transform the output of the Amazon Bedrock Task state. First, use ResultSelector to filter the response coming back from the service to pull out the text completion, then add the original input back to the output using ResultPath and finish by filtering the final output using OutputPath. That way you don’t have to see the description being mapped unnecessarily for each array item.

To simulate the state machine being called by an API, choose Execute in Workflow Studio. Using the preceding input, the Step Functions output should look like the following code, although it may vary slightly due to the diversity and randomness of FMs:

{
  "output": [
    {
      "question": "Can you describe your technical capabilities and infrastructure?",
      "response": {
        "completion": "AnyCompany Tech offers a comprehensive technical infrastructure powered by a dynamic expert team and cutting-edge technology capable of handling complex projects at any scale. The company prioritizes data security through stringent protection measures, uses advanced project management tools like Product-1 and Product-2, and offers customizable solutions. They provide global client support with a focus on scalability, software updates, and robust data backup and recovery strategies. Their approach emphasizes cost-effectiveness, client satisfaction, and adaptable technological solutions that can grow with a client's business needs. The team maintains a dedicated focus on keeping systems updated, secure, and responsive to evolving technological requirements."
      }
    },
    {
      "question": "What security measures do you have in place to protect data and privacy?",
      "response": {
        "completion": "At AnyCompany Tech, we have implemented stringent security measures to protect data and privacy. These include a comprehensive data backup and recovery strategy with regular backups, robust security systems that have been successfully deployed for clients like AnyCompany Networks, and dedicated teams ensuring systems are up-to-date. We prioritize data privacy and use advanced technical infrastructure and cutting-edge technology to secure client information. Our approach involves customizable security solutions, scalable systems, and a commitment to protecting data accessibility and safety at all times."
      }
    },
    {
      "question": "Can you provide case studies or examples of similar projects you have handled?",
      "response": {
        "completion": "Based on the description provided, we have successfully completed projects for Example Corp, involving a sophisticated system upgrade, and AnyCompany Networks, where we overhauled their security systems. These case studies demonstrate our ability to handle complex technical infrastructure projects across different industries. Our team has worked with global clients, delivering solutions that prioritize security, data privacy, and scalability. While specific technical details aren't elaborated, our track record shows proven expertise in system improvements and security enhancements. We have received positive client feedback from both Example Corp and AnyCompany Networks, and we are prepared to provide additional references upon request."
      }
    },
    {
      "question": "How do you handle project management, and what tools do you use?",
      "response": {
        "completion": "At AnyCompany Tech, we utilize Product-1 and Product-2 as our primary project management tools, allowing us to create customized approaches for each client's unique needs. Our project management strategy is built on our dynamic team of experts and cutting-edge technology, which enables us to handle complex projects efficiently. We prioritize adaptability, scalability, and client-specific solutions, ensuring that our project management approach can accommodate projects of any scale. Our tools and processes are designed to provide transparent tracking, seamless communication, and effective resource allocation, which allows us to deliver high-quality results consistently. Additionally, we have dedicated teams for software updates, system maintenance, and data backup/recovery to support our robust project management infrastructure."
      }
    },
    {
      "question": "What are your support and maintenance services like?",
      "response": {
        "completion": "Our support and maintenance services are comprehensive and tailored to each client's unique needs. We provide a dedicated team that ensures all systems are up-to-date and running smoothly, with regular software updates and upgrades. Our solutions are designed to be scalable, allowing them to grow alongside your business. We have a robust data backup and recovery strategy that includes regular backups and a comprehensive recovery plan to ensure data safety and accessibility. Our technical infrastructure, combined with our expert team, allows us to deliver high-quality, secure, and efficient maintenance services. We use advanced project management tools like Product-1 and Product-2 to customize our approach, and we prioritize data privacy with stringent security measures. Our goal is to provide cost-effective solutions that meet and exceed client expectations, with a track record of successful projects and satisfied clients like Example Corp and AnyCompany Networks."
      }
    },
    {
      "question": "What is your pricing model?",
      "response": {
        "completion": "Our pricing model is flexible and tailored to each specific project. We aim to provide cost-effective solutions that are customized to meet the unique needs of our clients. The exact pricing depends on the scope, complexity, and requirements of the project, ensuring that we deliver value while addressing the specific challenges of each client's technological needs."
      }
    },
    {
      "question": "Can you provide references from other clients?",
      "response": {
        "completion": "Yes, we can provide references from clients like Example Corp and AnyCompany Networks, who have previously worked with us and expressed satisfaction with our services. We are more than happy to connect you with these references upon request, allowing you to gain direct insights into our capabilities, project performance, and customer satisfaction."
      }
    },
    {
      "question": "How do you ensure the scalability of your solution?",
      "response": {
        "completion": "We ensure scalability through a multi-faceted approach. Our solutions are intentionally designed to be flexible and adaptable, capable of growing alongside our clients' businesses. We leverage cutting-edge technology and our dynamic team of experts to create infrastructure that can handle increasingly complex projects. Our robust technical infrastructure allows us to seamlessly expand systems, and we use advanced project management tools like Product-1 and Product-2 to customize solutions that can scale effectively. Additionally, we have a dedicated team focused on software updates and upgrades, which helps maintain system performance and enables smooth expansion. Our comprehensive approach to scalability is further supported by our data backup and recovery strategies, ensuring system reliability and continuity as businesses grow. By prioritizing adaptability and having the technical expertise to implement scalable solutions, we can confidently meet the evolving needs of our clients across various industries."
      }
    },
    {
      "question": "What is your approach to data backup and recovery?",
      "response": {
        "completion": "Our data backup and recovery approach encompasses regular, comprehensive data backups and a robust recovery strategy. We understand the critical importance of data protection and have implemented a systematic process to ensure data safety and accessibility at all times. Our strategy involves creating multiple backup copies, storing them in secure, redundant locations, and maintaining a well-defined recovery protocol that allows for quick restoration of data in the event of any potential loss or system failure. This approach is designed to minimize downtime and protect our clients' valuable information, reflecting our commitment to maintaining the highest standards of data security and reliability."
      }
    }
  ],
  "outputDetails": {
    "truncated": false
  }
}

Clean up resources

To delete this solution, navigate to the State machines page on the Step Functions console, select your state machine, choose Delete, and enter delete to confirm. It will be marked for deletion and will be deleted when all executions are stopped.

RAG and other possible integrations

RAG is a strategy that enhances the output of a large language model (LLM) by allowing it to reference an authoritative external knowledge base, generating more accurate or secure responses. This powerful tool can extend the capabilities of LLMs to specific domains or an organization’s internal knowledge base without needing to retrain or even fine-tune the model.

A straightforward way to integrate RAG into the preceding RFP example is by adding a Bedrock Runtime Agents: Retrieve action task to your Map state before invoking the model. This enables queries to Amazon Bedrock Knowledge Bases, which supports various vector storage databases, including the Amazon OpenSearch Serverless vector engine, Pinecone, Redis Enterprise Cloud, and soon Amazon Aurora and MongoDB. Using Knowledge Bases to ingest and vectorize example RFPs and documents stored in Amazon S3 eliminates the need to include a description with the question array. Also, because a vector store can accommodate a broader range of information than a single prompt is able to, RAG can greatly enhance the specificity of the responses.

In addition to Amazon Bedrock Knowledge Bases, there are other options to integrate for RAG depending on your existing tech stack, such as directly with an Amazon Kendra Task state or with a vector database of your choosing through third-party APIs using HTTP Task states.

Step Functions offers composability, allowing you to seamlessly integrate over 9,000 AWS API actions from more than 200 services directly into your workflows. These optimized service integrations simplify the use of common services like AWS Lambda, Amazon Elastic Container Service (Amazon ECS), AWS Glue, and Amazon EMR, offering features such as IAM policy generation and the Run A Job (.sync) pattern, which automatically waits for the completion of asynchronous jobs. Another common pattern seen in generative AI applications is chaining models together to accomplish secondary tasks, like language translation after a primary summarization task is completed. This can be accomplished by adding another Bedrock: InvokeModel action task just as we did earlier.

Conclusion

In this post, we demonstrated the power and flexibility of Step Functions for orchestrating parallel calls to Amazon Bedrock. We explored two mapping strategies—inline and distributed—for processing small and large datasets, respectively. Additionally, we delved into a practical use case of answering a list of RFP questions, demonstrating how Step Functions can efficiently scale out and manage multiple Amazon Bedrock calls.

We introduced the concept of RAG as a strategy for enhancing the output of an LLM by referencing an external knowledge base and demonstrated multiple ways to incorporate RAG into Step Functions state machines. We also highlighted the integration capabilities of Step Functions, particularly the ability to invoke over 9,000 AWS API actions from more than 200 services directly from your workflow.

As next steps, explore the possibilities of application patterns offered by the GenAI Quick Start PoCs GitHub repo as well as various Step Functions integrations through sample project templates within Workflow Studio. Also, consider integrating RAG into your workflows to use your organization’s internal knowledge base or specific domain expertise.


About the Author

Dimitri Restaino is a Brooklyn-based AWS Solutions Architect specialized in designing innovative and efficient solutions for healthcare companies, with a focus on the potential applications of AI, blockchain and other promising industry disruptors. Off the clock, he can be found spending time in nature or setting fastest laps in his racing sim.

Read More

Build generative AI applications on Amazon Bedrock with the AWS SDK for Python (Boto3)

Build generative AI applications on Amazon Bedrock with the AWS SDK for Python (Boto3)

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that run tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

In this post, we demonstrate how to use Amazon Bedrock with the AWS SDK for Python (Boto3) to programmatically incorporate FMs.

Solution overview

The solution uses an AWS SDK for Python script with features that invoke Anthropic’s Claude 3 Sonnet on Amazon Bedrock. By using this FM, it generates an output using a prompt as input. The following diagram illustrates the solution architecture.

Prerequisites

Before you invoke the Amazon Bedrock API, make sure you have the following:

Deploy the solution

After you complete the prerequisites, you can start using Amazon Bedrock. Begin by scripting with the following steps:

  1. Import the required libraries:
import boto3
import json
  1. Set up the Boto3 client to use the Amazon Bedrock runtime and specify the AWS Region:
# Set up the Amazon Bedrock client
bedrock_client = boto3.client(
    	service_name="bedrock-runtime",
    region_name="us-east-1"
)
  1. Define the model to invoke using its model ID. In this example, we use Anthropic’s Claude 3 Sonnet on Amazon Bedrock:
# Define the model ID
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
  1. Assign a prompt, which is your message that will be used to interact with the FM at invocation:
# Prepare the input prompt.
prompt = "Hello, how are you?"

Prompt engineering techniques can improve FM performance and enhance results.

Before invoking the Amazon Bedrock model, we need to define a payload, which acts as a set of instructions and information guiding the model’s generation process. This payload structure varies depending on the chosen model. In this example, we use Anthropic’s Claude 3 Sonnet on Amazon Bedrock. Think of this payload as the blueprint for the model, and provide it with the necessary context and parameters to generate the desired text based on your specific prompt. Let’s break down the key elements within this payload:

  • anthropic_version – This specifies the exact Amazon Bedrock version you’re using.
  • max_tokens – This sets a limit on the total number of tokens the model can generate in its response. Tokens are the smallest meaningful unit of text (word, punctuation, subword) processed and generated by large language models (LLMs).
  • temperature – This parameter controls the level of randomness in the generated text. Higher values lead to more creative and potentially unexpected outputs, and lower values promote more conservative and consistent results.
  • top_k – This defines the number of most probable candidate words considered at each step during the generation process.
  • top_p – This influences the sampling probability distribution for selecting the next word. Higher values favor frequent words, whereas lower values allow for more diverse and potentially surprising choices.
  • messages – This is an array containing individual messages for the model to process.
  • role – This defines the sender’s role within the message (the user for the prompt you provide).
  • content – This array holds the actual prompt text itself, represented as a “text” type object.
  1. Define the payload as follows:
payload = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "temperature": 0.9,
    "top_k": 250,
    "top_p": 1,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }
    ]
}
  1. You have set the parameters and the FM you want to interact with. Now you send a request to Amazon Bedrock by providing the FM to interact with and the payload that you defined:
# Invoke the Amazon Bedrock model
response = bedrock_client.invoke_model(
    modelId=model_id,
    body=json.dumps(payload)
)
  1. After the request is processed, you can display the result of the generated text from Amazon Bedrock:
# Process the response
result = json.loads(response["body"].read())
generated_text = "".join([output["text"] for output in result["content"]])
print(f"Response: {generated_text}")

Let’s look at our complete script:

import boto3
import json

# Set up the Amazon Bedrock client
bedrock_client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"
)

# Define the model ID
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

# Prepare the input prompt
prompt = "Hello, how are you?"

# Create the request payload
payload = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "temperature": 0.9,
    "top_k": 250,
    "top_p": 1,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }
    ]
}

# Invoke the Amazon Bedrock model
response = bedrock_client.invoke_model(
    modelId=model_id,
    body=json.dumps(payload)
)

# Process the response
result = json.loads(response["body"].read())
generated_text = "".join([output["text"] for output in result["content"]])
print(f"Response: {generated_text}")

 

Invoking the model with the prompt “Hello, how are you?” will yield the result shown in the following screenshot.

Clean up

When you’re done using Amazon Bedrock, clean up temporary resources like IAM users and Amazon CloudWatch logs to avoid unnecessary charges. Cost considerations depend on usage frequency, chosen model pricing, and resource utilization while the script runs. See Amazon Bedrock Pricing for pricing details and cost-optimization strategies like selecting appropriate models, optimizing prompts, and monitoring usage.

Conclusion

In this post, we demonstrated how to programmatically interact with Amazon Bedrock FMs using Boto3. We explored invoking a specific FM and processing the generated text, showcasing the potential for developers to use these models in their applications for a variety of use cases, such as:

  • Text generation – Generate creative content like poems, scripts, musical pieces, or even different programming languages
  • Code completion – Enhance developer productivity by suggesting relevant code snippets based on existing code or prompts
  • Data summarization – Extract key insights and generate concise summaries from large datasets
  • Conversational AI – Develop chatbots and virtual assistants that can engage in natural language conversations

Stay curious and explore how generative AI can revolutionize various industries. Explore the different models and APIs and run comparisons of how each model provides different outputs. Find the model that will fit your use case and use this script as a base to create agents and integrations in your solution.


About the Author

Merlin Naidoo is a Senior Technical Account Manager at AWS with over 15 years of experience in digital transformation and innovative technical solutions. His passion is connecting with people from all backgrounds and leveraging technology to create meaningful opportunities that empower everyone. When he’s not immersed in the world of tech, you can find him taking part in active sports.

Read More

Improve factual consistency with LLM Debates

Improve factual consistency with LLM Debates

In this post, we demonstrate the potential of large language model (LLM) debates using a supervised dataset with ground truth. In this LLM debate, we have two debater LLMs, each one taking one side of an argument and defending it based on the previous arguments for N(=3) rounds. The arguments are saved for a judge LLM to review. After N(=3) rounds, the same judge LLM with no access to original dataset but only with the LLM arguments decides which side is correct.

One challenging use case that can be addressed using this technique is scaling up the ground truth curation/alignment process for unsupervised and raw datasets. We can start with human annotation for labelling ground truth, but it can be expensive, slow, hard to scale, and may not reach consensus. We can also use this LLM debate generated synthetic ground truth data to build and pre-train larger and more powerful LLMs.

This post and the subsequent code implementation were inspired by one of the International Conference on Machine Learning (ICML) 2024 best papers on LLM debates Debating with More Persuasive LLMs Leads to More Truthful Answers. It uses a different dataset, TofuEval.

Note that the question asked to the judge LLM for every technique is always the same: `Which one of these summaries is the most factually consistent one?” The answer is binary. Either Summary A or summary B is correct. For each of these techniques, the same judge LLM is used to give the final answer.

The LLM debating technique can be more factually consistent (truthful) over existing methods like LLM consultancy and standalone LLM inferencing with self-consistency. To demonstrate this, we compare each of the four techniques mentioned below in this post:

  1. Naive Judge: This standalone LLM has no access to the transcript, but only the question and two summaries. It is used to measure the baseline performance on pre-trained LLM knowledge.
  2. Expert Judge: This LLM has access to the transcript along with the question and two summaries.
  3. LLM Consultancy: The standalone LLM defends one side of the summary choice for N(=3) rounds, expanding in more depth why it thinks it is correct in selecting the summary choice. After 3 rounds, a judge LLM with no access to transcript but only the LLM defense notes decides which summary choice is correct.
  4. LLM Debates: 2 LLMs each take one side of the argument and defends it based on the previous arguments for 3 rounds. After 3 rounds, a judge LLM with no access to the transcript but only with the LLM arguments decides which summary choice is correct.

As an overall solution, we use Amazon Sagemaker and Amazon Bedrock to invoke the different types of LLMs for each technique.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can quickly experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don’t have to manage the infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

Use-case overview

The overall task of each of the four techniques is to choose which one of the two summaries is most appropriate for a given transcript. There is a total of 10 transcripts and each transcript has 2 summaries – one correct and the other incorrect. Refer to the dataset section of this post for the generation details. The incorrect summaries have various classes of errors like Nuanced Meaning Shift, Extrinsic Information and Reasoning errors.

In this post, we navigate the LLM debating technique with persuasive LLMs having two expert debater LLMs (Anthropic Claude 3 Sonnet and Mixtral 8X7B) and one judge LLM (Mistral 7B v2 to measure, compare, and contrast its performance against other techniques like self-consistency (with naive and expert judges) and LLM consultancy.

The choice of judge and all other candidate LLMs can be varied from very small to large LLMs (based on model parameters) based on the nature of the use case, task complexity, dataset, and cost incurred. In this post, we have used at least 7B or greater parameter LLMs to demonstrate the overall efficacy of each technique as well as keeping cost in mind. It is possible to choose smaller LLMs depending on the task complexity; For example, if complex common-sense reasoning is not involved, we can choose Claude Haiku over Sonnet. Depending on the use-case, task complexity, dataset, and budget constraints, LLMs can be switched out to observe the performance changes (if any). The model cards for each LLM also serve as a good starting point to understand at which ML tasks each LLM excels. We recommend that these experiments along with choosing LLMs are tried out over diverse smaller subsets of the original dataset before scaling up.

To demonstrate the measurement and improvement of factual consistency (veracity) with explainability, we conduct a series of experiments with each of the four techniques to choose the best summary for each transcript. In each experiment with a different technique, we measure the factual consistency of the summaries generated from the transcripts and improve upon the decision to choose the correct one via methods like LLM consultancy and LLM debates.

The following question is repeated for all 3 rounds:

"Which one of these summaries is the most factually consistent one?"

Dataset

The dataset for this post is manually distilled from the Amazon Science evaluation benchmark dataset called TofuEval. For this post, 10 meeting transcripts have been curated from the MediaSum repository inside the TofuEval dataset. Details on the exact dataset can be found in the GitHub repository.

MediaSum is a large-scale media interview dataset containing 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

We use the following AWS services:

In the following sections, we demonstrate how to use the GitHub repository to run all of the techniques in this post.

Setup Prerequisites

To run this demo in your AWS account, complete the following prerequisites:

  1. Create an AWS account if you don’t already have one.
  2. Clone the GitHub repository and follow the steps explained in the README.
  3. Set up a SageMaker notebook using an AWS CloudFormation template, available in the GitHub repository. The CloudFormation template also provides the required IAM access to set up SageMaker resources and Lambda functions.
  4. Acquire access to models hosted on Amazon Bedrock. Choose Manage model access in the navigation pane on the Amazon Bedrock console and choose from the list of available options. We are invoking Anthropic Claude 3 Sonnet, Mistral 7B, and Mixtral 8X7B using Amazon Bedrock for this post.

Solution overview

In this section, we will deep-dive into each of the four techniques being compared against each other.

  1. Naive Judge
  2. Expert Judge
  3. LLM Consultancy
  4. LLM Debates

Details of prompt used for each technique can be found here

Commonalities across all four techniques

  1. Each question is repeated for 3 rounds. This is to introduce LLM self-consistency. The majority answer is deemed correct.
  2. We flip the side of the argument the LLM takes for each round. This accounts for errors due to position bias (choosing an answer due to its order/position) and verbosity bias (one answer longer than the other).

Part 1: Standalone LLMs

In , we use a standalone LLM Mistral 7B to find out which of the two summaries is more factually consistent. There are 2 techniques: naïve judge and expert judge.

Technique 1: (Naive judge)

This standalone LLM chooses on one of the two summaries as the more factually consistent answer. It is used to measure the baseline performance on this dataset for a pretrained LLM like Mistral 7B. The visualization of the naive judge technique is as follows:

In this image, we depict the flow of Naive LLM judge. First the naive judge LLM has NO access to transcripts just the question and two summaries to choose from as the more factually consistent. Next the naive judge makes a random guess which of the two summaries are more factually consistent for 3 rounds. Majority answer is chosen based on self-consistency technique.

Naive Judge LLM

Prompt template for Naïve Judge

standalone_naive_mistral = """
[INST]
User: You are a super-intelligent AI tasked to answer a reading comprehension question with two possible answers. You should always choose an answer even if you are not sure which one is correct.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

1. Your task is to choose the answer you think is most likely to be correct.
2. Think step by step through your reasoning and output your reasoning.
3. Output the answer text you chose as correct.
4. Then, on a new line, give your final answer formatted as ”Final Answer: A|B”.
[/INST]
"""
CNN-1041291Which one of these summaries is the most factually consistent one? Answer A:GM lost $10.6B in 2005, continues losing market share. Answer B:GM lost $10B in 2005, continues losing market share.Reasoning: Both answers state that General Motors (GM) lost money in 2005 and continued losing market share. The only difference between the two answers is the exact amount of money GM lost in 2005. The difference is $0.6B. Given that the difference is relatively small compared to the overall amount lost, and considering that rounding errors or slight discrepancies in reporting financial figures are common, I would consider both answers factually consistent. However, since the task asks for the most factually consistent answer, I would choose the answer that is closer to the reported figure. Answer text I chose as correct: Answer A: GM lost $10.6B in 2005, continues losing market share. Final Answer: A. Reasoning explanation: Although both answers are factually consistent, I chose answer A because it is closer to the reported figure of GM's losses in 2005. The difference between the two answers is $0.6B, which is a relatively small amount compared to the overall loss. However, it is still important to choose the answer that is as close to the reported figure as possible to maintain factual consistency.

Naive Judge LLM Response

For each question, we ask the LLM number_of_rounds=3 times to follow a self-consistency paradigm.

for round_number in range(number_of_rounds):
        logger.info(f"START OF Naive Judge Round #{round_number + 1} for debate_id {debate_id} >>>>>> n")
        judge_response = invoke_mistral_standalone_naive(
            debate_id = debate_id,
            question = question,
            answer_a = answer_a,
            answer_b = answer_b
        )

        naive_judge_per_round.append(extract_final_answer(judge_response, flipped=False))

Technique 2: (Expert judge)

Mistral 7B now becomes an expert judge with access to the transcripts and chooses which of the two summaries is the more factually consistent one. The visualization of the expert judge technique is as follows:

In this image, we depict the flow of LLM Expert Judge. First the expert Judge LLM has access to transcripts along with the question and two summaries to choose from as more factually consistent. Next the expert judge uses the transcript contents to decide which of the two summaries are more factually consistent for 3 rounds. Majority answer is chosen based on self-consistency technique

Expert Judge LLM Response

Prompt template for expert judge:

standalone_expert_mistral = """
[INST]
Human: You are a super-intelligent AI and you will be shown a story, followed by a question and two possible answers.

Here is the complete story, use it to answer the question below:

{complete_interview}


Here is the question and the two answers: Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

1. Your task is to choose the answer you think is most likely to be correct. 
2. Think step by step about which answer is likely to be correct. Output your thoughts.
3. Output the answer text you chose as correct.
4. Finally output your final answer formatted as ”Final Answer: A|B”.
Assistant:
[/INST]
"""
Debate IDRound #Task QuestionJudge Response 0CNN-1041291Which one of these summaries is the most factually consistent one? Answer A:GM lost $10.6B in 2005, continues losing market share. Answer B:GM lost $10B in 2005, continues losing market share.1. My task is to choose the answer that is most factually consistent with the information provided in the story. 2. I have carefully read the story and both answers. The story states that General Motors reported a loss of $10.6 billion in 2005. Therefore, the correct answer is: 3. Answer A: GM lost $10.6B in 2005, continues losing market share. 4. Final Answer: A.

Expert Judge LLM response

For each question, we ask the LLM number_of_rounds=3 times to follow a self-consistency paradigm.


for round_number in range(number_of_rounds):

        logger.info(f"Expert Judge Round #{round_number + 1} >>>>>> n")
        judge_response = invoke_mistral_standalone_expert(
            debate_id = debate_id,
            question = question,
            answer_a = answer_a,
            answer_b = answer_b,
            complete_interview = complete_interview_transcript
        )
        expert_judge_per_round.append(extract_final_answer(judge_response, flipped=False))
        logger.info(f">>>>>>> judge_response Round #{round_number + 1}>>>>> ::  {judge_response}")

Technique 3: (LLM consultancy)

In , we use Anthropic Claude 3 Sonnet as an LLM consultant for both sides of the answers separately. In other words, in the first experiment the LLM consultant defends answer A for N(=3) and in the second experiment defends answer B for the N(=3) rounds. We take the average accuracy of both the experiments as final factual consistency accuracy. (Refer to the evaluation metrics section for accuracy definition) This continues for N(=3 in this notebook) rounds. We flip the argument sides for the consultant LLM and take the average of the experiments results as final accuracy. Refer to the Evaluation section to see how we calculate this accuracy.

The visualization of the LLM consultancy technique is as follows:

In this image, we depict the flow of LLM Consulancy. First a consultant LLMs is assigned a side to defend.They persuade the judge why their choice of summary is correct based on transcript contents. Next each consultation from the LLM is saved to a file and the consultant picks up the entire rationale history before posting their next thought. Finally, Once all 3 rounds of consultancy are over, the Judge LLM reads all the content and decides whether to agree or disagree with the consultant.

LLM Consultancy judge

Prompt template for LLM consultancy

consultancy_mistral_judge_template = """
[INST]
User: You are an expert judge in a consultancy game where a consultant argues for the correct or incorrect answer to a reading comprehension question based on a short story. Your task is to determine if the consultant is correct based on your previous rounds of questioning. However, you haven’t read the story, making your task challenging. To guide you, the consultant can include direct quotes from the story. Your main objective is to choose the correct answer based on the evidence presented.
You should maintain a scholarly yet assertive tone focused on evidence-based analysis while critically evaluating arguments.
Here is some important information regarding the quote system which is applied after the debaters have finished their arguments:
• Verified quotes are direct string matches from the story and are enclosed in  tags.
• Unverified quotes are not direct string matches from the story, so they should not be trusted and are enclosed in tags.
• Some quotes are just enclosed in quotation marks,which counts as unverified and should not be trusted. Here is the question and the two answers:
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Here’s the current transcript of the consultancy:

{current_consultancy_transcript}

Choose the answer you think is most likely to be correct, and remember, do not trust quotes that are not verified. Think about your answer in  tags by carrying out these steps:
1. Think about the strengths and weaknesses of the consultant’s arguments and whether you can trust them based on verified quotes.
2. Think step by step about which answer is likely to be correct. Output your thoughts.
3. Output the answer text you chose as correct.
4. Then, on a new line, format your final answer as "Final Answer: A|B".
[/INST]
"""
Debate IDRound #Task QuestionJudge Response 0CNN-1041291Which one of these summaries is the most factually consistent one? Answer A:GM lost $10.6B in 2005, continues losing market share. Answer B:GM lost $10B in 2005, continues losing market share.<thinking> The consultant's arguments focus on the financial losses of General Motors in 2005 and the company's continuing market share decline. The consultant provides a direct quote from the story stating that General Motors reported a $10.6 billion loss in 2005. The consultant argues that rounding down this figure to $10 billion is a reasonable approximation and more accurate than stating the precise figure in the competing summary. The consultant also emphasizes the importance of including the context that General Motors is continuing to lose market share. The competing summary only mentions that General Motors lost $10.6 billion in 2005 without mentioning the continuing market share decline. Based on the evidence presented, the consultant's summary seems more factually consistent as it includes the context of General Motors' continuing market share decline and provides a reasonable approximation of the reported loss. Final Answer: A. GM lost $10B in 2005, continues losing market share.

LLM Consultancy Judge response

For each question, we ask the LLM number_of_rounds=3 times to follow a self-consistency paradigm.

for round_number in range(number_of_rounds):
        logger.info(f"START Consultancy with Claude Debate_id {debate_id} Round #{round_number + 1} >>>>>> n")

        consultancy_response = invoke_claude_v3(debate_id = debate_id, 
                         round_number = round_number + 1,
                         question=question,
                         summary_defending = claude_defending_summary, 
                         summary_opposing = claude_opposing_summary, 
                         complete_interview = complete_interview_transcript,
                         debate=False
                         )

Technique 4: (LLM Debate)

In , we use Anthropic Claude 3 Sonnet as the first debater and Mixtral 8X7B as the second debater with Mistral 7b as the judge. We let each debater argue their side for N(=3) rounds. Each round of debate is saved in a file. For the next round, each debater continues to defend their side based on the previous round’s argument. Once N(=3) rounds are over, the judge LLM uses only these arguments to decide which side is better. Now we flip Anthropic Claude 3 Sonnet (LLM-1) and Mixtral 8X7B (LLM-2) argument sides in both of the experiments and take the average of the experiment results as final accuracy. Refer to the Evaluation section to see how we calculate this accuracy.

The visualization of the LLM debate technique is as follows:

In this image, we depict the flow of LLM Debate. First Debater LLMs like Claude and Mixtral argue their side based on transcript contents. Next each argument is saved to a file and the next debater picks up the entire argument history before posting their next argument. Finally, once all 3 rounds of arguments are over, the Judge LLM reads all the arguments and decides which summary is the most factually consistent answer.

LLM Debate Judge

Prompt template for judge LLM

mistral_debate_judge_template = """
[INST]
User: You are an expert judge in a debate where the participants are arguing over the correct answer to a reading comprehension question based on a short story. Your task is to determine which debater is correct. However, you haven’t read the story, making your task challenging. To guide you, debaters can include direct quotes from the story. Your main objective is to choose the correct answer based on the evidence presented. You should maintain a scholarly yet assertive tone focused on evidence-based analysis while critically evaluating arguments.
Here is some important information regarding the quote system which is applied after the debaters have finished their arguments:
• Verified quotes are direct string matches from the story and are enclosed in quote tags.
• Unverified quotes are not direct string matches from the story, so they should not be trusted and are enclosed in quote tags.
• Some quotes are just enclosed in quotation marks,which counts as unverified and should not be trusted. Here is the question and the two answers:
Question: "{question}"
Answer A: {answer_a}
Answer B: {answer_b}
{debater_a} is arguing for answer A, and {debater_b} is arguing for answer B.
Here’s the current transcript of the debate:

{current_debate_transcript}

Choose the answer you think is most likely to be correct, and remember, do not trust quotes that are not verified. Think about your answer in  tags by carrying out these steps:
1. Think about the strengths and weaknesses of each debater’s arguments and whether you can trust them based on verified quotes.
2. Think step by step about which answer is likely to be correct. Output your thoughts.
3. Output the answer you chose as correct.
4. Then, on a new line, format your final answer as ”Final Answer: A or B”.
[/INST]
"""
Debate IDRound #Task QuestionJudge Response 0CNN-1041291Which one of these summaries is the most factually consistent one? Answer A:GM lost $10.6B in 2005, continues losing market share. Answer B:GM lost $10B in 2005, continues losing market share.<thinking> Both debaters make valid points, but the precision of the numbers in question is a significant difference. Debater expert_debater_mixtral_8_7B correctly quotes the exact loss reported by the story, while debater expert_debater_sonnet_v3 approximates the loss with $10B. However, the story's emphasis seems to be on the dramatic decline and massive losses of GM in 2005, rather than the exact figure. Both summaries accurately reflect GM's continuing market share loss. Given this context, I believe debater expert_debater_sonnet_v3's approximation is more factually consistent with the story's key details, as it captures the essence of GM's financial struggles in 2005. </thinking> Final Answer: A Answer A: GM lost $10B in 2005, continues losing market share.

LLM Debate Judge Response

For each question, we ask the LLM number_of_rounds=3 times to follow a self-consistency paradigm.

for round_number in range(number_of_rounds):
        print(f"=========== START OF 2 model DEBATE debate_id {debate_id} Round #1..{round_number + 1} ======= n")
        logger.info(f"START Debate with Claude Debate_id {debate_id} Round #{round_number + 1} >>>>>> n") 
        claude_debate_response = invoke_claude_v3(debate_id = debate_id,
                         question=question,
                         round_number = round_number + 1,
                         summary_defending = claude_defending_summary, 
                         summary_opposing = claude_opposing_summary, 
                         complete_interview = complete_interview_transcript,
                         debate=True
                         )

        logger.info(f" >>>>> claude_debate_response Round #{round_number + 1} >>>>> {claude_debate_response}")
        logger.info(f"END Debate with Claude Round #{round_number + 1} >>>>>> n")

        mixtral_debate_response = invoke_mistral(debate_id = debate_id,
                     question=question,
                     round_number = round_number + 1,
                     summary_defending = mixtral_defending_summary, 
                     summary_opposing = mixtral_opposing_summary, 
                     complete_interview = complete_interview_transcript, 
                     )

        logger.info(f" >>>>> mixtral_debate_response Round #{round_number + 1} >>>>> {mixtral_debate_response}")
        logger.info(f"END Debate with Mixtral Round #{round_number + 1} >>>>>> n")

Evaluation Metrics

Factual Consistency Accuracy (for all techniques):

For each question in every technique, the judge chooses whether summary A or B is True. As mentioned above, we also flip the position of summary A and B and repeat the same question to the same LLM. At the end of a run, we define the factual consistency accuracy as the number of times the judge chose the same answer regardless of its position being flipped (to account for position bias, verbosity bias, or random guess).

factual_consistency_accuracy = find_number_of_matching_elements(judge_regular_answers, judge_flipped_answers)/total_data_points

Finally, we compare the accuracy of each technique against each other.

Win rate per LLM (this metric only applies to LLM debates):

For the LLM debate, we can calculate the win rate of the LLM debaters to evaluate which of the LLMs got most of the answers right as adjudicated by the judge LLM. With this win rate of expert models, we empirically understand which LLM as a debater is more successful than the other. This metric may be used to choose one LLM over the other given a particular use case and dataset.

claude_avg_win_rate, mixtral_avg_win_rate = get_win_rate_per_model(debate_judge_regular_answers, debate_judge_flipped_answers)

Details about the win rate per model can be found in the GitHub repository here.

Cost considerations

The following are important cost considerations:

Conclusion

In this post, we demonstrated how LLM debate is a technique that can improve factual consistency. While it can be expensive to use three LLMs (two debaters and one judge), a potential direction could be scaling up the ground truth curation/alignment process for unsupervised/raw datasets for fine-tuning existing LLMs and building new LLMs.

From the examples in each of the techniques, we see the interpretability and rationale used by the LLMs in getting to the final answer. The naïve judge technique establishes a lower threshold of performance whereas the LLM debate technique is the most verbose providing a detailed explanation of how it got to the final answer. The expert judge technique outperforms the naïve judge and the LLM consultancy technique does better than the expert judge as shown in the figure below.

For many repeated runs across this small subset of TofuEval dataset, we observe the LLM debating technique out-performing the other techniques mentioned in this post. One entire end-to-end run snapshot of performance is as follows:

bar graph, x = Experiment Type, y = Accuracy. Values are Naive Judge = 0.1, Expert Judge=0.4, LLM Consultancy=0.5, LLM Debate=0.7

Compare accuracies across all four techniques

Depending on the use case and dataset volume, while we can start with human annotation, it can quickly become expensive, slow, and disagreement amongst human annotators can add layers of complexity. A scalable oversight direction could be this LLM debating technique to align on the ground truth options via this debating and critique mechanism thereby establishing factual consistency. However, before scaling up this technique for your use case, it is necessary to compare the LLM debate performance against human annotation over a diverse subset of the domain-specific dataset.

Readers are highly encouraged to switch LLMs that are apt for their use case with this debating technique. LLM debates need to be calibrated and aligned with human preference for the task and dataset. You can use Amazon SageMaker Ground Truth for labeling jobs to record human preferences with their own private skilled work teams or use Amazon SageMaker Ground Truth Plus for a fully managed experience for this human alignment task.

To learn more about customizing models with Amazon Bedrock, see Customize your model to improve its performance for your use case.

Acknowledgements

The author thanks all the reviewers for their valuable feedback.


About the Author

Image of Author

Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Read More

Governing the ML lifecycle at scale, Part 3: Setting up data governance at scale

Governing the ML lifecycle at scale, Part 3: Setting up data governance at scale

This post is part of an ongoing series about governing the machine learning (ML) lifecycle at scale. To view this series from the beginning, start with Part 1. This post dives deep into how to set up data governance at scale using Amazon DataZone for the data mesh. The data mesh is a modern approach to data management that decentralizes data ownership and treats data as a product. It enables different business units within an organization to create, share, and govern their own data assets, promoting self-service analytics and reducing the time required to convert data experiments into production-ready applications. The data mesh architecture aims to increase the return on investments in data teams, processes, and technology, ultimately driving business value through innovative analytics and ML projects across the enterprise.

Organizations spanning various industries are progressively utilizing data and ML to drive innovation, enhance decision-making processes, and gain a competitive advantage. However, as data volumes and complexity continue to grow, effective data governance becomes a critical challenge. Organizations must make sure their data assets are properly managed, secured, and compliant with regulatory requirements, while also enabling seamless access and collaboration among various teams and stakeholders.

This post explores the role of Amazon DataZone, a comprehensive data management and governance service, in addressing these challenges at scale. We dive into a real-world use case from the financial services industry, where effective marketing campaigns are crucial for acquiring and retaining customers, as well as cross-selling products. By taking advantage of the data governance capabilities of Amazon DataZone, financial institutions like banks can securely access and use their comprehensive customer datasets to design and implement targeted marketing campaigns tailored to individual customer needs and preferences.

We explore the following key aspects:

  • Traditional challenges in data management and governance across multiple systems and accounts
  • The benefits of Amazon DataZone in simplifying data governance and enabling seamless data sharing
  • A detailed use case on using governed customer data for effective marketing campaigns in the banking and financial services industry
  • The reference architecture for a multi-account ML platform, highlighting the role of Amazon DataZone in the data management and governance layer
  • Step-by-step guidance on setting up Amazon DataZone in a multi-account environment, including account setup, blueprint enablement, user management, and project configuration for data publishers and subscribers

By the end of this post, you will have a comprehensive understanding of how Amazon DataZone can empower organizations to establish centralized data governance, enforce consistent policies, and facilitate secure data sharing across teams and accounts, ultimately unlocking the full potential of your data assets while maintaining compliance and security.

Challenges in data management

Traditionally, managing and governing data across multiple systems involved tedious manual processes, custom scripts, and disconnected tools. This approach was not only time-consuming but also prone to errors and difficult to scale. Organizations often struggled with the following challenges:

  • Discovering data assets scattered everywhere
  • Enforcing consistent data policies and access controls
  • Understanding data lineage and dependencies
  • A lack of centralized data governance, leading to data silos, compliance issues, and inefficient data utilization

Amazon DataZone solves these problems by providing a comprehensive solution for data management and governance:

  • You can automatically discover and catalog data assets across multiple AWS accounts and virtual private clouds (VPCs)
  • It allows you to define and enforce consistent governance policies, track data lineage, and securely share data with fine-grained access controls—all from a single platform
  • Amazon DataZone integrates with AWS Identity and Access Management (IAM) for secure access management, making sure only authorized users and applications can access data assets based on their roles and permissions
  • With Amazon DataZone, organizations gain better visibility, control, and governance over their data, enabling informed decision-making, better compliance with regulations, and unlocking the full potential of their data

Use case

In the competitive banking and financial services industry, effective marketing campaigns are crucial for acquiring new customers, retaining existing ones, and cross-selling products. With the data governance capabilities of Amazon DataZone, banks can securely access and use their own comprehensive customer datasets to design and implement targeted marketing campaigns for financial products, such as certificates of deposit, investment portfolios, and loan offerings. In this post, we discuss how banks can establish a centralized data catalog, enabling data publishers to share customer datasets and marketing teams to subscribe to relevant data using Amazon DataZone.

The following diagram gives a high-level illustration of the use case.

The diagram shows several accounts and personas as part of the overall infrastructure. In the given use case of using Amazon DataZone for effective marketing campaigns in the banking and financial services industry, the different accounts serve the following functions:

  • Management account – This account manages organization-level functions, such as defining the organizational structure, provisioning new accounts, managing identities and access (identity management), implementing security and governance best practices, and orchestrating the creation of the landing zone (a secure and compliant environment for workloads). For example, in the bank marketing use case, the management account would be responsible for setting up the organizational structure for the bank’s data and analytics teams, provisioning separate accounts for data governance, data lakes, and data science teams, and maintaining compliance with relevant financial regulations.
  • Data governance account – This account hosts the central data governance services provided by Amazon DataZone. It serves as the hub for defining and enforcing data governance policies, data cataloging, data lineage tracking, and managing data access controls across the organization. For instance, for our use case, the data governance account would be used to define and enforce policies around customer data privacy, data quality rules for customer datasets, and access controls for sharing customer data with the marketing team.
  • Data lake account (producer) – There can be one or more data lake accounts within the organization. We discuss this in more detail later in this post.
  • Data science team account (consumer) – There can be one or more data science team accounts or data consumer accounts within the organization. We provide additional information later in this post.

By separating these accounts and their responsibilities, the organization can maintain a clear separation of duties, enforce appropriate access controls, and make sure data governance policies are consistently applied across the entire data lifecycle. The data governance account, acting as the central hub, enables seamless data sharing and collaboration between the data producers (data lake accounts) and data consumers (data science team accounts), while meeting data privacy, security, and compliance requirements.

Solution overview

The following diagram illustrates the ML platform reference architecture. The functional architecture and its capabilities are implemented with a number of AWS services, including AWS Organizations, Amazon SageMaker, AWS DevOps services, and a data lake. For a detailed discussion of the architecture, refer to Part 1 of this series. In this post, we focus on the highlighted Amazon DataZone section.


The data management function is organized across the data lake accounts (producers) and the data science team accounts (consumers).

The data lake accounts are responsible for storing and managing the enterprise’s raw, curated, and aggregated datasets. Data engineers and data publishers work within these accounts to ingest, process, and publish data assets that can be consumed by other teams, such as the marketing team or data science teams. In the bank marketing use case, the data lake accounts would store and manage the bank’s customer data, including raw data from various sources, curated datasets with customer profiles, and aggregated datasets for marketing segmentation.

As producers, data engineers in these accounts are responsible for creating, transforming, and managing data assets that will be cataloged and governed by Amazon DataZone. They make sure data is produced consistently and reliably, adhering to the organization’s data governance rules and standards set up in the data governance account. Data engineers contribute to the data lineage process by providing the necessary information and metadata about the data transformations they perform.
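
As an illustration of the producer workflow, the following sketch shows how a data engineer in a data lake account might create a producer project and a Glue-backed data source that publishes curated customer tables into the DataZone catalog. The domain ID, environment ID, role ARN, and database name are hypothetical placeholders, and the exact configuration fields may vary with your setup.

```python
import boto3

datazone = boto3.client("datazone", region_name="us-east-1")

DOMAIN_ID = "dzd_example123"        # placeholder domain ID
PRODUCER_ENV_ID = "env_example456"  # placeholder data lake environment ID

# 1. Producer project owned by the data lake account's data engineers
project = datazone.create_project(
    domainIdentifier=DOMAIN_ID,
    name="customer-data-producer",
    description="Publishes curated customer datasets for marketing",
)

# 2. Data source that crawls a Glue database and imports the discovered tables
data_source = datazone.create_data_source(
    domainIdentifier=DOMAIN_ID,
    projectIdentifier=project["id"],
    environmentIdentifier=PRODUCER_ENV_ID,
    name="curated-customer-360",
    type="GLUE",
    publishOnImport=True,  # publish imported assets to the catalog automatically
    configuration={
        "glueRunConfiguration": {
            "dataAccessRole": "arn:aws:iam::222222222222:role/DataZoneGlueAccessRole",  # placeholder
            "relationalFilterConfigurations": [
                {"databaseName": "curated_customer_db"}  # placeholder Glue database
            ],
        }
    },
)

# 3. Trigger an on-demand run so the assets land in the catalog
datazone.start_data_source_run(
    domainIdentifier=DOMAIN_ID,
    dataSourceIdentifier=data_source["id"],
)
```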

Amazon DataZone plays a crucial role in maintaining data lineage information, enabling traceability and impact analysis of data transformations across the organization. It handles the actual maintenance and management of lineage information, using the metadata provided by data engineers about their transformations.
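
As a rough sketch of how that metadata can reach Amazon DataZone, the example below emits an OpenLineage-style run event for a single transformation step, assuming the domain has data lineage enabled and that the PostLineageEvent API accepts an OpenLineage RunEvent payload. The domain ID, job name, and dataset names are hypothetical placeholders.

```python
import json
import uuid
import datetime
import boto3

datazone = boto3.client("datazone", region_name="us-east-1")

# A minimal OpenLineage-style RunEvent describing one transformation step.
# Dataset names, namespaces, and the job name are illustrative placeholders.
run_event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "bank-marketing", "name": "curate_customer_profiles"},
    "inputs": [{"namespace": "glue", "name": "raw_customer_db.customers"}],
    "outputs": [{"namespace": "glue", "name": "curated_customer_db.customer_360"}],
    "producer": "https://example.com/etl",  # placeholder producer URI
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

# Send the event to DataZone so it can stitch it into the domain's lineage view
datazone.post_lineage_event(
    domainIdentifier="dzd_example123",  # placeholder domain ID
    event=json.dumps(run_event),
)
```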

The data science team accounts are used by data analysts, data scientists, or marketing teams to access and consume the published data assets from the data lake accounts. Within these accounts, they can perform analyses, build models, or design targeted marketing campaigns using the governed and curated datasets made available through the data sharing and access control mechanisms of Amazon DataZone. For example, in the bank marketing use case, the data science team accounts would be used by the bank’s marketing teams to access and analyze customer datasets, build predictive models for targeted marketing campaigns, and design personalized financial product offerings based on the shared customer data.
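
To illustrate the consumer side, the following sketch shows how a marketing or data science project might search the catalog for a published customer dataset and request a subscription to it. The domain ID, project ID, search text, and the response field names used to extract the listing ID are best-effort assumptions for illustration.

```python
import boto3

datazone = boto3.client("datazone", region_name="us-east-1")

DOMAIN_ID = "dzd_example123"            # placeholder domain ID
CONSUMER_PROJECT_ID = "prj_marketing1"  # placeholder marketing/data science project ID

# 1. Search the catalog for published customer datasets
results = datazone.search_listings(
    domainIdentifier=DOMAIN_ID,
    searchText="customer 360",
    maxResults=10,
)
listing_id = results["items"][0]["assetListing"]["listingId"]  # first matching listing

# 2. Request a subscription on behalf of the marketing team's project
request = datazone.create_subscription_request(
    domainIdentifier=DOMAIN_ID,
    requestReason="Certificate of deposit campaign segmentation",
    subscribedListings=[{"identifier": listing_id}],
    subscribedPrincipals=[{"project": {"identifier": CONSUMER_PROJECT_ID}}],
)

# 3. An approver later accepts the request to grant access
datazone.accept_subscription_request(
    domainIdentifier=DOMAIN_ID,
    identifier=request["id"],
)
```

In practice, the approval in step 3 would typically be performed by an owner of the publishing project, either in the DataZone portal or through the API.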

Using Amazon DataZone in a multi-account ML platform

You can find practical, step-by-step instructions for implementing this setup in module 2 of the AWS Multi-Account Data & ML Governance Workshop, which provides detailed guidance on setting up Amazon DataZone in the central governance account.

Conclusion

Effective governance is crucial for organizations to unlock their data’s potential while maintaining compliance and security. Amazon DataZone provides a comprehensive solution for data management and governance at scale, automating complex tasks like data cataloging, policy enforcement, lineage tracking, and secure data sharing.

As demonstrated in the financial services use case, Amazon DataZone empowers organizations to establish a centralized data catalog, enforce consistent governance policies, and facilitate secure data sharing between data producers and consumers. Financial institutions can use Amazon DataZone to gain a competitive edge by designing and implementing effective, tailored marketing campaigns while adhering to data privacy and compliance regulations.

The multi-account ML platform architecture, combined with Amazon DataZone and other AWS services, provides a scalable and secure foundation for governing data and ML workflows effectively. By following the outlined steps, you can streamline the setup and management of Amazon DataZone, enabling seamless collaboration between stakeholders involved in the data and ML lifecycle.

As data generation and utilization continue to grow, robust data governance solutions become paramount. Amazon DataZone offers a powerful approach to data management and governance, empowering organizations to unlock their data’s true value while maintaining the highest standards of security, compliance, and data privacy.


About the Authors

Ajit Mungale is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML, generative AI, IoT, and .NET technologies. At AWS, he helps customers build, migrate, and create new cost-effective cloud solutions. He has extensive experience developing distributed applications and has worked with multiple cloud platforms. With his deep technical knowledge and business understanding, Ajit guides organizations in leveraging the full capabilities of the cloud.

Ram Vittal is a Principal Generative AI Solutions Architect at AWS. He has over three decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, and reliable AI/ML and big data solutions that help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his sheep-a-doodle!

Read More