Why Workforce Development Is Key to Reaping AI Benefits

AI is changing industries and economies worldwide.

Workforce development is central to ensuring the changes benefit all of us, as Louis Stewart, head of strategic initiatives for NVIDIA’s global developer ecosystem, explains in the latest AI Podcast.

“AI is fueling a lot of change in all ecosystems right now,” Stewart said. “It’s disrupting how we think about traditional economic development — how states and countries plan, how they stay competitive globally, and how they develop their workforces.”

Providing AI education, embracing the technology and addressing workforce challenges are all critical for future workplace development.

“It starts with education,” Stewart said.

AI Education Crucial at All Levels

Educating people on what AI can do, and how the current generation of AI-powered tools work, is the starting point. AI education must come at all levels, according to Stewart — however, higher education systems, in particular, need to be thinking about what’s coming next, so graduating students can optimize their employability.

“Graduates need to understand AI, and need to have had touches in AI,” he explained. Stewart emphasizes that this is broader than an engineering or a research challenge. “This is really a true workforce issue.”

Stewart points to Gwinnett County in Georgia as an early education example, where the community has developed a full K-16 curriculum.

“If young kids are already playing with AI on their phones, they should actually be thinking about it a little bit deeper,” he said. The idea, he explained, is for kids to move beyond simply using the tech to start seeing themselves as future creators of new technology, and being part of the broader evolution.

Nobody Gets Left Out 

Beyond the classroom, a comprehensive view of AI education would expose people in the broader community to AI learning opportunities, Stewart said. His experience in the public sector informs his decidedly inclusive view on the matter.

Before joining NVIDIA four years ago, Stewart spent more than a decade working for the state of California, and then its capital city of Sacramento. He points to his time as Sacramento’s chief innovation officer to illustrate how important it is that all citizens be included in progress.

“Sacramento was trying to move into a place to be an innovation leader in the state and nationally. I knew the city because I grew up here, and I knew that there were areas of the city that would never see innovation unless it was brought to them,” he explained. “So if I was bringing autonomous cars to Sacramento, it was for the legislators, and it was for the CHP (California Highway Patrol), but it was also for the people.”

Stewart elaborated that everyone coming in touch with self-driving vehicles needed to understand their impact. There was the technology itself — how autonomous vehicles work, how to use them as a passenger and so forth.

But there were also broader questions, such as how mechanics would need new training to understand the computer systems powering autonomous cars. And how parents would need to understand self-driving vehicles from the point of view of getting their kids to and from school without having to miss work to do the driving themselves.

Just as individuals will have different needs and wants from AI systems, so too will different communities, businesses and states take different approaches when implementing AI, Stewart said.

Diverse Approaches to AI Implementation

Public-private partnerships are critical to implementing AI across the U.S. and beyond. NVIDIA is partnering with states and higher education systems across the country for AI workforce development. And the programs being put in place are just as diverse as the states themselves.

“Every state has their idea about what they want to do when it comes to AI,” Stewart explained.

Still, some common goals hold across state lines. When Stewart’s team engages a governor’s office with talk of AI to empower the workforce, create job opportunities, and improve collaboration, inclusivity and growth, he finds that state officials listen.

Stewart added that they often open up about what they’ve been working on. “We’ve been pleasantly surprised at how far along some of the states are with their AI strategies,” he said.

In August, NVIDIA announced it is working with the state of California to train 100,000 people on AI skills over the next three years. It’s an undertaking that will involve all 116 of the state’s community colleges and California’s university system. NVIDIA will also collaborate with the California human resources system to help state employees understand how AI skills may be incorporated into current and future jobs.

In Mississippi, a robust AI strategy is already in place.

The Mississippi Artificial Intelligence Network (MAIN) is one of the first statewide initiatives focused on addressing the emergence of AI and its effects on various industries’ workforces. MAIN works with educational partners that include community colleges and universities in Mississippi, all collaborating to facilitate AI education and training.

Embrace Technology, Embrace the Future

Stewart said it’s important to encourage individuals, businesses and other organizations to actively engage with AI tools and develop an understanding of how they’re benefiting the workforce landscape.

“Now is not the time to stay on the sidelines,” said Stewart. “This is the time to jump in and start understanding.”

Small businesses, for example, can start using applications like ChatGPT to see firsthand how they can transform operations. From there, Stewart suggests, a business could partner with the local school system to empower student interns to develop AI-powered tools and workflows for data analysis, marketing and other needs.

It’s a win-win: The business can transform itself with AI while playing a crucial part in developing the workforce by giving students valuable real-world experience.

It’s crucial that people get up to speed on the changes that AI is driving. And that we all participate in shaping our collective future, Stewart explained.

“Workforce development is, I think, at the crux of this next part of the conversation because the innovation and the research and everything surrounding AI is driving change so rapidly,” he said.

Hear more from NVIDIA’s Louis Stewart on workforce development opportunities in the latest AI Podcast.


LazyGraphRAG: Setting a new standard for quality and cost


Affordable GraphRAG for every use case

The GraphRAG project aims to expand the class of questions that AI systems can answer over private datasets by leveraging the implicit relationships within unstructured text.

A key advantage of GraphRAG over conventional vector RAG (or “semantic search”) is its ability to answer global queries that address the entire dataset, such as “what are the main themes in the data?”, or “what are the most important implications for X?”. Conversely, vector RAG excels for local queries where the answer resembles the query and can be found within specific text regions, as is typically the case for “who”, “what”, “when”, and “where” questions. 

In recent blog posts, we have shared two new query mechanisms that exploit the rich, summary-based data index created by GraphRAG to improve local search performance and global search costs, respectively. 

In this blog post, we introduce a radically different approach to graph-enabled RAG that requires no prior summarization of the source data, avoiding the up-front indexing costs that may be prohibitive for some users and use cases. We call this approach “LazyGraphRAG”. 

A key advantage of LazyGraphRAG is its inherent scalability in terms of both cost and quality. Across a range of competing methods (standard vector RAG, RAPTOR, and GraphRAG local, global, and DRIFT search mechanisms), LazyGraphRAG shows strong performance across the cost-quality spectrum as follows: 

  • LazyGraphRAG data indexing costs are identical to vector RAG and 0.1% of the costs of full GraphRAG. 
  • For comparable query costs to vector RAG, LazyGraphRAG outperforms all competing methods on local queries, including long-context vector RAG and GraphRAG DRIFT search (our recently introduced RAG approach shown to outperform vector RAG) as well as GraphRAG local search. 
  • The same LazyGraphRAG configuration also shows comparable answer quality to GraphRAG Global Search for global queries, but more than 700 times lower query cost.
  • For 4% of the query cost of GraphRAG global search, LazyGraphRAG significantly outperforms all competing methods on both local and global query types, including GraphRAG global search at the C2 level (the third level of the community hierarchy recommended for most applications).

LazyGraphRAG is coming soon to our open-source GraphRAG library, providing a unified query interface for both local and global queries over a lightweight data index that is comparable in cost to standard vector RAG. 

Blending vector RAG and Graph RAG with deferred LLM use

LazyGraphRAG aims to blend the advantages of vector RAG and Graph RAG while overcoming their respective limitations: 

  • Vector RAG is a form of best-first search that uses the similarity with the query to select the best-matching source text chunks. However, it has no sense of the breadth of the dataset to consider for global queries. 
  • GraphRAG global search is a form of breadth-first search that uses the community structure of source text entities to ensure that queries are answered considering the full breadth of the dataset. However, it has no sense of the best communities to consider for local queries.

LazyGraphRAG combines best-first and breadth-first search dynamics in an iterative deepening manner (Table 1). Compared to the global search mechanism of full GraphRAG, this approach is “lazy” in ways that defer LLM use and dramatically increase the efficiency of answer generation. Overall performance can be scaled via a single main parameter – the relevance test budget – that controls the cost-quality trade-off in a consistent manner.
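
Before the full comparison in Table 1, here is a minimal Python sketch of that control loop. It is illustrative only, not the GraphRAG project's code: `rank_communities`, `rate_relevance`, `top_untested_chunks`, and `subcommunities` are hypothetical helpers standing in for the embedding-based ranking, the LLM sentence-level relevance assessor, and the community hierarchy described below.

```python
def lazy_match(subquery, communities, budget, z=2, k=5):
    """Sketch of best-first + breadth-first relevance testing with iterative deepening.

    Hypothetical helpers: rank_communities (embedding-based best-first ordering),
    rate_relevance (LLM relevance assessor), subcommunities (community hierarchy).
    """
    relevant, tests_used, misses = [], 0, 0
    # Best-first: order communities by the rank of their top-k matching chunks.
    frontier = rank_communities(subquery, communities, top_k=k)
    for community in frontier:                      # breadth-first over communities
        if tests_used >= budget:                    # relevance test budget / q per subquery
            break
        chunks = community.top_untested_chunks(k)
        hits = [c for c in chunks if rate_relevance(subquery, c)]  # one LLM call per chunk
        tests_used += len(chunks)
        relevant.extend(hits)
        misses = 0 if hits else misses + 1
        if misses >= z:                             # iterative deepening: recurse downward
            relevant += lazy_match(subquery, subcommunities(community),
                                   budget - tests_used, z, k)
            misses = 0
    return relevant
```

The single `budget` argument plays the role of the relevance test budget described above: raising it lets the search examine more chunks and communities, trading cost for quality.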

Table 1. GraphRAG vs. LazyGraphRAG

Build index
  • GraphRAG: a) uses an LLM to extract and describe entities and their relationships, b) uses an LLM to summarize all observations of each entity and relationship, c) uses graph statistics to optimize the entity graph and extract hierarchical community structure
  • LazyGraphRAG: a) uses NLP noun phrase extraction to extract concepts and their co-occurrences, b) uses graph statistics to optimize the concept graph and extract hierarchical community structure

Summarize index
  • GraphRAG: uses an LLM to summarize entities and relationships in each community
  • LazyGraphRAG: none – the “lazy” approach defers all LLM use until query time

Refine query
  • GraphRAG: none – the original query is used throughout
  • LazyGraphRAG: uses an LLM to a) identify relevant subqueries and recombine them into a single expanded query, b) refine subqueries with matching concepts from the concept graph

Match query
  • GraphRAG: none – all queries are answered using all community summaries (breadth first)
  • LazyGraphRAG: for each of q subqueries [3-5]:
    – uses text chunk embeddings and chunk-community relationships to first rank text chunks by similarity to the query, then rank communities by the rank of their top-k text chunks (best first)
    – uses an LLM-based sentence-level relevance assessor to rate the relevance of the top-k untested text chunks from communities in rank order (breadth first)
    – recurses into sub-communities after z successive communities yield zero relevant text chunks, or the relevance test budget / q is reached (iterative deepening)

Map answers
  • GraphRAG: uses an LLM to answer the original query over random batches of community summaries in parallel
  • LazyGraphRAG: for each of q subqueries [3-5]:
    – builds a subgraph of concepts from the relevant text chunks
    – uses the community assignments of concepts to group related chunks together
    – uses an LLM to extract subquery-relevant claims from groups of related chunks as a way of focusing on relevant content only
    – ranks and filters extracted claims to fit a pre-defined context window size

Reduce answers
  • GraphRAG: uses an LLM to answer the original query using the mapped answers
  • LazyGraphRAG: uses an LLM to answer the expanded query using the extracted map claims
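
To make the lightweight “Build index” step concrete, the following is a minimal sketch of noun-phrase concept extraction and co-occurrence graph construction. It is an assumption about how such an index could be built with spaCy and NetworkX, not the project's actual implementation.

```python
# Sketch of a LazyGraphRAG-style concept index: noun-phrase extraction plus a
# co-occurrence graph. Illustrative assumption only, not GraphRAG's code.
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def build_concept_graph(text_chunks):
    graph = nx.Graph()
    for chunk_id, chunk in enumerate(text_chunks):
        doc = nlp(chunk)
        concepts = {np.text.lower().strip() for np in doc.noun_chunks}
        for concept in concepts:
            graph.add_node(concept)
            # remember which chunks mention which concepts
            graph.nodes[concept].setdefault("chunks", set()).add(chunk_id)
        # co-occurrence within the same chunk becomes a weighted edge
        for a, b in itertools.combinations(sorted(concepts), 2):
            weight = graph.edges[a, b]["weight"] + 1 if graph.has_edge(a, b) else 1
            graph.add_edge(a, b, weight=weight)
    return graph

# A hierarchical community structure can then be extracted with graph statistics,
# e.g. nx.community.louvain_partitions(graph, weight="weight").
```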

LazyGraphRAG answer quality is state-of-the-art

We compared LazyGraphRAG at varying levels of relevance test budget to a range of competing methods, as follows: 

  • Dataset: 5,590 AP news articles (used with license)
  • Queries: 100 synthetic queries (50 local and 50 global), generated using a new method to be described in a future blog post
  • Metrics: Comprehensiveness, Diversity, Empowerment (as described here, with an LLM used to compare pairs of answers head-to-head on each metric)
  • Conditions: Includes LazyGraphRAG with three relevance test budget settings, in addition to eight competing conditions from GraphRAG and literature (Table 2).
Table 2. Conditions compared

  • Z100_Lite – LazyGraphRAG with a relevance test budget of 100, using a low-cost LLM at all steps
  • Z500 – LazyGraphRAG with a relevance test budget of 500, using a low-cost LLM for relevance tests and a more advanced (higher-cost) LLM for query refinement and map/reduce answer generation
  • Z1500 – LazyGraphRAG with a relevance test budget of 1,500, using a low-cost LLM for relevance tests and a more advanced (higher-cost) LLM for query refinement and map/reduce answer generation
  • C1 – GraphRAG Global Search at community level 1
  • C2 – GraphRAG Global Search at community level 2
  • C3_Dynamic – GraphRAG Global Search at community level 3 using dynamic community selection (a cost-saving solution introduced in our previous blog post)
  • LS – GraphRAG Local Search
  • DRIFT – GraphRAG DRIFT Search
  • SS_8K – Semantic search with vector RAG retrieving 8k tokens of input
  • SS_64K – Semantic search with vector RAG retrieving 64k tokens of input (included to examine the effect of a long context window on vector RAG)
  • RAPTOR – A hierarchical clustering-based approach to RAG that shares some features with LazyGraphRAG

Figure 1 shows LazyGraphRAG win rates against each of the eight competing conditions. At the lowest budget level of 100 relevance tests using a low-cost LLM model (same cost as SS_8K), LazyGraphRAG significantly outperforms all conditions on local and global queries, except for GraphRAG global search conditions for global queries. At an increased budget of 500 relevance tests using a more advanced LLM model (4% of the query cost of C2), LazyGraphRAG significantly outperforms all conditions on both local and global queries. LazyGraphRAG’s win rates continue to increase as the relevance test budget increases to 1,500, demonstrating its scalability in terms of cost vs. quality. 

Figure 1. Clustered bar charts showing win rates of LazyGraphRAG with relevance test budgets of (100, 500, 1,500) over competing conditions for each combination of (local, global) queries and (comprehensiveness, diversity, empowerment) metrics.

An example comparison is shown in Table 3.

Query: Describe what factors individuals in the United States should consider when selecting a health insurance plan during the open enrollment period for 2024 health insurance plans.

Vector RAG (64K context) answer:
# Factors to Consider When Selecting a Health Insurance Plan During Open Enrollment for 2024 
 
Selecting a health insurance plan during the open enrollment period can be a complex process, but understanding the key factors to consider can help individuals make informed decisions. Here are some important aspects to keep in mind: 
 
## Coverage and Networks 
 
One of the primary considerations is the coverage network of the plan. Insurers form coverage networks of hospitals and doctors, and it is crucial to ensure that your preferred healthcare providers are included in the network of the plan you are considering. Many marketplace plans, such as Health Maintenance Organizations (HMOs) and Exclusive Provider Organizations (EPOs), do not cover claims for most non-emergency care outside their networks [Data: Sources (47419, 47421, 47420, 47410, 47408)]. 
 
## Costs: Premiums, Deductibles, and Out-of-Pocket Maximums 
 
Understanding the costs associated with a health insurance plan is essential. This includes the monthly premiums, deductibles, copays, and coinsurance. Plans are typically categorized into different levels such as platinum, gold, silver, and bronze, with bronze plans generally having lower premiums but higher out-of-pocket costs [Data: Sources (47411, 47412, 47416, 47414, 47418)]. Additionally, it is important to consider the out-of-pocket maximum, which is the most you will have to pay in a year for covered services before the insurance starts covering all costs [Data: Sources (47416, 47415)]. 
 
## Tax Credits and Cost-Sharing Reductions 
 
Individuals with lower incomes may qualify for income-based tax credits to help reduce the cost of premiums. It is important to accurately estimate your income for the coming year when applying for these credits to avoid having to repay any excess credits received [Data: Sources (47412, 47413)]. Additionally, those with incomes below 250% of the federal poverty level may qualify for cost-sharing reductions if they select a silver plan, which can lower deductibles and out-of-pocket maximums [Data: Sources (47417, 47418)]. 
 
## Special Considerations for Medicaid and Medicare 
 
For those transitioning from Medicaid, it is important to understand that marketplace plans may come with higher premiums and copays compared to Medicaid. Additionally, many people who have been removed from Medicaid may be eligible to enroll in marketplace plans through an extended enrollment period [Data: Sources (6284, 6285, 6286, 6287)]. For seniors, the Medicare open enrollment period allows for switching between traditional Medicare and Medicare Advantage plans, with considerations for prescription drug coverage and additional benefits like dental and vision [Data: Sources (72060, 72061, 72062, 72063, 72064)]. 
 
## Assistance and Resources 
 
Navigating the complexities of health insurance can be challenging, and seeking assistance from insurance agents or government-funded navigators can be beneficial. These professionals can help individuals understand their options and make informed decisions without recommending specific plans [Data: Sources (47424, 47425, 47426)]. 
 
## Conclusion 
 
Selecting the right health insurance plan requires careful consideration of various factors including coverage networks, costs, eligibility for tax credits and cost-sharing reductions, and available assistance. By thoroughly evaluating these aspects, individuals can choose a plan that best meets their healthcare needs and financial situation for the upcoming year.
LazyGraphRAG answer:

### Types of Health Insurance Plans 
 
When selecting a health insurance plan during the open enrollment period for 2024, individuals in the United States have several options to consider: 
 
1. **Marketplace Plans**: These are available under the Affordable Care Act (ACA) and are categorized into platinum, gold, silver, and bronze levels. Bronze plans generally have lower premiums but higher out-of-pocket costs, while platinum plans have higher premiums but lower out-of-pocket costs [Data: Sources (47411, 47420, 47421)]. 
 
2. **Low-Cost Marketplace Plans**: These often come with high deductibles, requiring patients to pay thousands of dollars before most coverage kicks in. They also have annual out-of-pocket maximums that can exceed $9,000 for individuals and $18,000 for families [Data: Sources (47415, 47416, 47414)]. 
 
3. **Exclusive Provider Organizations (EPOs) and Health Maintenance Organizations (HMOs)**: EPOs and HMOs generally restrict patients to a network of doctors and require a primary care doctor to direct care. They tend to be cheaper but lack out-of-network flexibility [Data: Sources (47420, 47421, 43218, 43217)]. 
 
4. **Preferred Provider Organizations (PPOs)**: These plans allow for out-of-network services but at a higher cost. They offer more flexibility compared to HMOs and EPOs [Data: Sources (43217)]. 
 
5. **High-Deductible Health Plans (HDHPs)**: Defined as plans with a deductible of at least $1,600 for individual coverage or $3,200 for family coverage, with out-of-pocket maximums of no more than $8,050 or $16,100, respectively. HDHPs usually have lower premiums, and sometimes companies contribute to a health savings account (HSA) to help cover the deductible [Data: Sources (43227, 43226)]. 
 
6. **Medicare Advantage**: These are privately run versions of the federal government’s Medicare program, mostly for people aged 65 and over. They often include prescription drug coverage and may offer additional benefits like dental or vision coverage not provided by traditional Medicare [Data: Sources (72063, 72061, 72060, 72062)]. 
 
7. **Short-Term Health Insurance Plans**: These plans are limited to three months and can only be renewed for a maximum of four months under new rules. They are intended for temporary coverage but often lack comprehensive benefits [Data: Sources (97999, 97995, 97996, 97997)]. 
 
### Cost Factors: Premiums, Deductibles, Co-pays, and Out-of-Pocket Maximums 
 
The overall cost of health insurance plans in 2024 is influenced by several factors: 
 
– **Premiums**: This is the set monthly cost you pay for your health insurance plan. Premiums have been rising, with a notable increase of 7% for both family and single plans in 2023, partly due to inflation [Data: Sources (83383, 83382, 83384, 83385, 83381, +more)]. 
 
– **Deductibles**: The amount you pay out-of-pocket for health care services before your insurance starts to pay. For HDHPs, the deductible is at least $1,600 for individual coverage or $3,200 for family coverage [Data: Sources (43226, 43225)]. 
 
– **Co-pays and Co-insurance**: These are the costs you pay each time you receive a medical service. Co-pays are fixed amounts, while co-insurance is a percentage of the service cost. 
 
– **Out-of-Pocket Maximums**: This is the maximum amount you will pay for covered services in a year. For example, HDHPs have out-of-pocket maximums of no more than $8,050 for individual coverage or $16,100 for family coverage [Data: Sources (43227, 43226)]. 
 
### Provider Networks: In-Network vs. Out-of-Network 
 
The network of healthcare providers is a crucial factor in selecting a health insurance plan: 
 
– **In-Network Providers**: These are doctors and hospitals that have agreements with your insurance plan to provide services at lower rates. Ensuring your preferred doctors and specialists are in-network can save you significant costs [Data: Sources (43216, 47419)]. 
 
– **Out-of-Network Providers**: Services from these providers are usually more expensive and may not be covered at all, except in emergencies. PPO plans offer some out-of-network coverage but at a higher cost, while HMOs and EPOs generally do not cover non-emergency out-of-network care [Data: Sources (43217, 47421)]. 
 
### Specific Medical Needs and Services 
 
When selecting a health insurance plan, individuals should consider their specific medical needs: 
 
– **Prescription Drugs**: Ensure that your medications are covered by the plan’s formulary, as drug coverage can change annually [Data: Sources (43220, 43218, 43219)]. 
 
– **Mental Health Services**: Coverage for mental health treatments is essential, especially with new rules pushing insurers to increase their coverage of these services [Data: Sources (97031, 97028, 97027, 97030, 97033, +more)]. 
 
– **Chronic Conditions**: Plans should cover ongoing treatments and medications for chronic conditions. Medicare Supplement Insurance (Medigap) can help cover gaps in Medicare for chronic disease management [Data: Sources (93367, 93368)]. 
 
– **Preventive Care**: Coverage for preventive services like cancer screenings and HIV prevention is mandated under the ACA, though its future is uncertain due to ongoing legal battles [Data: Sources (71106, 71109, 71098, 71099, 71100, +more)]. 
 
### Key Dates and Steps for Open Enrollment 
 
The open enrollment period for 2024 health insurance plans involves several key dates and steps: 
 
– **Marketplace Plans**: Open enrollment starts on November 1, 2023, and runs through mid-December in most states, ending on January 16, 2024 [Data: Sources (47419, 47411, 47416, 47421, 47409, +more)]. 
 
– **Medicare**: Open enrollment for Medicare runs from October 15, 2023, to December 7, 2023. During this period, individuals can choose between traditional Medicare, Medicare Advantage plans, and prescription drug plans [Data: Sources (72061, 72063, 72060, 72062)]. 
 
– **Special Enrollment Periods**: Individuals who lose coverage due to life events like job loss or moving may qualify for special enrollment periods. For example, those removed from Medicaid may enroll in marketplace plans through July 2024 [Data: Sources (6288, 6289)]. 
 
By considering these factors, individuals can make informed decisions about their health insurance coverage for 2024, ensuring they select plans that best meet their medical needs and financial situations.

Looking forward

LazyGraphRAG shows that it is possible for a single, flexible query mechanism to substantially outperform a diverse range of specialized query mechanisms across the local-global query spectrum, and to do so without the up-front costs of LLM data summarization. Its very fast and almost-free indexing makes LazyGraphRAG ideal for one-off queries, exploratory analysis, and streaming data use cases, while its ability to smoothly increase answer quality with increasing relevance test budget makes it a valuable tool for benchmarking RAG approaches in general (e.g., “RAG approach X beats LazyGraphRAG with budget Y for task Z”). 

Does this mean that all graph-enabled RAG should be lazy? We believe the answer is no, for three reasons: 

  1. A GraphRAG data index of entity, relationship, and community summaries has use value beyond question answering (e.g., reading and sharing as reports). 
  2. A GraphRAG data index of entity, relationship, and community summaries, combined with a LazyGraphRAG-like search mechanism, is likely to achieve better results than LazyGraphRAG alone. 
  3. A new kind of GraphRAG data index designed to support a LazyGraphRAG-like search mechanism (e.g., through pre-emptive claim and topic extraction) is likely to achieve the best possible results. 

We will be exploring these directions in the coming period, with all advances (including LazyGraphRAG itself) released via the GraphRAG GitHub repository. Stay tuned! 



How 123RF saved over 90% of their translation costs by switching to Amazon Bedrock

In the rapidly evolving digital content industry, multilingual accessibility is crucial for global reach and user engagement. 123RF, a leading provider of royalty-free digital content, is an online resource for creative assets, including AI-generated images from text. In 2023, they used Amazon OpenSearch Service to improve discovery of images by using vector-based semantic search. Building on this success, they have now implemented Amazon Bedrock and Anthropic’s Claude 3 Haiku to improve their content moderation a hundredfold and to speed up content translation, further enhancing their global reach and efficiency.

Although the company achieved significant success among English-speaking users with its generative AI-based semantic search tool, it faced content discovery challenges in 15 other languages because of English-only titles and keywords. The cost of using Google Translate for continuous translations was prohibitive, and other models such as Anthropic’s Claude Sonnet and OpenAI GPT-4o weren’t cost-effective. Although OpenAI GPT-3.5 met cost criteria, it struggled with consistent output quality. This prompted 123RF to search for a more reliable and affordable solution to enhance multilingual content discovery.

This post explores how 123RF used Amazon Bedrock, Anthropic’s Claude 3 Haiku, and a vector store to efficiently translate content metadata, significantly reduce costs, and improve their global content discovery capabilities.

The challenge: Balancing quality and cost in mass translation

After implementing generative AI-based semantic search and text-to-image generation, 123RF saw significant traction among English-speaking users. This success, however, cast a harsh light on a critical gap in their global strategy: their vast library of digital assets—comprising millions of images, audio files, and motion graphics—needed a similar overhaul for non-English speaking users.

The crux of the problem lay in the nature of their content. User-generated titles, keywords, and descriptions—the lifeblood of searchability in the digital asset world—were predominantly in English. To truly serve a global audience and unlock the full potential of their library, 123RF needed to translate this metadata into 15 different languages. But as they quickly discovered, the path to multilingual content was filled with financial and technical challenges.

The translation conundrum: Beyond word-for-word

Idioms don’t always translate well

As 123RF dove deeper into the challenge, they uncovered layers of complexity that went beyond simple word-for-word translation. The preceding figure shows one particularly difficult example: idioms. A phrase like “The early bird gets the worm,” translated literally, would not convey its meaning as well as the equivalent Spanish idiom, “A quien madruga, Dios le ayuda.” Another significant hurdle was named entity resolution (NER)—a critical aspect for a service dealing with diverse visual and audio content.

NER involves correctly identifying and handling proper nouns, brand names, specific terminology, and culturally significant references across languages. For instance, a stock photo of the Eiffel Tower should retain its name in all languages, rather than being literally translated. Similarly, brand names like Coca-Cola or Nike should remain unchanged, regardless of the target language.

This challenge is particularly acute in the realm of creative content. Consider a hypothetical stock image titled Young woman using MacBook in a Starbucks. An ideal translation system would need to do the following:

  • Recognize MacBook and Starbucks as brand names that should not be translated
  • Correctly translate Young woman while preserving the original meaning and connotations
  • Handle the preposition in appropriately, which might change based on the grammatical rules of the target language

Moreover, the system needed to handle industry-specific jargon, artistic terms, and culturally specific concepts that might not have direct equivalents in other languages. For instance, how would one translate bokeh effect into languages where this photographic term isn’t commonly used?

These nuances highlighted the inadequacy of simple machine translation tools and underscored the need for a more sophisticated, context-aware solution.

Turning to language models: Large models compared to small models

In their quest for a solution, 123RF explored a spectrum of options, each with its own set of trade-offs:

  • Google Translate – The incumbent solution offered reliability and ease of use. However, it came with a staggering price tag. The company had to clear their backlog of 45 million translations. Adding to this, there was an ongoing monthly financial burden for new content that their customers generated. Though effective, this option threatened to cut into 123RF’s profitability, making it unsustainable in the long run.
  • Large language models – Next, 123RF turned to cutting-edge large language models (LLMs) such as OpenAI GPT-4 and Anthropic’s Claude Sonnet. These models showcased impressive capabilities in understanding context and producing high-quality translations. However, the cost of running these sophisticated models at 123RF’s scale proved prohibitive. Although they excelled in quality, they fell short in cost-effectiveness for a business dealing with millions of short text snippets.
  • Smaller models – In an attempt to find a middle ground, 123RF experimented with less capable models such as OpenAI GPT-3.5. These offered a more palatable price point, aligning better with 123RF’s budget constraints. However, this cost savings came at a price: inconsistency in output quality. The translations, although sometimes acceptable, lacked the reliability and nuance required for professional-grade content description.
  • Fine-tuning – 123RF briefly considered fine-tuning a smaller language model to further reduce cost. However, they understood there would be a number of hurdles: they would have to regularly fine-tune models as new model updates occur, hire subject matter experts to train the models and manage their upkeep and deployment, and potentially manage a model for each of the output languages.

This exploration laid bare a fundamental challenge in the AI translation space: the seemingly unavoidable trade-off between cost and quality. High-quality translations from top-tier models were financially unfeasible, whereas more affordable options couldn’t meet the standard of accuracy and consistency that 123RF’s business demanded.

Solution: Amazon Bedrock, Anthropic’s Claude 3 Haiku, prompt engineering, and a vector store

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Throughout this transformative journey, Amazon Bedrock proved to be the cornerstone of 123RF’s success. Several factors contributed to making it the provider of choice:

  • Model variety – Amazon Bedrock offers access to a range of state-of-the-art language models, allowing 123RF to choose the one best suited for their specific needs, like Anthropic’s Claude 3 Haiku.
  • Scalability – The ability of Amazon Bedrock to handle massive workloads efficiently was crucial for processing millions of translations.
  • Cost-effectiveness – The pricing model of Amazon Bedrock, combined with its efficient resource utilization, played a key role in achieving the dramatic cost reduction.
  • Integration capabilities – The ease of integrating Amazon Bedrock with other AWS services facilitated the implementation of advanced features such as a vector database for dynamic prompting.
  • Security and compliance – 123RF works with user-generated content, and the robust security features of Amazon Bedrock provided peace of mind in handling potentially sensitive information.
  • Flexibility for custom solutions – The openness of Amazon Bedrock to custom implementations, such as the dynamic prompting technique, allowed 123RF to tailor the solution precisely to their needs.

Cracking the code: Prompt engineering techniques

The first breakthrough in 123RF’s translation journey came through a collaborative effort with the AWS team, using the power of Amazon Bedrock and Anthropic’s Claude 3 Haiku. The key to their success lay in the innovative application of prompt engineering techniques—a set of strategies designed to coax the best performance out of LLMs, especially important for cost effective models.

Prompt engineering is crucial when working with LLMs because these models, while powerful, can produce non-deterministic outputs—meaning their responses can vary even for the same input. By carefully crafting prompts, we can provide context and structure that helps mitigate this variability. Moreover, well-designed prompts serve to steer the model towards the specific task at hand, ensuring that the LLM focuses on the most relevant information and produces outputs aligned with the desired outcome. In 123RF’s case, this meant guiding the model to produce accurate, context-aware translations that preserved the nuances of the original content.

Let’s dive into the specific techniques employed.

Assigning a role to the model

The team began by assigning the AI model a specific role—that of an AI language translation assistant. This seemingly simple step was crucial in setting the context for the model’s task. By defining its role, the model was primed to approach the task with the mindset of a professional translator, considering nuances and complexities that a generic language model might overlook.

For example:

You are an AI language translation assistant. 
Your task is to accurately translate a passage of text from English into another specified language.

Separation of data and prompt templates

A clear delineation between the text to be translated and the instructions for translation was implemented. This separation served two purposes:

  • Provided clarity in the model’s input, reducing the chance of confusion or misinterpretation
  • Allowed for simpler automation and scaling of the translation process, because the same prompt template could be used with different input texts

For example:

Here is the text to translate:
<text> {{TEXT}} </text>
Please translate the above text into this language: {{TARGET_LANGUAGE}}
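
As a minimal illustration of that separation (assuming plain Python string templating rather than 123RF's actual tooling, with a simplified version of the template above), the same template can be filled programmatically for any text/language pair:

```python
# Sketch only: one reusable prompt template, many text/language pairs.
PROMPT_TEMPLATE = (
    "You are an AI language translation assistant. "
    "Your task is to accurately translate a passage of text from English "
    "into another specified language.\n\n"
    "Here is the text to translate:\n"
    "<text> {text} </text>\n"
    "Please translate the above text into this language: {target_language}"
)

def build_prompt(text: str, target_language: str) -> str:
    return PROMPT_TEMPLATE.format(text=text, target_language=target_language)

for language in ["Spanish", "French", "Japanese"]:
    prompt = build_prompt("Young woman using MacBook in a Starbucks", language)
    # prompt is now ready to send to the model
```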

Chain of thought

One of the most innovative aspects of the solution was the implementation of a scratchpad section. This allowed the model to externalize its thinking process, mimicking the way a human translator might work through a challenging passage.

The scratchpad prompted the model to consider the following:

  • The overall meaning and intent of the passage
  • Idioms and expressions that might not translate literally
  • Tone, formality, and style of the writing
  • Proper nouns such as names and places that should not be translated
  • Grammatical differences between English and the target language

This step-by-step thought process significantly improved the quality and accuracy of translations, especially for complex or nuanced content.

K-shot examples

The team incorporated multiple examples of high-quality translations directly into the prompt. This technique, known as K-shot learning, provided the model with a number (K) of concrete examples of the desired output quality and style.

By carefully selecting diverse examples that showcased different translation challenges (such as idiomatic expressions, technical terms, and cultural references), the team effectively trained the model to handle a wide range of content types.

For example:

Examples:
<text>The early bird catches the worm.</text>
<translated_text>El que madruga, Dios le ayuda.</translated_text>

The magic formula: Putting it all together

The culmination of these techniques resulted in a prompt template that encapsulated the elements needed for high-quality, context-aware translation. The following is an example prompt with the preceding steps. The actual prompt used is not shown here.

You are an AI language translation assistant. Your task is to accurately translate a passage of text from English into another specified language. Here is the text to translate:
<text> {{TEXT}} </text>
Please translate the above text into this language: {{TARGET_LANGUAGE}}
Think carefully, in the <scratchpad> section below, think through how you will translate the text while preserving its full meaning and nuance. Consider:
- The overall meaning and intent of the passage
- Idioms and expressions that may not translate literally
- Tone, formality, and style of the writing
- Proper nouns like names and places that should not be translated
- Grammatical differences between English and {{TARGET_LANGUAGE}}
Examples:
<text>The software update is scheduled for next Tuesday.</text>
<translated_text>La actualización del software está programada para el próximo martes.</translated_text>
<text>Breaking news: Elon Musk acquires Twitter for $44 billion.</text>
<translated_text>Última hora: Elon Musk adquiere Twitter por 44 mil millones de dólares.</translated_text>
... [8 more diverse examples] ...
Now provide your final translated version of the text inside <translated_text> tags. Ensure the translation is as accurate and natural-sounding as possible in {{TARGET_LANGUAGE}}. Do not translate any names, places or other proper nouns.
<translated_text>

This template provided a framework for consistent, high-quality translations across a wide range of content types and target languages.
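
The following is a minimal sketch of sending such a prompt to Anthropic’s Claude 3 Haiku on Amazon Bedrock with boto3. The model ID, Region, and inference settings are assumptions; check the current values for your account, and treat this as an illustration rather than 123RF's production code.

```python
# Sketch: invoking Claude 3 Haiku on Amazon Bedrock for one translation request.
# Model ID, Region, and parameters are assumptions; adjust for your environment.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def translate(prompt: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0.2,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    payload = json.loads(response["body"].read())
    # The model returns a list of content blocks; take the text of the first one.
    return payload["content"][0]["text"]
```

The returned text can then be post-processed to pull out the content between the <translated_text> tags before it is stored.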

Further refinement: Dynamic prompting for grounding models

Although the initial implementation yielded impressive results, the AWS team suggested further enhancements through dynamic prompting techniques. This advanced approach aimed to make the model even more adaptive and context aware. They adopted the Retrieval Augmented Generation (RAG) technique for creating a dynamic prompt template with K-shot examples relevant to each phrase rather than generic examples for each language. This also allowed 123RF to take advantage of their current catalog of high quality translations to further align the model.

Vector database of high-quality translations

The team proposed creating a vector database for each target language, populated with previous high-quality translations. This database would serve as a rich repository of translation examples, capturing nuances and domain-specific terminologies.

The implementation included the following components:

  1. Embedding generation:
    • Use embedding models such as Amazon Titan or Cohere’s offerings on Amazon Bedrock to convert both source texts and their translations into high-dimensional vectors.
  2. Chunking strategy:
    • To maintain context and ensure meaningful translations, the team implemented a careful chunking strategy:
      1. Each source text (in English) was paired with its corresponding translation in the target language.
      2. These pairs were stored as complete sentences or logical phrases, rather than individual words or arbitrary character lengths.
      3. For longer content, such as paragraphs or descriptions, the text was split into semantically meaningful chunks, ensuring that each chunk contained a complete thought or idea.
      4. Each chunk pair (source and translation) was assigned a unique identifier to maintain the association.
  3. Vector storage:
    • The vector representations of both the source text and its translation were stored together in the database.
    • The storage structure included:
      1. The original source text chunk.
      2. The corresponding translation chunk.
      3. The vector embedding of the source text.
      4. The vector embedding of the translation.
      5. Metadata such as the content type, domain, and any relevant tags.
  4. Database organization:
    • The database was organized by target language, with separate indices or collections for each language pair (for example, English-Spanish and English-French).
    • Within each language pair, the vector pairs were indexed to allow for efficient similarity searches.
  5. Similarity search:
    • For each new translation task, the system would perform a hybrid search to find the most semantically similar sentences from the vector database:
      1. The new text to be translated was converted into a vector using the same embedding model.
      2. A similarity search was performed in the vector space to find the closest matches in the source language.
      3. The corresponding translations of these matches were retrieved, providing relevant examples for the translation task.

This structured approach to storing and retrieving text-translation pairs allowed for efficient, context-aware lookups that significantly improved the quality and relevance of the translations produced by the LLM.
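
A minimal sketch of the embedding and similarity lookup described above follows, using Amazon Titan Text Embeddings on Bedrock and a brute-force cosine search over an in-memory list. The model ID and the in-memory store are assumptions for illustration; a production system would typically use a managed vector database.

```python
# Sketch: embed source texts with Amazon Titan on Bedrock and retrieve the most
# similar stored translation pairs by cosine similarity. Illustrative only.
import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",   # assumed embedding model ID
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Each entry pairs a source chunk with its stored high-quality translation.
translation_memory = [
    {"source": "The early bird catches the worm.",
     "translation": "A quien madruga, Dios le ayuda."},
]
for entry in translation_memory:
    entry["vector"] = embed(entry["source"])      # indexed once, reused at query time

def top_k_examples(new_text: str, k: int = 5):
    query_vector = embed(new_text)
    scored = sorted(translation_memory,
                    key=lambda e: cosine(query_vector, e["vector"]),
                    reverse=True)
    return scored[:k]
```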

Putting it all together

The top matching examples from the vector database would be dynamically inserted into the prompt, providing the model with highly relevant context for the specific translation task at hand.

This offered the following benefits:

  • Improved handling of domain-specific terminology and phraseology
  • Better preservation of style and tone appropriate to the content type
  • Enhanced ability to resolve named entities and technical terms correctly

The following is an example of a dynamically generated prompt:

[Standard prompt preamble]
...
Examples:
<text>{{Dynamically inserted similar source text 1}}</text>
<translated_text>{{Corresponding high-quality translation 1}}</translated_text>
<text>{{Dynamically inserted similar source text 2}}</text>
<translated_text>{{Corresponding high-quality translation 2}}</translated_text>
...
[Rest of the standard prompt]

This dynamic approach allowed the model to continuously improve and adapt, using the growing database of high-quality translations to inform future tasks.
The following diagram illustrates the process workflow.

Diagram: dynamic prompting with RAG

How to ground translations with a vector store

The process includes the following steps:

  1. Convert the new text to be translated into a vector using the same embeddings model.
  2. Compare text and embeddings against a database of high-quality existing translations.
  3. Combine similar translations with an existing prompt template of generic translation examples for target language.
  4. Send the new augmented prompt with initial text to be translated to Amazon Bedrock.
  5. Store the output of the translation in an existing database or save it for human-in-the-loop evaluation.
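
A minimal sketch of step 3 follows, folding the retrieved pairs into the K-shot examples section of the prompt. It builds on the hypothetical `top_k_examples` helper sketched earlier, and the surrounding template text is an illustrative simplification of the full prompt.

```python
# Sketch of step 3: turn retrieved translation pairs into K-shot examples and
# splice them into the standard prompt template. top_k_examples is the
# hypothetical retrieval helper sketched earlier.
def build_dynamic_prompt(new_text: str, target_language: str, k: int = 5) -> str:
    examples = top_k_examples(new_text, k)
    example_block = "\n".join(
        f"<text>{e['source']}</text>\n<translated_text>{e['translation']}</translated_text>"
        for e in examples
    )
    return (
        "You are an AI language translation assistant. Your task is to accurately "
        "translate a passage of text from English into another specified language.\n"
        f"Here is the text to translate:\n<text> {new_text} </text>\n"
        f"Please translate the above text into this language: {target_language}\n"
        "Examples:\n"
        f"{example_block}\n"
        "Now provide your final translated version of the text inside "
        "<translated_text> tags.\n<translated_text>"
    )
```

The assembled prompt can then be sent to Amazon Bedrock as in the earlier invocation sketch (step 4), and the output stored or queued for human review (step 5).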

The results: A 95% cost reduction and beyond

The impact of implementing these advanced techniques on Amazon Bedrock with Anthropic’s Claude 3 Haiku and the engineering effort with AWS account teams was nothing short of transformative for 123RF. By working with AWS, 123RF was able to achieve a staggering 95% reduction in translation costs. But the benefits extended far beyond cost savings:

  • Scalability – The new solution with Anthropic’s Claude 3 Haiku allowed 123RF to rapidly expand their multilingual offerings. They quickly rolled out translations for 9 languages, with plans to cover all 15 target languages in the near future.
  • Quality improvement – Despite the massive cost reduction, the quality of translations saw a marked improvement. The context-aware nature of the LLM, combined with careful prompt engineering, resulted in more natural and accurate translations.
  • Handling of edge cases – The system showed remarkable prowess in handling complex cases such as idiomatic expressions and technical jargon, which had been pain points with previous solutions.
  • Faster time-to-market – The efficiency of the new system significantly reduced the time required to make new content available in multiple languages, giving 123RF a competitive edge in rapidly updating their global offerings.
  • Resource reallocation – The cost savings allowed 123RF to reallocate resources to other critical areas of their business, fostering innovation and growth.

Looking ahead: Continuous improvement and expansion

The success of this project has opened new horizons for 123RF and set the stage for further advancements:

  • Expanding language coverage – With the cost barrier significantly lowered, 123RF is now planning to expand their language offerings beyond the initial 15 target languages, potentially tapping into new markets and user bases.
  • Anthropic’s Claude 3.5 Haiku – The recent announcement of Anthropic’s Claude 3.5 Haiku has sparked excitement at 123RF. The model promises even greater intelligence and efficiency, potentially allowing for further refinements in translation quality and cost-effectiveness.
  • Broader AI integration – Encouraged by the success in translation, 123RF is exploring additional use cases for generative AI within their operations. Potential areas include the following:
    • Enhanced image tagging and categorization.
    • Content moderation of user-generated images.
    • Personalized content recommendations for users.
  • Continuous learning loop – The team is working on implementing a feedback mechanism where successful translations are automatically added to the vector database, creating a virtuous cycle of continuous improvement.
  • Cross-lingual search enhancement – Using the improved translations, 123RF is developing more sophisticated cross-lingual search capabilities, allowing users to find relevant content regardless of the language they search in.
  • Prompt catalog – 123RF can explore the newly launched Amazon Bedrock Prompt Management as a way to manage prompt templates and iterate on them effectively.

Conclusion

123RF’s success story with Amazon Bedrock and Anthropic’s Claude is more than just a tale of cost reduction—it’s a blueprint for how businesses can use cutting-edge AI to break down language barriers and truly globalize their digital content. This case study demonstrates the transformative power of innovative thinking, advanced prompt engineering, and the right technological partnership.

123RF’s journey offers the following key takeaways:

  • The power of prompt engineering in extracting optimal performance from LLMs
  • The importance of context and domain-specific knowledge in AI translations
  • The potential of dynamic, adaptive AI solutions in solving complex business challenges
  • The critical role of choosing the right technology partner and platform

As we look to the future, it’s clear that the combination of cloud computing, generative AI, and innovative prompt engineering will continue to reshape the landscape of multilingual content management. The barriers of language are crumbling, opening up new possibilities for global communication and content discovery.

For businesses facing similar challenges in global content discovery, 123RF’s journey offers valuable insights and a roadmap to success. It demonstrates that with the right technology partner and a willingness to innovate, even the most daunting language challenges can be transformed into opportunities for growth and global expansion. If you have a similar use case and want help implementing this technique, reach out to your AWS account teams, or sharpen your prompt engineering skills through our prompt engineering workshop available on GitHub.


About the Author

Fahim Surani pictureFahim Surani is a Solutions Architect at Amazon Web Services who helps customers innovate in the cloud. With a focus in Machine Learning and Generative AI, he works with global digital native companies and financial services to architect scalable, secure, and cost-effective products and services on AWS. Prior to joining AWS, he was an architect, an AI engineer, a mobile games developer, and a software engineer. In his free time he likes to run and read science fiction.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, AWS’ flagship generative AI offering for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.


Connect SharePoint Online to Amazon Q Business using OAuth 2.0 ROPC flow authentication

Enterprises face significant challenges accessing and utilizing the vast amounts of information scattered across organization’s various systems. What if you could simply ask a question and get instant, accurate answers from your company’s entire knowledge base, while accounting for an individual user’s data access levels?

Amazon Q Business is a game changing AI assistant that’s revolutionizing how enterprises interact with their data. With Amazon Q Business, you can access relevant information through natural language conversations, drawing insights from diverse data sources within your organization, adhering to the permissions granted to your user account.

At its core, Amazon Q Business works by first indexing the content from a variety of data sources using built-in data source connectors. These connectors function as an integration layer, unifying content from diverse systems such as Salesforce, Microsoft Exchange, and SharePoint into a centralized index. This consolidated index powers the natural language processing and response generation capabilities of Amazon Q. When a user asks a question using the built-in web experience, Amazon Q Business retrieves relevant content from the index, taking into account user profiles and permissions. It then uses large language models (LLMs) to provide accurate, personalized, and well-written responses based on the consolidated data.

For a full list of Amazon Q supported data source connectors, refer to Supported connectors.

In this post, we show how to connect SharePoint Online to Amazon Q Business using the OAuth 2.0 Resource Owner Password Credentials (ROPC) flow for authentication. This approach is useful when you need Amazon Q Business to crawl through OneNote, when certificate-based authentication is not preferred, or when your organization has a strict policy that requires regular password rotation. For a complete list of authentication mechanisms, refer to SharePoint (Online) connector overview.

We provide a step-by-step guide for the Azure AD configuration and demonstrate how to set up the Amazon Q connector to establish this secure integration.

Solution overview

SharePoint is a web-based solution developed by Microsoft that enables organizations to collaborate, manage documents, and share information efficiently. It offers a wide range of features, including using document libraries, viewing lists, publishing pages, sharing events and links, and allowing users to make comments, making it a great tool for team collaboration and content management.

After integrating SharePoint Online with Amazon Q Business, you can ask questions using natural language about the content stored in the SharePoint sites. For example, if your organization’s human resources team manages an internal SharePoint site and maintains a list of holidays for geographical regions, you can ask, “What are the company holidays for this year?” Amazon Q Business will then list region-specific holidays based on your location (country).

The following diagram illustrates the solution architecture. In the upcoming sections, we show you how to implement this architecture. After you integrate Amazon Q Business using the SharePoint connector, Amazon Q Business will crawl through the SharePoint content and update the index whenever content changes. Each published event, page, link, file, comment, OneNote, and attachment on the SharePoint site is treated as a document. In addition to the documents, it also crawls through access control lists (ACLs) for each document (user and group information) and stores them in the index. This allows end users to see chat responses generated only from the documents they have access to.

You can configure Azure AD using either of the following methods:

  • Use the Azure AD console GUI – This is a manual process
  • Use the provided PowerShell script – This is an automated process that takes in the inputs and configures the required permissions

We demonstrate both methods in the following sections.

Prerequisites

To follow along, you need the following prerequisites:

  • The user performing these steps should be a global administrator on Azure AD/Entra ID.
  • You need to configure Microsoft Entra ID and AWS IAM Identity Center. For details, see Configure SAML and SCIM with Microsoft Entra ID and IAM Identity Center.
  • You need a Microsoft Windows instance to run PowerShell scripts and commands with PowerShell 7.4.1+. Details of the required PowerShell modules are described in the following steps.
  • The user should have administrator permissions on the Windows instance.
  • The user running these PowerShell commands should have the right M365 license (for example, M365 E3).

Configure Azure AD using the Azure AD console

To configure Azure AD using the GUI, complete the steps in this section.

Register an Azure AD application

Complete the following steps to register an Azure AD application in the Azure AD tenant that is linked to the SharePoint Online/O365 tenant:

  1. Open the Office 365 Admin Center using the account of a user member of the Tenant Global Admins group.
  2. Navigate to Microsoft Azure Portal.
  3. Search for and choose App registrations.

  1. Choose New registration.

  1. Enter a name for your application, select who can use this application, then choose Register.

An application will be created. You will see a page like the following screenshot.

  1. Note the values for Display name, Application (client) ID, and Directory (tenant) ID. These IDs will be different than what is shown in the screenshot.

Now you can configure the newly registered application with Microsoft Graph and SharePoint API permissions.

When configuring permissions, you have two different options:

  • Option 1 – Allow access to a specific set of SharePoint sites by granting the Sites.Selected permission
  • Option 2 – Allow access to all SharePoint sites by granting the Sites.FullControl.All permission

For option 1, install the MS Graph PowerShell SDK as a prerequisite.

Option 1: Manually allow access to specific SharePoint sites

If you choose option 1, to grant access to specific sites instead of all sites, you need to complete additional prerequisites.

Make sure you have access to another application in Microsoft Entra ID with Sites.FullControl.All application-level permissions, along with its client ID and client secret. This application won’t be used by the Amazon Q Business connector, but it’s needed to grant Sites.Selected permissions only to the application you just registered. If you don’t have access to an application with Sites.FullControl permissions, you can follow the previous steps to register a new application and grant Sites.FullControl as described in option 2. We refer to this application as SitesFullControlApp.

To configure your permissions using option 1, complete the following steps:

  1. In the navigation pane, choose API permissions.
  2. Choose the options menu (three dots) next to the permissions that were granted by default when you registered the application, and remove those permissions.

  1. Choose Add a permission and then choose Microsoft Graph.

  1. Choose Delegated permissions.

  1. Select Sites.Selected and choose Add permissions.

  1. Add the following Microsoft Graph API delegated permissions:
    1. GroupMember.Read.All
    2. User.Read.All

You will see the permissions listed as shown in the following screenshot.

  1. Some of these permissions require admin consent in a tenant before they can be used. To grant admin consent, choose Grant admin consent for <organization name> and choose Yes to confirm.

After granting admin consent, your permissions should look like the following screenshot.

  1. To grant permissions to a specific SharePoint site, you’ll need to obtain the Site ID for that site (a programmatic alternative using the Microsoft Graph API is shown at the end of this section).
    1. Visit the URL of the SharePoint site in your web browser. The URL will be in the format https://yourcompany.sharepoint.com/sites/{SiteName}.
    2. Log in to the site using valid credentials.
    3. Edit the URL in your browser’s address bar by appending /_api/site/id to the end of {SiteName}. For example, if the original URL was https://yourcompany.sharepoint.com/sites/HumanResources, modify it to https://yourcompany.sharepoint.com/sites/HumanResources/_api/site/id.
    4. Press Enter, and the browser will display the Site ID for that particular SharePoint site collection.

  1. Run the script after gathering the following values:
    • AppName – Display name that you captured earlier.
    • AppID – Application (client) ID that you captured earlier.
    • SitesFullControlAppID – Application (client) ID of the application that was granted Sites.FullControl.All. This application won’t be used by the Amazon Q Business connector; it’s only needed to grant the Sites.Selected permission to the application you registered.
    • SitesFullControlAppClientSecret – Client secret of the SitesFullControlApp.
    • SiteID – SharePoint Site ID obtained in the previous step.
    • TenantId – Directory (tenant) ID that you captured earlier.
param(
  [Parameter(Mandatory = $true,
    HelpMessage = "The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was registered")]
  [String]
  $AppID,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was granted with Sites.FullControl.All")]
  [String]
  $SitesFullControlAppID,

  [Parameter(Mandatory = $true,
    HelpMessage = "Client Secret of the APP ID that was granted with Sites.FullControl.All")]
  [string]
  $SitesFullControlAppClientSecret,

  [Parameter(Mandatory = $true,
    HelpMessage = "SharePoint Site ID")]
  [String]
  $SiteId,

  [Parameter(Mandatory = $true,
    HelpMessage = "Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,

  [Parameter(Mandatory = $false)]
  [Switch]
  $StayConnected = $false
)

# Get an access token using the client credentials of the application that has Sites.FullControl.All permissions, then use that token to grant the newly registered application (which has only Sites.Selected) full control on the specified site.

$Scope = "https://graph.microsoft.com/.default"
$TokenEndpoint = "https://login.microsoftonline.com/$TenantId/oauth2/v2.0/token"

# Body of the request

$body = @{
  grant_type    = "client_credentials"
  client_id     = $SitesFullControlAppID
  client_secret = $SitesFullControlAppClientSecret
  scope         = $Scope
}

# Get access token
$response = Invoke-RestMethod -Uri $TokenEndpoint -Method POST -Body $body 

# URL to grant permission to site
$url = "https://graph.microsoft.com/v1.0/sites/$SiteId/permissions"

# Define the body content as JSON string

$Body = @"
{
  "roles": ["fullcontrol"],
  "grantedToIdentities": [{
    "application": {
      "id": "$AppID",
      "displayName": "$AppName"
    }
  }]
}
"@

# Headers
$headers = @{
  "Authorization" = "Bearer $($response.access_token)"
  "Content-Type"  = "application/json"
}
$response = Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $Body

$response

The output from the PowerShell script will look like the following screenshot.

This completes the steps to configure permissions for a specific set of SharePoint site collections.
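If you prefer to retrieve the Site ID programmatically rather than through the browser-based method above, the following minimal Python sketch calls the Microsoft Graph sites endpoint using the same tenant ID, client ID, and client secret values as the PowerShell script. The SharePoint host name and site name are placeholders. Note that the Graph site id is a composite value of the form hostname,site-collection-id,web-id; the site collection GUID is the middle segment.

import requests

TENANT_ID = "<Directory (tenant) ID>"
CLIENT_ID = "<SitesFullControlApp client ID>"
CLIENT_SECRET = "<SitesFullControlApp client secret>"

# Client credentials token request against the same endpoint used in the PowerShell script
token_response = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "https://graph.microsoft.com/.default",
    },
)
access_token = token_response.json()["access_token"]

# Resolve the site by its server-relative path to read its ID
site = requests.get(
    "https://graph.microsoft.com/v1.0/sites/yourcompany.sharepoint.com:/sites/HumanResources",
    headers={"Authorization": f"Bearer {access_token}"},
).json()

print(site["id"])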

Option 2: Manually allow access to all SharePoint sites

Complete the following steps to allow full control permissions to all the SharePoint site collections:

  1. In the navigation pane, choose API permissions.
  2. Remove the permissions that were granted by default when you registered the application.
  3. Choose Add a permission and then choose Microsoft Graph.

  1. Choose Delegated permissions.

  1. Select Sites.FullControl.All and choose Add permissions.

Next, you configure Microsoft Graph application permissions.

  1. In the navigation pane, choose API permissions.
  2. Choose Add a permission and then choose Microsoft Graph.
  3. Choose Application permissions.

  1. Add the following application permissions:
    •  GroupMember.Read.All
    •  User.Read.All
    •  Notes.Read.All
    •  Sites.Read.All

Next, you configure SharePoint delegated permissions.

  1. In the navigation pane, choose API permissions.
  2. Choose Add a permission and then choose SharePoint.

  1. Choose Delegated permissions.
  2. Expand AllSites, select AllSites.Read, and choose Add permission.

You will find the permissions listed as shown in the following screenshot.

  1. Some of these permissions require admin consent in a tenant before they can be used. To grant admin consent, choose Grant admin consent for <organization name> and choose Yes to confirm.

After granting admin consent, your permissions will look like the following screenshot.

This completes the steps to configure permissions to allow full control permissions to all the SharePoint site collections.

Create a client secret

Complete the following steps to create a client secret:

  1. In the navigation pane, choose Certificates & secrets.
  2. On the Client secrets tab, choose New client secret.
  3. Enter a value for Description and choose an expiration period for Expires.
  4. Choose Add.

  1. Save the client secret value.

This value is needed when configuring Amazon Q. Client secret values can’t be viewed except immediately after creation, so be sure to save the secret.

Deactivate multi-factor authentication

To deactivate multi-factor authentication (MFA), sign in to the Microsoft Entra Admin Center as a security administrator or global administrator and disable the security defaults.

  1. In the navigation pane, choose Identity and Overview.
  2. On the Properties tab, choose Manage security defaults.
  3. Choose Disabled.
  4. For Reason for disabling, select Other and enter your reason.
  5. Choose Save.

Configure Azure AD using the provided PowerShell script

As with the console-based configuration, you can grant permissions in one of two ways:

  • Option 1: Grant access to specific SharePoint sites. This approach involves granting the Sites.Selected permission, which allows access to a specific set of SharePoint sites.
  • Option 2: Grant access to all SharePoint sites. This approach involves granting the Sites.FullControl.All permission, which allows access to all SharePoint sites in your organization.

When configuring permissions, consider your organization’s SharePoint access requirements. Many SharePoint admins prefer to grant Amazon Q Business access only to specific sites that need to be crawled, in which case Option 1 with the Sites.Selected permission would be suitable.

For either option, the user running the PowerShell script should be an Azure AD tenant admin or have tenant admin permissions. Additionally, the Microsoft Graph PowerShell SDK must be installed on the Windows instance where you run the scripts.

Run one of the provided PowerShell scripts, then follow the additional instructions. The scripts will perform the following tasks:

  • Register a new application in Azure AD/Entra ID
  • Configure the required permissions
  • Provide admin consent for the configured permissions

Option 1: Use a script to allow access to specific SharePoint sites

There is one additional prerequisite for option 1 (granting the Sites.Selected permission): you need access to another application in Microsoft Entra ID that has the Sites.FullControl.All application-level permission. This is required to grant the Sites.Selected permission to the new application you will register. If you don’t have access to an application with the Sites.FullControl.All permission, you can follow the steps in option 2 to register a new application and grant it the Sites.FullControl.All permission. This application will be referred to as SitesFullControlApp.

 Use the following script to grant permissions to a specific SharePoint site. You need the following information before running the script.

  • AppName – Name of the application that you plan to register.
  • SitesFullControlAppID – Application (client) ID of the application that was granted Sites.FullControl.All. This application won’t be used by the Amazon Q Business connector; it’s only needed to grant the Sites.Selected permission to the application you plan to register.
  • SitesFullControlAppClientSecret – Client secret of the SitesFullControlApp.
  • SiteID – SharePoint Site ID.
  • TenantId – Your Azure Active Directory tenant ID.
param(
  [Parameter(Mandatory = $true,
    HelpMessage = "The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was registered")]
  [String]
  $AppID,
  [Parameter(Mandatory = $true,
    HelpMessage = "Application (client) ID that was granted with Sites.FullControl.All")]
  [String]
  $SitesFullControlAppID,

  [Parameter(Mandatory = $true,
    HelpMessage = "Client Secret of the APP ID that was granted with Sites.FullControl.All")]
  [string]
  $SitesFullControlAppClientSecret,

  [Parameter(Mandatory = $true,
    HelpMessage = "SharePoint Site ID")]
  [String]
  $SiteId,

  [Parameter(Mandatory = $true,
    HelpMessage = "Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,

  [Parameter(Mandatory = $false)]
  [Switch]
  $StayConnected = $false
)

# Get an access token using the client credentials of the application that has Sites.FullControl.All permissions, then use that token to grant the newly registered application (which has only Sites.Selected) full control on the specified site.

$Scope = "https://graph.microsoft.com/.default"
$TokenEndpoint = "https://login.microsoftonline.com/$TenantId/oauth2/v2.0/token"

# Body of the request

$body = @{
  grant_type    = "client_credentials"
  client_id     = $SitesFullControlAppID
  client_secret = $SitesFullControlAppClientSecret
  scope         = $Scope
}

# Get access token
$response = Invoke-RestMethod -Uri $TokenEndpoint -Method POST -Body $body 

# URL to grant permission to site
$url = "https://graph.microsoft.com/v1.0/sites/$SiteId/permissions"

# Define the body content as JSON string

$Body = @"
{
  "roles": ["fullcontrol"],
  "grantedToIdentities": [{
    "application": {
      "id": "$AppID",
      "displayName": "$AppName"
    }
  }]
}
"@

# Headers
$headers = @{
  "Authorization" = "Bearer $($response.access_token)"
  "Content-Type"  = "application/json"
}
$response = Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $Body

$response

The output from the PowerShell script will look like the following screenshot.

Note down the secret value shown in the output and then close the window for security. You will not be able to retrieve this value again.

Option 2: Use a script to grant access to all SharePoint sites

The following script grants full control permissions to all the SharePoint site collections. You need the following information before running the script.

  • AppName – The name of the application that you plan to register.
param(
  [Parameter(Mandatory=$true,
  HelpMessage="The friendly name of the app registration")]
  [String]
  $AppName,
  [Parameter(Mandatory=$false,
  HelpMessage="Your Azure Active Directory tenant ID")]
  [String]
  $TenantId,
  [Parameter(Mandatory=$false)]
  [Switch]
  $StayConnected = $false
)

# Requires an admin
if ($TenantId)
{
  Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All  DelegatedPermissionGrant.ReadWrite.All" -TenantId $TenantId
}
else
{
  Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All  DelegatedPermissionGrant.ReadWrite.All"
}
$SitePermissionAllSitesRead = "4e0d77b0-96ba-4398-af14-3baa780278f4"
$GraphPermissionsGroupMemberReadAll  = "98830695-27a2-44f7-8c18-0c3ebc9698f6"
$GraphPermissionsNotesReadAll = "3aeca27b-ee3a-4c2b-8ded-80376e2134a4"
$GraphPermissionsSitesReadAll = "332a536c-c7ef-4017-ab91-336970924f0d"
$GraphPermissionsUserReadAll = "df021288-bdef-4463-88db-98f22de89214"
$GraphPermissionsSitesFullControlAll = "5a54b8b3-347c-476d-8f8e-42d5c7424d29"

# Sharepoint permissions 
$sharePointResourceId = "00000003-0000-0ff1-ce00-000000000000"
$SitePermission = @(
  @{
  Id= $SitePermissionAllSitesRead  #AllSites.Read (Delegated) – Read items in all site collections
  Type="Scope"
}
)

# Graph permissions 
$graphResourceId =  "00000003-0000-0000-c000-000000000000"
$graphPermissions = @(
    @{
        Id =  $GraphPermissionsGroupMemberReadAll  # GroupMember.Read.All (Application)
        Type = "Role"
    },
    @{
        Id = $GraphPermissionsNotesReadAll # Notes.Read.All (Application)
        Type = "Role"
    },
    @{
        Id = $GraphPermissionsSitesReadAll # Sites.Read.All (Application)
        Type = "Role"
    },
    @{
        Id =  $GraphPermissionsUserReadAll # User.Read.All (Application)
        Type = "Role"
    },
     @{
        Id = $GraphPermissionsSitesFullControlAll # Sites.FullControl.All (Delegated)
        Type = "Scope"
    }
)


$requiredResourceAccess = @()

$graphResourceAccess = @{
    ResourceAppId  = $graphResourceId
    ResourceAccess = $graphPermissions
}

$spResourceAccess = @{
    ResourceAppId = $sharePointResourceId
    ResourceAccess = $SitePermission
  }

$requiredResourceAccess += $spResourceAccess
$requiredResourceAccess += $graphResourceAccess


# Get context for access to tenant ID
$context = Get-MgContext

# Create app registration
$appRegistration = New-MgApplication -DisplayName $AppName -SignInAudience "AzureADMyOrg" `
 -Web @{ RedirectUris="http://localhost"; } `
 -RequiredResourceAccess   $requiredResourceAccess `
 -AdditionalProperties @{}
Write-Host -ForegroundColor Cyan "App registration created with app ID" $appRegistration.AppId

# Add client secret
#$clientSecret = [System.Net.WebUtility]::UrlEncode(([System.Text.Encoding]::UTF8.GetBytes((New-Guid).ToString() + "abcdefghijklmnopqrstuvwxyz0123456789")))
$clientSecretCredential = Add-MgApplicationPassword -ApplicationId $appRegistration.Id -PasswordCredential @{ displayName  = "Client Secret"; EndDateTime = (Get-Date).AddYears(2) } 
Write-Host -ForegroundColor Cyan "Client secret created "

$secretValue = $clientSecretCredential.SecretText
Write-Host  -ForegroundColor  Red "Secret Text is [$secretValue]"
Write-Host  -ForegroundColor  Red  "Please Clear the screen after noting down the Secret value."
#$clientSecretCredential |  Format-List

# Create corresponding service principal
New-MgServicePrincipal -AppId $appRegistration.AppId -AdditionalProperties @{} | Out-Null
Write-Host -ForegroundColor Cyan "Service principal created"
Write-Host
Write-Host -ForegroundColor Green "Success"
Write-Host

#Admin consent
$scp = Get-MgServicePrincipal -Filter "DisplayName eq '$($AppName)'" 
$app = Get-MgServicePrincipal -Filter "AppId eq '$graphResourceId'" 

New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId  $GraphPermissionsGroupMemberReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsNotesReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsSitesReadAll
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $GraphPermissionsUserReadAll
New-MgOAuth2PermissionGrant -ClientId $scp.id  -consentType "AllPrincipals"  -resourceId $app.Id  -Scope "Sites.FullControl.All"


if ($StayConnected -eq $false)
{
  Disconnect-MgGraph
  Write-Host "Disconnected from Microsoft Graph"
}
else
{
  Write-Host
  Write-Host -ForegroundColor Yellow "The connection to Microsoft Graph is still active. To disconnect, use Disconnect-MgGraph"
}

The output from the PowerShell script will look like the following screenshot.

Note down the secret value shown in the output and then close the window for security. You will not be able to retrieve this value again.

Configure Amazon Q

Make sure you have set up Amazon Q Business with Entra ID as your identity provider as mentioned in the prerequisites. Also, make sure the user’s email address is in lowercase letters when you create the user in Entra ID.

Follow the instructions in Connecting Amazon Q Business to SharePoint (Online) using the console.

For Step 9 (Authentication), we choose OAuth 2.0 and configure it as follows:

  1. For Tenant ID, enter the tenant ID of your SharePoint account.

This is the directory (tenant) ID in your registered Azure application, in the Azure Portal, as shown in the following screenshot (the IDs will be different for your setup).

  1. For the AWS Secrets Manager secret, create a secret on the Secrets Manager console to store your SharePoint authentication credentials (a Boto3 sketch for creating the secret follows these steps):
  2. For Secret name, enter a name for your secret.
  3. For Username, enter the user name for your SharePoint account.
  4. For Password, enter the password for your SharePoint account.
  5. For Client ID, enter the Azure AD client ID generated when you registered SharePoint in Azure AD. This is the application (client) ID created in the Azure Portal when registering the SharePoint application in Azure, as described earlier.
  6. For Client secret, enter the client secret generated earlier.
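If you prefer to create the secret programmatically instead of through the console, the following is a minimal sketch using the AWS SDK for Python (Boto3). The secret name and all values are placeholders, and the JSON key names mirror the fields called out in the FAQ later in this post (user name, password, clientId, clientSecret, authType); confirm the exact keys the connector expects when you configure it in the Amazon Q console.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Placeholder values -- use your SharePoint account and the Azure AD app details captured earlier
secretsmanager.create_secret(
    Name="QBusiness-SharePoint-OAuth2-secret",
    SecretString=json.dumps({
        "userName": "<SharePoint user name>",
        "password": "<SharePoint password>",
        "clientId": "<Application (client) ID>",
        "clientSecret": "<client secret created earlier>",
        "authType": "OAuth2",  # assumed value for the OAuth 2.0 (ROPC) flow
    }),
)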

Frequently asked questions

In this section, we discuss some frequently asked questions.

Amazon Q Business isn’t answering questions

There are a few possible scenarios for this issue. If no users are getting a response from a specific document, verify that you have synced your data source with Amazon Q. Choose View report in the Sync run history section. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.

If a specific user is unable to get responses, verify that their email address in SharePoint matches the email address of the corresponding identity in IAM Identity Center and that it is entered in lowercase in IAM Identity Center. For more information, refer to Known limitations for the SharePoint (Online) connector.

For troubleshooting purposes, you can use the

Amazon Q Business isn’t answering any questions that are in the event attachment or comments in the SharePoint site

The connector crawls event attachments only when Events is also chosen as an entity to be crawled. Make sure that you chose the corresponding entities in the sync scope.

Error message that authentication failed

In some cases, you might get the error message “Sharepoint Connector Error code: SPE-5001 Error message: Authentication failed” when trying to sync.

To address this, validate that the user name, password, clientId, clientSecret, and authType values are correct in the secret that you created for this connector. Verify that MFA is deactivated.

Amazon Q Business is showing old data after an update

After the content has been updated on SharePoint, you must re-sync the contents for the updated data to be picked up by Amazon Q. Go to the data sources, select the SharePoint data source, and choose Sync now. After the sync is complete, verify that the updated data is reflected by running queries on Amazon Q.
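You can also trigger the sync programmatically. The following is a minimal sketch using the Boto3 qbusiness client’s StartDataSourceSyncJob API; the application, index, and data source IDs are placeholders for your own resources.

import boto3

qbusiness = boto3.client("qbusiness")

# Kick off an on-demand sync of the SharePoint data source
qbusiness.start_data_source_sync_job(
    applicationId="<application-id>",
    indexId="<index-id>",
    dataSourceId="<sharepoint-data-source-id>",
)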

Unable to sign in as a new user through the web experience URL

If you experience issues when signing in, clear your browser cookies and sign in as a new user.

Error message that the application is not set up correctly

Verify that the user or group has subscriptions to Amazon Q Business. Check the corresponding user group and then choose Manage access and subscriptions and select the corresponding subscription.

Error message when uploading a file

In some cases, users might get the following message when they upload a file through their user experience: “Your Amazon Q Business subscription doesn’t include file uploads. Please contact your administrator for assistance.”

Troubleshooting

For troubleshooting guidance, refer to Troubleshooting your SharePoint (Online) connector.

Clean up

Complete the following steps to clean up your resources:

  1. Open the Office 365 Admin Center using the account of a user member of the Tenant Global Admins group.
  2. Navigate to the Microsoft Azure Portal.
  3. Search for and choose App registrations.
  4. Select the app you created earlier, then choose Delete.
  5. On the Amazon Q Business console, choose Applications in the navigation pane.
  6. Select the application you created, and on the Actions menu, choose Delete.

Additional capabilities of Amazon Q Business

Amazon Q Business offers much more than just a powerful AI assistant. Explore its other capabilities that allow you to customize the user experience, empower your workforce, and increase productivity:

  • Admin controls and guardrails – Customize your application environment to your organizational needs. Amazon Q Business offers application environment guardrails or chat controls that you can configure to control the end-user chat experience. For example, admins can define specific topics that should be blocked or controlled in the application. You can customize how Amazon Q Business responds when these topics are mentioned by end-users.
  • Amazon Q Apps – Empower your teams to build lightweight, purpose-built applications that streamline tasks and workflows without coding experience. For example, you could build an app that drafts personalized sales emails to customers informing them of new product launches or generates social media content for specified social media networks based on your data.
  • Plugins for Amazon Q Business – Seamlessly integrate with supported third-party services that allow you to perform specific tasks like creating an incident ticket in ServiceNow or raising an issue in Jira—all without leaving the Amazon Q interface.

Conclusion

In this post, we explored how to integrate Amazon Q Business with SharePoint Online using the OAuth 2.0 ROPC flow authentication method. We provided both manual and automated approaches using PowerShell scripts for configuring the required Azure AD settings. Additionally, we demonstrated how to enter those details along with your SharePoint authentication credentials into the Amazon Q console to finalize the secure connection.

The ROPC flow offers an alternative to certificate-based authentication for connecting Amazon Q Business to SharePoint Online. This can be useful when you want Amazon Q Business to crawl OneNote content, when you prefer not to manage certificates, or in scenarios that require regular password rotation.

By following this post, enterprises can take advantage of the powerful knowledge mining capabilities of Amazon Q to unlock insights from their SharePoint data repositories and knowledge bases.


About the Author

Ramesh Eega is a Global Accounts Solutions Architect based out of Atlanta, GA. He is passionate about helping customers throughout their cloud journey.

Read More

John Snow Labs Medical LLMs are now available in Amazon SageMaker JumpStart

John Snow Labs Medical LLMs are now available in Amazon SageMaker JumpStart

Today, we are excited to announce that John Snow Labs’ Medical LLM – Small and Medical LLM – Medium large language models (LLMs) are now available on Amazon SageMaker Jumpstart. Medical LLM is optimized for the following medical language understanding tasks:

  • Summarizing clinical encounters – Summarizing discharge notes, progress notes, radiology reports, pathology reports, and various other medical reports
  • Question answering on clinical notes or biomedical research – Answering questions about a clinical encounter’s principal diagnosis, test ordered, or a research abstract’s study design or main outcomes

For medical doctors, this tool provides a rapid understanding of a patient’s medical journey, aiding in timely and informed decision-making from extensive documentation. This summarization capability not only boosts efficiency but also makes sure that no critical details are overlooked, thereby supporting optimal patient care and enhancing healthcare outcomes.

In a blind evaluation performed by the John Snow Labs research team, Medical LLM – Small outperformed GPT-4o in medical text summarization, being preferred by doctors 88% more often for factuality, 92% more for clinical relevance, and 68% more for conciseness. The model also excelled in clinical notes question answering, preferred 46% more for factuality, 50% more for relevance, and 44% more for conciseness. In biomedical research question answering, the model was preferred even more dramatically: 175% for factuality, 300% for relevance, and 356% for conciseness. Notably, despite being smaller than competitive models by more than an order of magnitude, the small model performed comparably in open-ended medical question answering tasks.

Medical LLM in SageMaker JumpStart is available in two sizes: Medical LLM – Small and Medical LLM – Medium. The models are deployable on commodity hardware, while still delivering state-of-the-art accuracy. This is significant for medical professionals who need to process millions to billions of patient notes without straining computing budgets.

Both models support a context window of 32,000 tokens, which is roughly 50 pages of text. You can try out the models with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy Medical LLM – Small using SageMaker JumpStart.

About John Snow Labs

John Snow Labs, the AI for healthcare company, provides state-of-the-art software, models, and data to help healthcare and life science organizations put AI to good use. John Snow Labs is the developer behind Spark NLP, Healthcare NLP, and Medical LLMs. Its award-winning medical AI software powers the world’s leading pharmaceuticals, academic medical centers, and health technology companies. John Snow Labs’ Medical Language Models is by far the most widely used natural language processing (NLP) library by practitioners in the healthcare space (Gradient Flow, The NLP Industry Survey 2022 and the Generative AI in Healthcare Survey 2024).

John Snow Labs’ state-of-the-art AI models for clinical and biomedical language understanding include:

  • Medical language models, consisting of over 2,400 pre-trained models for analyzing clinical and biomedical text
  • Visual language models, focused on understanding visual documents and forms
  • Peer-reviewed, state-of-the-art accuracy on a variety of common medical language understanding tasks
  • Tested for robustness, fairness, and bias

What is SageMaker JumpStart

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models (FMs). ML practitioners can deploy FMs to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy a Medical LLM – Small model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and machine learning operations (MLOps) controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping provide data security. The Medical LLM – Small model is available today for deployment and inference in SageMaker Studio.

Discover the Medical LLM – Small model in SageMaker JumpStart

You can access the FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can discover various models by browsing through different hubs, which are named after model providers. You can find the Medical LLM – Small model in the John Snow Labs hub (see the following screenshot). If you don’t see the Medical LLM – Small model, update your SageMaker Studio version by shutting down and restarting. For more information, refer to Shut down and Update Studio Classic Apps.

You can also find the Medical LLM – Small model by searching for “John Snow Labs” in the search field.

You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it. You will also find two options, Deploy and Preview notebooks. Choosing Deploy deploys the model and creates an endpoint.

Subscribe to the Medical LLM – Small model in AWS Marketplace

This model requires an AWS Marketplace subscription. When you choose Deploy in SageMaker Studio, you will be prompted to subscribe to the AWS Marketplace listing if you don’t already have it. If you are already subscribed, choose Deploy.

If you don’t have an active AWS Marketplace subscription, choose Subscribe. You will be redirected to the listing on AWS Marketplace. Review the terms and conditions and choose Accept offer.

After you’ve successfully subscribed to the model on AWS Marketplace, you can now deploy the model in SageMaker JumpStart.

Deploy the Medical LLM – Small model in SageMaker JumpStart

When you choose Deploy in SageMaker Studio, deployment will start.

You can monitor the progress of the deployment on the endpoint details page that you’re redirected to.

On the same endpoint details page, on the Test inference tab, you can send a test inference request to the deployed model. This is useful if you want to verify that your endpoint responds to requests as expected. The following prompt asks the Medical LLM – Small model a question, with the context followed by the question, and checks the resulting response. Performance metrics, such as execution time, are also included.

You can also test out the medical text summarization response.

Deploy the model and run inference through a notebook

Alternatively, you can choose Open in JupyterLab to deploy the model through the example notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources. You can configure additional parameters as needed, but SageMaker JumpStart enables you to deploy and run inference out of the box with the included code.

The notebook already has the necessary code to deploy the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, refer to the API documentation.
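As a minimal sketch of what that looks like with the SageMaker Python SDK, the snippet below deploys the model with a non-default instance type. The model ID and instance type are placeholders; copy the exact model ID from the model card in SageMaker JumpStart and check the card for supported instance types.

from sagemaker.jumpstart.model import JumpStartModel

# "<john-snow-labs-medical-llm-small-model-id>" is a placeholder -- copy the ID from the model card
model = JumpStartModel(
    model_id="<john-snow-labs-medical-llm-small-model-id>",
    instance_type="ml.g5.12xlarge",  # example non-default instance type
)

# Marketplace models require an active subscription (see the previous section)
predictor = model.deploy(accept_eula=True)

# ... run real-time inference against the endpoint, then clean up
predictor.delete_model()
predictor.delete_endpoint()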

After you deploy the model, you can run real-time or batch inference against the deployed endpoint. The notebook includes example code and instructions for both.

Clean up

After you’re done running the notebook, delete all resources that you created in the process.

When deploying the endpoint from the SageMaker Studio console, you can delete it by choosing Delete on the endpoint details page.

If you want to unsubscribe from the model package completely, you need to unsubscribe from the product in AWS Marketplace:

  1. Navigate to the Machine Learning tab on your software subscriptions page.
  2. Locate the listing that you want to cancel the subscription for, then choose Cancel Subscription to cancel the subscription.

Complete these cleanup steps to avoid continued billing for the model.

Conclusion

In this post, we showed you how to get started with the first healthcare-specific model available now in SageMaker JumpStart. Check out SageMaker JumpStart in SageMaker Studio now to get started. To learn more, refer to the following resources:


About the Authors

Art Tuazon is a Solutions Architect on the CSC team at AWS. She supports both AWS Partners and customers on technical best practices. In her free time, she enjoys running and cooking.

Beau Tse is a Partner Solutions Architect at AWS. He focuses on supporting AWS Partners through their partner journey and is passionate about enabling them on AWS. In his free time, he enjoys traveling and dancing.

David headshotDavid Talby is the Chief Technology Officer at John Snow Labs, helping companies apply artificial intelligence to solve real-world problems in healthcare and life science. He was named USA CTO of the Year by the Global 100 Awards and Game Changers Awards in 2022.

Read More

Now Hear This: World’s Most Flexible Sound Machine Debuts

Now Hear This: World’s Most Flexible Sound Machine Debuts

A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text.

While some AI models can compose a song or modify a voice, none have the dexterity of the new offering.

Called Fugatto (short for Foundational Generative Audio Transformer Opus 1), it generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files.

For example, it can create a music snippet based on a text prompt, remove or add instruments from an existing song, change the accent or emotion in a voice — even let people produce sounds never heard before.

“This thing is wild,” said Ido Zmishlany, a multi-platinum producer and songwriter — and cofounder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups. “Sound is my inspiration. It’s what moves me to create music. The idea that I can create entirely new sounds on the fly in the studio is incredible.”

A Sound Grasp of Audio

“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, a manager of applied audio research at NVIDIA and one of the dozen-plus people behind Fugatto, as well as an orchestral conductor and composer.

Supporting numerous audio generation and transformation tasks, Fugatto is the first foundational generative AI model that showcases emergent properties — capabilities that arise from the interaction of its various trained abilities — and the ability to combine free-form instructions.

“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” Valle said.

A Sample Playlist of Use Cases

For example, music producers could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices and instruments. They could also add effects and enhance the overall audio quality of an existing track.

“The history of music is also a history of technology. The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born,” said Zmishlany. “With AI, we’re writing the next chapter of music. We have a new instrument, a new tool for making music — and that’s super exciting.”

An ad agency could apply Fugatto to quickly target an existing campaign for multiple regions or situations, applying different accents and emotions to voiceovers.

Language learning tools could be personalized to use any voice a speaker chooses. Imagine an online course spoken in the voice of any family member or friend.

Video game developers could use the model to modify prerecorded assets in their title to fit the changing action as users play the game. Or, they could create new assets on the fly from text instructions and optional audio inputs.

Making a Joyful Noise

“One of the model’s capabilities we’re especially proud of is what we call the avocado chair,” said Valle, referring to a novel visual created by a generative AI model for imaging.

For instance, Fugatto can make a trumpet bark or a saxophone meow. Whatever users can describe, the model can create.

With fine-tuning and small amounts of singing data, researchers found it could handle tasks it was not pretrained on, like generating a high-quality singing voice from a text prompt.

Users Get Artistic Controls

Several capabilities add to Fugatto’s novelty.

During inference, the model uses a technique called ComposableART to combine instructions that were only seen separately during training. For example, a combination of prompts could ask for text spoken with a sad feeling in a French accent.

The model’s ability to interpolate between instructions gives users fine-grained control over text instructions, in this case the heaviness of the accent or the degree of sorrow.

“I wanted to let users combine attributes in a subjective or artistic way, selecting how much emphasis they put on each one,” said Rohan Badlani, an AI researcher who designed these aspects of the model.

“In my tests, the results were often surprising and made me feel a little bit like an artist, even though I’m a computer scientist,” said Badlani, who holds a master’s degree in computer science with a focus on AI from Stanford.

The model also generates sounds that change over time, a feature he calls temporal interpolation. It can, for instance, create the sounds of a rainstorm moving through an area with crescendos of thunder that slowly fade into the distance. It also gives users fine-grained control over how the soundscape evolves.

Plus, unlike most models, which can only recreate the training data they’ve been exposed to, Fugatto allows users to create soundscapes it’s never seen before, such as a thunderstorm easing into a dawn with the sound of birds singing.

A Look Under the Hood

Fugatto is a foundational generative transformer model that builds on the team’s prior work in areas such as speech modeling, audio vocoding and audio understanding.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.

Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea. Their collaboration made Fugatto’s multi-accent and multilingual capabilities stronger.

One of the hardest parts of the effort was generating a blended dataset that contains millions of audio samples used for training. The team employed a multifaceted strategy to generate data and instructions that considerably expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without requiring additional data.

They also scrutinized existing datasets to reveal new relationships among the data. The overall work spanned more than a year.

Valle remembers two moments when the team knew it was on to something. “The first time it generated music from a prompt, it blew our minds,” he said.

Later, the team demoed Fugatto responding to a prompt to create electronic music with dogs barking in time to the beat.

“When the group broke up with laughter, it really warmed my heart.”

Hear what Fugatto can do:

Read More

Supercharging Training using float8 and FSDP2

Supercharging Training using float8 and FSDP2

IBM: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert Walkup, Praneet Adusumilli, Nirmit Desai, Raghu Ganti, Seetharami Seelam
Meta: Less Wright, Wei Feng, Vasiliy Kuznetsov, Driss Guesseous

In this blog, we will demonstrate how we achieve up to 50% throughput speedup while achieving loss and evaluation benchmark parity in training over FSDP1 bf16 training. We achieve this speedup by leveraging FSDP2, DTensor, and torch.compile with torchao’s float8 via linear layer updates (compute), and float8 all_gathers for weight communication. We showcase these improvements across a spectrum of Meta LLaMa model architecture sizes, ranging from small 1.8B model size all the way to 405B model size, making training faster than ever.

We demonstrate these improvements using the Meta Llama3 architecture, and then perform model quality studies at two scales: 100B tokens at 8B model size, and 50B tokens at 70B model size, which provide an exact comparison of float8 and bf16 training loss curves. We demonstrate that the loss curves result in identical loss convergence across these model training runs compared to the bf16 counterpart. Further, we train a 3B model to 1T tokens using the FineWeb-edu dataset and run standard evaluation benchmarks to ensure that the model quality is intact and comparable to a bf16 run.

At IBM Research, we plan to adopt these capabilities for our data ablations to improve the number of experiments we can perform in a given GPU budget. Longer term, we will follow up with a larger scale model run to demonstrate the end-to-end feasibility of float8 training.

What is Float8?

The float8 format for training models was introduced by NVIDIA, ARM, and Intel in a 2022 paper which demonstrated the feasibility of training using lower precision float8, without sacrificing model quality. With the introduction of newer GPUs like the NVIDIA Hopper series, FP8 training became feasible with the potential of more than 2x improvement in training throughput due to native float8 tensor core support. There are a few challenges to realize this promise:
(i) Enable the core model operations like matmul and attention in float8,
(ii) Enable float8 training in a distributed framework, and
(iii) Enable weight communication between GPUs in float8.
While the float8 matmul was enabled by NVIDIA libraries, the latter two were provided in recent updates to FSDP2 and torchao.

In this blog, we are using torchtitan as the entry point for training, IBM’s deterministic data loader, the float8 linear layer implementation from torchao, and the float8 all gather from the latest PyTorch nightlies in conjunction with FSDP2. For this training, we are using the float8 per tensor (tensorwise) scaling granularity rather than rowwise. We leverage torch.compile to ensure that we get maximum performance gains. We are computing attention in bf16 using SDPA and are currently working on moving this to float8 as well.
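As a rough illustration of the recipe, the sketch below converts a toy model’s linear layers to float8 with tensorwise scaling, enables float8 all-gather for FSDP2, and applies torch.compile. The API names follow recent torchao and PyTorch releases and may differ in your version; an H100-class GPU is required to actually execute float8 matmuls.

import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy stand-in for the Llama transformer blocks trained with torchtitan
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# Swap nn.Linear for float8 linear layers (tensorwise scaling) and enable
# float8 all-gather for FSDP2 weight communication
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(model, config=config)

# In a multi-GPU run, FSDP2's fully_shard would wrap the model here;
# torch.compile is applied on top to recover maximum performance
model = torch.compile(model)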

Experiments

We perform various experiments to demonstrate the benefits of float8 training. The first is to ensure that model quality is not sacrificed. To verify this, we train an 8B model and 70B model for a few thousand steps and compare the loss curves between both the float8 and bf16 training run. Our experiments are performed on three different H100 clusters with 128, 256, and 512 H100 GPU configurations in very different environments to demonstrate reproducibility. The first cluster is customized on Grand Teton in Meta with 400Gbps custom interconnect, the second is an IBM research cluster with 3.2Tbps Infiniband interconnect, and the third is an IBM Cloud cluster with 3.2Tbps RoCE interconnect for GPU-to-GPU communication.

First, we plot the loss curve comparisons for both these models in the below figures to demonstrate loss parity for a few thousand steps.

Figure 1: (a) 8B model loss parity for 2k steps, (b) 70B loss parity for 1k steps

Figure 1: (a) 8B model loss parity for 2k steps, (b) 70B loss parity for 1k steps

Figure 1: (a) 8B model loss parity for 2k steps, (b) 70B loss parity for 1k steps

We observe that across these different models and in different environments, we obtain loss parity for the small scale of tokens. Next, we characterize the throughput gains for four different model sizes ranging from 1.8B to 405B. We explored the best batch size and activation checkpointing schemes for both the float8 and bf16 training runs to determine the tokens/sec/GPU (wps) metric and report the performance gain. For the 405B model, we leveraged DTensor for tensor parallel training with FSDP2. We use a sequence length of 8K for all our measurements.

Model size wps (bf16) wps (float8) Percent gain
1.8B 29K 35K 18%
8B 8K 10K 28%
70B 956 1430 50%
405B (TP4) 149 227 52%

Table 1: Performance gains over bf16 (both bf16 and float8 use torch.compile)

We observe from Table 1 that the gains for the larger models (70B and 405B) reach up to 50%, while the smaller models see gains between roughly 20% and 30%. In further experiments, we observed that the addition of float8 all_gather enables a boost of ~5% beyond the float8 compute itself, which is in line with the observations in this blog.

Second, to demonstrate the effectiveness of an FP8 model, we trained a 3B model following the Llama3 architecture for 1T tokens using the FineWeb-edu dataset from Hugging Face. We performed evaluations using the lm-eval-harness framework and present a small portion of these results in the below table. We observe that the bf16 performance is marginally better than the float8 scores (about one percent). While some scores are significantly better with bf16 (e.g., MMLU is 3 pts higher), we expect these gaps to vanish when the right hyper parameters are chosen and across larger scale training runs (e.g., the bf16 run had half the batch size and it is well known that smaller batch size runs can improve evaluation scores).

Benchmark Score (float8) Score (bf16)
MMLU (5-shot) 0.26 0.29
ARC-e 0.73 0.73
ARC-c 0.43 0.46
Hellaswag 0.65 0.67
sciq 0.89 0.88
OpenBook QA 0.43 0.43
PIQA 0.76 0.76
Winogrande 0.60 0.65
Average 0.59 0.60

Table 2: Benchmark scores for float8 trained model running in FP16 for eval (at 1T tokens of FineWeb pre-training).

Finally, we scale our experiments to 512 H100 GPUs on the IBM Cloud cluster. We were able to recreate the results and speedups that we observed even at 512 GPU scale. We summarize these results only for the large models in the below table (70B and 405B).

Model size wps (bf16) wps (float8) Percent gain
70B 960 1448 51%
405B (TP4) 152 217 43%

Table 3: Performance gains over bf16 (both bf16 and float8 use torch.compile) for 512 GPU scale

Future work

We are also working on evaluating other forms of parallelism such as Context Parallelism. We plan to evaluate all of these features to demonstrate the composability and ability to make choices for training large scale models.

Acknowledgements

We thank Davis Wertheimer from IBM Research for enabling the data loader for torchtitan runs, which allowed us to replay data in the same order across multiple runs. We also thank IBM Cloud for providing us with early test access to the H100 cluster.

Read More

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

Companies across various scales and industries are using large language models (LLMs) to develop generative AI applications that provide innovative experiences for customers and employees. However, building or fine-tuning these pre-trained LLMs on extensive datasets demands substantial computational resources and engineering effort. With the increase in sizes of these pre-trained LLMs, the model customization process becomes complex, time-consuming, and often prohibitively expensive for most organizations that lack the necessary infrastructure and skilled talent.

In this post, we demonstrate how you can address these challenges by using a fully managed environment with Amazon SageMaker Training jobs to fine-tune the Mixtral 8x7B model using PyTorch Fully Sharded Data Parallel (FSDP) and Quantized Low Rank Adaptation (QLoRA).

We guide you through a step-by-step implementation of model fine-tuning on a GEM/viggo dataset, employing the QLoRA fine-tuning strategy on a single p4d.24xlarge worker node (providing 8 Nvidia A100 40GB GPUs).

Business challenge

Today’s businesses are looking to adopt a variety of LLMs to enhance business applications. Primarily, they’re looking for foundation models (FMs) that are open source (that is, models whose weights are openly available and can be used without modification) and can offer computational efficiency and versatility. Mistral’s Mixtral 8x7B model, released with open weights under the Apache 2.0 license, is one of the models that has gained popularity with large enterprises due to the high performance that it offers across various tasks. Mixtral employs a sparse mixture of experts (SMoE) architecture, selectively activating only a subset of its parameters for each input during model training. This architecture allows the model to use only 13B (about 18.5%) of its 46.7B total parameters during inference, making it high performing and efficient.

These FMs work well for many use cases but lack domain-specific information that limits their performance at certain tasks. This requires businesses to use fine-tuning strategies to adapt these large FMs to specific domains, thus improving performance on targeted applications. Due to the growing number of model parameters and the increasing context lengths of these modern LLMs, this process is memory intensive and requires advanced AI expertise to align and optimize them effectively. The cost of provisioning and managing the infrastructure increases the overall cost of ownership of the end-to-end solution.

In the upcoming section, we discuss how you can cost-effectively build such a solution with advanced memory optimization techniques using Amazon SageMaker.

Solution overview

To address the memory challenges of fine-tuning LLMs such as Mixtral, we will adopt the QLoRA method. As shown in the following diagram, QLoRA freezes the original model’s weights and adds low-rank trainable parameters to the transformer layers. QLoRA further uses quantization to represent the actual model’s weights in a compact, optimized format such as 4-bit NormalFloat (NF4), effectively compressing the model and reducing its memory footprint. This enables training and fine-tuning these LLMs even on systems with limited memory while maintaining performance comparable to half-precision fine-tuning. QLoRA’s support for double quantization and paged optimizers reduces the memory footprint further by quantizing the quantization constants and effectively handling any sudden memory demands.

During the forward pass computation of this architecture, the 4-bit weights get dequantized to bfloat16 (BF16) precision. On the other hand, the LoRA adapters continue to operate on BF16 precision data. Both (original weights and adapter output vectors) are then added together element-wise to produce the final result, denoted as h.

During the backward pass of the model, the gradients are computed with respect to only the LoRA parameters, not the original base model weights. Although the dequantized original weights are used in calculations, the original 4-bit quantized weights of the base model remain unchanged.

To adopt the following architecture, we will use the Hugging Face Parameter-Efficient Fine-Tuning (PEFT) library, which integrates directly with bitsandbytes. This way, the QLoRA fine-tuning technique can be adopted with just a few lines of code.
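As a minimal sketch (the actual hyperparameters and target modules live in the workshop notebook referenced later), loading the base model in 4-bit NF4 with double quantization and attaching LoRA adapters looks roughly like the following; the rank, alpha, and target module names are illustrative.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters attached to the attention projection layers (illustrative choices)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable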

QLoRA operates on a large FM. In the figure below, X denotes the input tokens of the training data, W is the existing model weights (quantized), and Wa, Wb are the segments of the adapters added by QLoRA. The original model’s weights (W) are frozen, and QLoRA adds adapters (Wa, Wb), which are low-rank trainable parameters, onto the existing transformer layer.

QLoRA explanation showing adapters added onto the existing transformer layer

Figure 1: This figure shows how QLoRA operates. The original model’s weights (W) are frozen, and QLoRA adds in adapters (Wa, Wb) onto the existing transformer layer.

Although QLoRA helps optimize memory during fine-tuning, we will use Amazon SageMaker Training to spin up a resilient training cluster, manage orchestration, and monitor the cluster for failures. By offloading the management and maintenance of the training cluster to SageMaker, we reduce both training time and our total cost of ownership (TCO). Using this approach, you can focus on developing and refining the model while using the fully managed training infrastructure provided by SageMaker Training.

Implementation details

We spin up the cluster by calling the SageMaker control plane through APIs or the AWS Command Line Interface (AWS CLI) or using the SageMaker AWS SDK. In response, SageMaker spins up training jobs with the requested number and type of compute instances. In our example, we use one ml.p4d.24xlarge compute instance.
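A minimal sketch of launching such a training job with the SageMaker Python SDK follows. The entry point, source directory, role, container versions, and hyperparameters are placeholders; the actual definitions are in the notebook from the workshop repository.

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # fine-tuning script from the cloned repository
    source_dir="./scripts",
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    transformers_version="4.36",       # example container versions -- match the notebook
    pytorch_version="2.1",
    py_version="py310",
    role="<SageMaker execution role ARN>",
    hyperparameters={"model_id": "mistralai/Mixtral-8x7B-v0.1"},
    distribution={"torch_distributed": {"enabled": True}},  # launch via torchrun so FSDP can shard across the 8 GPUs
)

estimator.fit({"training": "s3://<bucket>/<prefix>/train/"})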

To take complete advantage of this multi-GPU cluster, we use the recent support of QLoRA and PyTorch FSDP. Although QLoRA reduces computational requirements and memory footprint, FSDP, a data/model parallelism technique, will help shard the model across all eight GPUs (one ml.p4d.24xlarge), enabling training the model even more efficiently. Hugging Face PEFT is where the integration happens, and you can read more about it in the PEFT documentation.

QLoRA adapters are added to the linear layers in the model. The layers (for example, transformer layers, gate networks, and feed-forward networks) put together will form the entire model, as shown in the following diagram, which will be considered to be sharded by FSDP across our cluster (shown as small shards in blue).

The following architecture diagram shows how you can use SageMaker Training to have the SageMaker Control Plane spin up a resilient training job cluster. SageMaker downloads the training image from Amazon Elastic Container Registry (Amazon ECR) and will use Amazon Simple Storage Service (Amazon S3) as an input training data source and to store training artifacts.

Architecture Diagram

Figure 3: Architecture Diagram showing how you can utilize SageMaker Training Jobs to spin up a resilient training cluster. Amazon ECR contains the training image, and Amazon S3 contains the training artifacts.

To put this solution into practice, work through the following steps.

Prerequisites

To implement this solution, you need the following prerequisites in place:

  1. Create a Hugging Face User Access Token and get access to the gated repo mistralai/Mixtral-8x7B-v0.1 on Hugging Face.
  2. (Optional) Create a Weights & Biases API key to access the Weights & Biases dashboard for logging and monitoring. This is recommended if you’d like to visualize training-specific metrics.
  3. Request a service quota at Service Quotas for 1x ml.p4d.24xlarge on Amazon SageMaker. To request a service quota increase, on the AWS Service Quotas console, navigate to AWS services, Amazon SageMaker, and choose ml.p4d.24xlarge for training job usage.
  4. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess and AmazonEC2FullAccess to give required access to SageMaker to run the examples.

This role is for demonstration purposes only. You need to adjust it to your specific security requirements for production. Adhere to the principle of least privilege while defining IAM policies in production.

  5. (Optional) Create an Amazon SageMaker Studio domain (see Quick setup to Amazon SageMaker) to access Jupyter notebooks with the preceding role. (You can also use JupyterLab in your local setup.)
  6. Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets.
$ git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
$ cd 15_mixtral_finetune_qlora

The 15_mixtral_finetune_qlora directory contains the training scripts that you might need to deploy this sample.

Next, we will run the finetune-mixtral.ipynb notebook to fine-tune the Mixtral 8x7B model using QLoRA on SageMaker. Check out the notebook for more details on each step. In the next section, we walk through the key components of the fine-tuning execution.

Solution walkthrough

To implement the solution, follow the steps in the next sections.

Step 1: Set up required libraries

Install the relevant HuggingFace and SageMaker libraries:

!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "py7zr" "peft==0.12.0" --upgrade --quiet

Step 2: Load dataset

In this example, we use the GEM/viggo dataset from Hugging Face. This is a data-to-text generation dataset in the video game domain. The dataset is clean and organized with about 5,000 data points, and the responses are more conversational than information seeking. This type of dataset is ideal for extracting meaningful information from customer reviews. For example, an ecommerce application such as Amazon.com could use a similarly formatted dataset for fine-tuning a model for natural language processing (NLP) analysis to gauge interest in products sold. The results can be used for recommendation engines. Thus, this dataset is a good candidate for fine-tuning LLMs. To learn more about the viggo dataset, check out this research paper.

Load the dataset and convert it to the required prompt structure. The prompt is constructed with the following elements:

  • Target sentence – Think of this as the final review. In the dataset, this is target.
  • Meaning representation – Think of this as a deconstructed review, broken down by attributes such as inform, request, or give_opinion. In the dataset, this is meaning_representation.

Running the following cell gives us the train_set and test_set (training split and testing split, respectively) with structured prompts. We use the Python map function to structure the dataset splits according to our prompt.

def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""
      Given a target sentence, construct the underlying 
      meaning representation ...
      ['inform', 'request', 'give_opinion', 'confirm', 
      'verify_attribute', 'suggest', 'request_explanation', 
      'recommend', 'request_attribute']

      The attributes must be one of the following:
      ['name', 'exp_release_date', 'release_year', 
      'developer', 'esrb', 'rating', 'genres', 
      'player_perspective', 'has_multiplayer', 'platforms', 
      'available_on_steam', 'has_linux_release', 
      'has_mac_release', 'specifier']

      ### Target sentence:
      {data_point["target"]}

      ### Meaning representation:
      {data_point["meaning_representation"]}
    """
    return {"prompt": full_prompt.strip()}

# Load the dataset splits from the Hugging Face Hub
from datasets import load_dataset

dataset_name = "GEM/viggo"
train_set = load_dataset(dataset_name, split="train")
test_set = load_dataset(dataset_name, split="test")

# Columns to drop after mapping, so only the structured prompt remains
columns_to_remove = list(train_set.features)

train_dataset = train_set.map(
  generate_and_tokenize_prompt,
  remove_columns=columns_to_remove,
  batched=False
)

test_dataset = test_set.map(
  generate_and_tokenize_prompt,
  remove_columns=columns_to_remove,
  batched=False
)

Upload the dataset to Amazon S3. This step is crucial because the dataset stored in Amazon S3 will serve as the input data channel for the SageMaker training cluster. SageMaker will efficiently manage the process of distributing this data across the training cluster, allowing each node to access the necessary information for model training.

input_path = f's3://{sess.default_bucket()}/datasets/mixtral'

# Save datasets to s3
train_dataset.to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
test_dataset.to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

We analyze the distribution of prompt tokens to determine the maximum sequence length required for training our model in the upcoming steps.

The following graph shows the prompt tokens plotted. The x-axis is the length of the prompts, and the y-axis is the number of times that length occurs in the training dataset (frequency). We use this to determine the maximum sequence length and pad the rest of the data points accordingly. The maximum number of words in our example is 173.

Input Tokens Distribution

Figure 4: Graph showing the distribution of input token lengths prompted. The x-axis shows the lengths and the y-axis shows the frequency with which those input token lengths occur in the train and test dataset splits.
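
The following is a minimal sketch (assuming the tokenizer for the base model and the train_dataset created earlier) of how such a distribution can be computed and plotted to choose a maximum sequence length:

import matplotlib.pyplot as plt
from transformers import AutoTokenizer

# Requires access to the gated Mixtral repo (Hugging Face token from the prerequisites)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Tokenize each structured prompt and record its length
token_lengths = [len(tokenizer.encode(row["prompt"])) for row in train_dataset]

plt.hist(token_lengths, bins=50)
plt.xlabel("Prompt length (tokens)")
plt.ylabel("Frequency")
plt.title("Distribution of input token lengths")
plt.show()

print(f"Longest prompt: {max(token_lengths)} tokens")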

Step 3: Configure the parameters for SFTTrainer for the fine-tuning task

We use TrlParser to parse hyperparameters in a YAML file required to configure the SFTTrainer API for fine-tuning the model. This approach offers flexibility because we can also overwrite the arguments specified in the config file by explicitly passing them through the command line interface.
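
Inside the training script, the parsing follows roughly this pattern (a sketch; the actual script in the repository defines its own dataclass of script arguments, shown here as a hypothetical subset):

from dataclasses import dataclass, field
from transformers import TrainingArguments
from trl import TrlParser

@dataclass
class ScriptArguments:
    # Hypothetical subset of the script's arguments
    model_id: str = field(default="mistralai/Mixtral-8x7B-v0.1")
    train_dataset_path: str = field(default="/opt/ml/input/data/train/")
    test_dataset_path: str = field(default="/opt/ml/input/data/test/")
    max_seq_length: int = field(default=2048)

# Values come from the YAML file passed via --config and can be overridden on the CLI
parser = TrlParser((ScriptArguments, TrainingArguments))
script_args, training_args = parser.parse_args_and_config()

The args.yaml file itself is created as follows: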

cat > ./args.yaml <<EOF
model_id: "mistralai/Mixtral-8x7B-v0.1" # Hugging Face model id
max_seq_length: 2048 # based on the prompt length distribution graph
train_dataset_path: "/opt/ml/input/data/train/" # path to where SageMaker saves train dataset
test_dataset_path: "/opt/ml/input/data/test/" # path to where SageMaker saves test dataset
output_dir: "/opt/ml/model/mixtral/adapter" # path to where SageMaker will upload the model
...

num_train_epochs: 1 # number of training epochs
per_device_train_batch_size: 10 # batch size per device during training
gradient_accumulation_steps: 1 # number of steps before performing a backward/update pass
optim: adamw_torch # use torch adamw optimizer
...

bf16: true # use bfloat16 precision
tf32: true # use tf32 precision
gradient_checkpointing: true # use gradient checkpointing to save memory

# offload FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap" # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
EOF

Step 4: Review the launch script

You are now prepared to fine-tune the model using a combination of PyTorch FSDP and QLoRA. We’ve prepared a script called launch_fsdp_qlora.py that will perform the tasks mentioned in the following steps. The following is a quick review of the key points in this script before launching the training job.

  1. Load the dataset from a JSON file located at the specified path, using the load_dataset function to prepare it for model training.
# Load datasets
train_dataset = load_dataset(
  "json",
  data_files=os.path.join(script_args.train_dataset_path, 
  "dataset.json"),
  split="train",
)
  2. Prepare the tokenizer and the model.

We employ the BitsAndBytes library to configure 4-bit quantization settings for our model, enabling memory-efficient loading and computation.

By setting parameters such as load_in_4bit and bnb_4bit_use_double_quant to True, we enable a dramatic reduction in model size without significant loss in performance. The nf4 quantization type, coupled with bfloat16 compute and storage data types, allows for nuanced control over the quantization process, striking an optimal balance between model compression and accuracy preservation. This configuration enables the deployment of massive models on resource-constrained hardware, making advanced AI more accessible and practical for a wide range of applications.

# Configure model quantization
torch_dtype = torch.bfloat16
quant_storage_dtype = torch.bfloat16

# Configures 4-bit quantization settings for the model
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch_dtype,
  bnb_4bit_quant_storage=quant_storage_dtype,
)

model_loading_params = {
  "quantization_config": quantization_config,
  "torch_dtype": quant_storage_dtype,
  "use_cache": False if 
  training_args.gradient_checkpointing else True
}

# Loads a pre-trained model from the specified model ID
model = AutoModelForCausalLM.from_pretrained(
  script_args.model_id,
  cache_dir="/opt/ml/sagemaker/warmpoolcache",
  **model_loading_params
)
  3. Initiate the training process using SFTTrainer from the Transformer Reinforcement Learning (TRL) library to fine-tune the model. The SFTTrainer simplifies supervised fine-tuning for LLMs, making it efficient to adapt pre-trained models to specific tasks or domains.

We use the LoraConfig class from Hugging Face’s PEFT library to configure and add LoRA parameters (also called “adapters”) to the model.

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
  lora_alpha=8,
  lora_dropout=0.05,
  r=16,
  ...
)

################
# Training
################
trainer = SFTTrainer(
  model=model,
  args=training_args,
  train_dataset=train_dataset,
  eval_dataset=test_dataset,
  peft_config=peft_config,
  max_seq_length=script_args.max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  ...
)

trainer.train(resume_from_checkpoint=checkpoint)

Step 5: Fine-tune your model

To fine-tune your model, follow the steps in the next sections.

Launch the training job

You are now ready to launch the training. We use the SageMaker Training estimator, which uses torchrun to initiate distributed training.

The SageMaker estimator simplifies the training process by automating several key tasks in this example:

  1. The SageMaker estimator spins up a training cluster of one ml.p4d.24xlarge instance. SageMaker handles the setup and management of these compute instances, which reduces your TCO.
  2. This estimator also uses a pre-built PyTorch container managed by SageMaker, which includes an optimized, compiled version of the PyTorch framework along with its required dependencies and GPU-specific libraries for accelerated computation.
pytorch_estimator = PyTorch(
  entry_point= 'launch_fsdp_qlora.py',
  source_dir="./scripts",
  ...
  framework_version="2.2.0",
  py_version="py310",
  instance_count=1,
  instance_type="ml.p4d.24xlarge",
  sagemaker_session=sess,
  disable_output_compression=True,
  keep_alive_period_in_seconds=1800,
  distribution={"torch_distributed": {"enabled": True}},
  hyperparameters={
    "config": "/opt/ml/input/data/config/args.yaml"  # path to the TRL config uploaded to S3
  }
)
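
The training job is then launched by calling fit with the S3 locations as named input channels. SageMaker mounts each channel under /opt/ml/input/data/<channel name>, which is why args.yaml points the script at /opt/ml/input/data/train/ and /opt/ml/input/data/test/. The notebook’s exact call may differ slightly; the following is a sketch that assumes the args.yaml file was uploaded to an S3 location referenced by train_config_s3_path:

# Map S3 locations to named SageMaker input channels (train, test, config)
data = {
  "train": train_dataset_s3_path,
  "test": test_dataset_s3_path,
  "config": train_config_s3_path,  # assumed S3 path of the uploaded args.yaml
}

# Launch the training job; wait=True streams the logs until the job completes
pytorch_estimator.fit(data, wait=True)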

The training process generates trained adapters that will be saved in a default S3 bucket named sagemaker-<region name>-<account_id> for this job.

Monitor your training run

You can monitor training metrics, such as loss and learning rate, for your training run through the Weights & Biases dashboard. The following figures show the results of the training run, where we track GPU utilization and GPU memory utilization.
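
If you created the optional Weights & Biases API key, one common pattern (a sketch, not the notebook’s exact code) is to pass the key to the training container as an environment variable and enable the wandb integration in the Hugging Face training arguments, for example by adding report_to: wandb to args.yaml:

# Sketch: environment variables that expose the optional Weights & Biases key to the training container
wandb_environment = {
  "WANDB_API_KEY": "<your-wandb-api-key>",  # from the optional prerequisite
  "WANDB_PROJECT": "mixtral-8x7b-qlora",    # hypothetical project name
}

# Pass this dictionary through the estimator's environment argument when constructing
# the PyTorch estimator shown earlier, for example PyTorch(..., environment=wandb_environment)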

The example is optimized to use GPU memory to its maximum capacity. Note that increasing the batch size any further will lead to CUDA Out of Memory errors.

The following graph shows the GPU memory utilization (for all eight GPUs) during the training process. You can also observe the GPU memory utilization for any given point in time.

GPU Memory Utilization

Figure 5: This graph shows the GPU Memory utilization plotted for all 8 GPUs in the training job.

The following graph shows the GPU compute utilization (for all eight GPUs) during the training process. You can also observe the GPU compute utilization for any given point in time.

GPU Compute Utilization

Figure 6: This graph shows the GPU Compute utilization plotted for all 8 GPUs in the training job.

Step 6: Merge the trained adapter with the base model for inference

Merge the trained LoRA adapter with the base model. After the merge is complete, run inference to compare the results. Specifically, look at how the new fine-tuned and merged model performs compared to the original unmodified Mixtral-8x7b model. The example performs both the adapter merge and the inference in the same launch script, merge_model_adapter.py.

Before launching the training job, review the key components of the merge script:

Use the Hugging Face Transformers library. Specifically, use AutoModelForCausalLM to load the base model from the specified Hugging Face model ID (mistralai/Mixtral-8x7B-v0.1). We configure the load for low CPU memory usage (low_cpu_mem_usage=True) to reduce peak CPU memory while the weights are loaded, and we use automatic device mapping (device_map="auto") with an offload folder to manage resource constraints.

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
  model_id,
  low_cpu_mem_usage=True,
  #torch_dtype=torch.float16,
  device_map="auto",
  offload_folder="/opt/ml/model/"
)

# Load the adapter
peft_model = PeftModel.from_pretrained(
  base_model,
  adapter_dir,
  #torch_dtype=torch.float16,  # Set dtype to float16
  offload_folder="/opt/ml/model/"
)

# Merge the base model with the trained adapter
model = peft_model.merge_and_unload()
print("Merge done")

After the model is merged, send inference requests to generate responses.

def generate_text(model, prompt, max_length=500, num_return_sequences=1):
    ...

    input_ids = tokenizer.encode(prompt_input, return_tensors="pt").to(device)

    # Generate text
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            no_repeat_ngram_size=2,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

    # Decode and return the generated text
    generated_texts = [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

    return generated_texts

print(f"nnn*** Generating Inference on Base Model: {generate_text(base_model,prompt)}nnn")

print(f"***nnn Generating Inference on Trained Model: {generate_text(model,prompt)}nnn")

Step 7: Launch the SageMaker training job to merge the adapter

Run the following script as part of the SageMaker training job.

First, explore the adapters that were saved as part of the training run.

adapter_dir_path=f"{model_artifacts}/mixtral/adapter/"

print(f'\nAdapter S3 Dir path: {adapter_dir_path}\n')

!aws s3 ls {adapter_dir_path}

# Reference Output
Adapter S3 Dir path:s3://sagemaker-<Region>-<Account-ID>/mixtral-8-7b-finetune-2024-09-08-22-27-42-099/output/model/mixtral/adapter/

PRE checkpoint-64/
PRE runs/
2024-09-08 23:08:07       5101 README.md
2024-09-08 23:07:58        722 adapter_config.json
2024-09-08 23:08:06  969174880 adapter_model.safetensors
2024-09-08 23:08:08        437 special_tokens_map.json
2024-09-08 23:08:04    1795596 tokenizer.json
2024-09-08 23:08:04        997 tokenizer_config.json
2024-09-08 23:08:04       5688 training_args.bin

Create and run the PyTorch estimator to configure the training job.

pytorch_estimator_adapter = PyTorch(
  entry_point= 'merge_model_adapter.py',
  source_dir="./scripts",
  job_name=job_name,
  base_job_name=job_name,
  max_run=5800,
  role=role,
  framework_version="2.2.0",
  py_version="py310",
  instance_count=1,
  instance_type="ml.p4d.24xlarge",
  sagemaker_session=sess,
  disable_output_compression=True,
  keep_alive_period_in_seconds=1800,
  hyperparameters={
    "model_id": "mistralai/Mixtral-8x7B-v0.1",  # Hugging Face model id
    "hf_token": "<hf-token>",
    "dataset_name":dataset_name
  }
)

# starting the train job with our uploaded datasets as input
pytorch_estimator_adapter.fit(data, wait=True)

Here’s the target sentence (key prompt) to generate model inference results:

Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. 
Is your opinion true for all games which don't have multiplayer?

Ground truth inference (data label):

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation]) 

Original model inference (that is, meaning representation):

inform(name(Little Big Adventure), has_multiplayer(Little Big Adventure))

Fine-tuned model inference result (that is, meaning representation):

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])

The preceding results compare the inference results of the fine-tuned model against both the ground truth and the inference results of the original unmodified Mixtral 8x7B model. You can observe that the fine-tuned model provides more details and better representation of the meaning than the base model. Run systematic evaluation to quantify the fine-tuned model’s improvements for your production workloads.
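
As a starting point for such an evaluation, a minimal sketch (assuming parallel lists of model outputs and reference labels, for example produced with the generate_text function and the dataset’s meaning_representation field) could compute exact-match accuracy over the test split:

def exact_match_rate(predictions, references):
    """Fraction of generated meaning representations that exactly match the labels."""
    matches = sum(pred.strip() == ref.strip() for pred, ref in zip(predictions, references))
    return matches / len(references)

# Illustrative usage with the example shown above
predictions = ["verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])"]
references = ["verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])"]
print(f"Exact match: {exact_match_rate(predictions, references):.2%}")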

Clean up

To clean up your resources to avoid incurring any more charges, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. (Optional) Delete the SageMaker Studio domain.
  3. Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
Clean Up

Figure 7: Screenshot showing that there are no training jobs running anymore. This is what your console should look like after you follow the cleanup steps.
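
As an alternative to checking the console, you can verify programmatically that no training jobs are still in progress (a sketch using boto3):

import boto3

# List any SageMaker training jobs that are still running; the list should be empty after cleanup
sm_client = boto3.client("sagemaker")
response = sm_client.list_training_jobs(StatusEquals="InProgress")

for job in response["TrainingJobSummaries"]:
    print(f"Still running: {job['TrainingJobName']}")
print(f"In-progress training jobs: {len(response['TrainingJobSummaries'])}")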

To learn more about cleaning up your provisioned resources, check out Clean up.

Conclusion

In this post, we provided a step-by-step guide to fine-tuning the Mixtral 8x7B MoE model with QLoRA. We used SageMaker training jobs together with the Hugging Face PEFT package for QLoRA and bitsandbytes for quantization to perform the fine-tuning task. The fine-tuning was conducted with the quantized model loaded on a single compute instance, which eliminates the need for a larger cluster. As observed, the model performance improved after a single training epoch, as configured in args.yaml.

To learn more about Mistral on AWS and to find more examples, check out the mistral-on-aws GitHub repository. To get started, check out the notebook on the mixtral_finetune_qlora GitHub repository. To learn more about generative AI on AWS, check out Generative AI on AWS, Amazon Bedrock, and Amazon SageMaker.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Kanwaljit Khurmi is an AI/ML Principal Solutions Architect at Amazon Web Services. He works with AWS product teams, engineering, and customers to provide guidance and technical assistance for improving the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Nishant Karve is a Sr. Solutions Architect aligned with the healthcare and life sciences (HCLS) domain. He collaborates with large HCLS customers for their generative AI initiatives and guides them from ideation to production.


Amazon SageMaker Inference now supports G6e instances

Amazon SageMaker Inference now supports G6e instances

As the demand for generative AI continues to grow, developers and enterprises seek more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we are thrilled to announce the availability of G6e instances powered by NVIDIA’s L40S Tensor Core GPUs on Amazon SageMaker. You can provision nodes with 1, 4, or 8 L40S GPUs, with each GPU providing 48 GB of memory. This launch gives organizations the capability to use a single-GPU instance (G6e.xlarge) to host powerful open-source foundation models such as Llama 3.2 11B Vision, Llama 2 13B, and Qwen 2.5 14B, offering a cost-effective and high-performing option. This makes it a perfect choice for those looking to optimize costs while maintaining high performance for inference workloads.

The key highlights for G6e instances include:

  • Twice the GPU memory compared to G5 and G6 instances, enabling deployment of large language models in FP16 (see the quick memory check after this list) up to:
    • 14B parameter model on a single GPU node (G6e.xlarge)
    • 72B parameter model on a 4 GPU node (G6e.12xlarge)
    • 90B parameter model on an 8 GPU node (G6e.48xlarge)
  • Up to 400 Gbps of networking throughput
  • Up to 384 GB GPU Memory
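
As a quick sanity check on those model sizes, a back-of-the-envelope calculation (which ignores KV cache and activation memory) puts FP16 weights at about 2 bytes per parameter:

# Rough FP16 weight-memory estimate: ~2 bytes per parameter (ignores KV cache and activations)
def fp16_weight_gb(num_params_billions):
    return num_params_billions * 2  # billions of params x 2 bytes per param ~= GB of weights

configs = {
    "G6e.xlarge (1 x 48 GB)": (14, 48),
    "G6e.12xlarge (4 x 48 GB)": (72, 4 * 48),
    "G6e.48xlarge (8 x 48 GB)": (90, 8 * 48),
}
for name, (params_b, gpu_mem_gb) in configs.items():
    print(f"{name}: ~{fp16_weight_gb(params_b):.0f} GB of weights vs {gpu_mem_gb} GB of GPU memory")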

Use cases

G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e provides higher performance and is more cost-effective compared to G5 instances, making it an ideal fit for low-latency, real-time use cases such as:

  • Chatbots and conversational AI
  • Text generation and summarization
  • Image generation and vision models

We have also observed that G6e performs well for inference at high concurrency and with longer context lengths. We have provided complete benchmarks in the following section.

Performance

In the following two figures, we see that for long context lengths of 512 and 1024, G6e.2xlarge provides up to 37% better latency and 60% higher throughput compared to G5.2xlarge for a Llama 3.1 8B model.

In the following two figures, we see that G5.2xlarge throws a CUDA out of memory (OOM) when deploying the LLama 3.2 11B Vision model, whereas G6e.2xlarge provides great performance.

In the following two figures, we compare G5.48xlarge (8 GPU node) with the G6e.12xlarge (4 GPU) node, which costs 35% less and is more performant. For higher concurrency, we see that G6e.12xlarge provides 60% lower latency and 2.5 times higher throughput.

In the following figure, we compare the cost per 1,000 tokens when deploying Llama 3.1 70B, which further highlights the cost/performance benefits of using G6e instances compared to G5.

Deployment walkthrough

Prerequisites

To try out this solution using SageMaker, you’ll need the following prerequisites:

Deployment

You can clone the repository and use the notebook provided here.
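
If you’d like to see the shape of the deployment before opening the notebook, the following is a minimal sketch (not the notebook’s exact code; the model choice, environment values, and timeout are assumptions) of hosting an open LLM on a G6e instance with the Hugging Face TGI container on SageMaker:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # or an IAM role ARN with SageMaker permissions

# Hugging Face LLM (TGI) inference container image
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model; gated repos also need HF_TOKEN
        "SM_NUM_GPUS": "1",                                 # one L40S GPU on ml.g6e.2xlarge
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

# Deploy to a single G6e instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    container_startup_health_check_timeout=900,
)

print(predictor.predict({"inputs": "What makes G6e instances a good fit for LLM inference?"}))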

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

predictor.delete_predictor()

Conclusion

G6e instances on SageMaker unlock the ability to deploy a wide variety of open source models cost-effectively. With superior memory capacity, enhanced performance, and cost-effectiveness, these instances represent a compelling solution for organizations looking to deploy and scale their AI applications. The ability to handle larger models, support longer context lengths, and maintain high throughput makes G6e instances particularly valuable for modern AI applications. Try the code to deploy with G6e.


About the Authors

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Pavan Kumar Madduri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in Generative AI and is passionate about helping customers harness the power of the cloud. He earned his MS in Information Technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.
