Get started with Amazon Titan Text Embeddings V2: A new state-of-the-art embeddings model on Amazon Bedrock

Embeddings are integral to various natural language processing (NLP) applications, and their quality is crucial for optimal performance. They are commonly used in knowledge bases to represent textual data as dense vectors, enabling efficient similarity search and retrieval. In Retrieval Augmented Generation (RAG), embeddings are used to retrieve relevant passages from a corpus to provide context for language models to generate informed, knowledge-grounded responses. Embeddings also play a key role in personalization and recommendation systems by representing user preferences, item characteristics, and historical interactions as vectors, allowing calculation of similarities for personalized recommendations based on user behavior and item embeddings. As new embedding models are released with incremental quality improvements, organizations must weigh the potential benefits against the associated costs of upgrading, considering factors like computational resources, data reprocessing, integration efforts, and projected performance gains impacting business metrics.

In September 2023, we announced the launch of Amazon Titan Text Embeddings V1, a multilingual text embeddings model that converts text inputs like single words, phrases, or large documents into high-dimensional numerical vector representations. Since then, many of our customers have used the V1 model, which supported over 25 languages, accepted inputs of up to 8,192 tokens, and produced output vectors of 1,536 dimensions for high accuracy and low latency. The model was made available as a serverless offering via Amazon Bedrock, simplifying embedding generation and integration with downstream applications. We published a follow-up post on January 31, 2024, providing code examples using AWS SDKs and LangChain and showcasing a Streamlit semantic search app.

Today, we are happy to announce Amazon Titan Text Embeddings V2, our second-generation embeddings model for Amazon Bedrock. The new model is optimized for the most common use cases we see with many of our active customers, including RAG, multi-language, and code embedding use cases. The following table summarizes the key differences compared to V1.

Feature                           | Amazon Titan Text Embeddings V1 | Amazon Titan Text Embeddings V2
Output dimension support          | 1536                            | 256, 512, 1024
Language support                  | 25+                             | 100+
Unit vector normalization support | No                              | Yes
Price per 1 million tokens        | $0.10                           | $0.02 ($0.00002 per 1,000 tokens)

With these new features, we expect many more customers to choose Amazon Titan Text Embeddings V2 to build common generative artificial intelligence (AI) applications. In this post, we discuss the benefits of the V2 model, how to conduct your own evaluation of the model, and how to migrate to the new model.

Let’s dig in!

Benefits of Amazon Titan Text Embeddings V2

Amazon Titan Text Embeddings V2 is the second-generation embedding model for Amazon Bedrock, optimized for some of the most common use cases we see among our customers. Key features include:

  • Optimized for RAG solutions
  • Flexible embedding sizes
  • Improved multilingual and code support

Embeddings have become an integral part of various NLP applications, and their quality is crucial for achieving optimal performance.

The large language model (LLM) landscape is rapidly evolving, with leading providers offering increasingly powerful and versatile embedding models. Although incremental improvements in embedding quality may seem modest at a high level, the actual benefits can be significant for specific use cases. For example, in a recommendation system for a large ecommerce platform, a modest increase in recommendation accuracy could translate into significant additional revenue.

A common way to select an embedding model (or any model) is to look at public benchmarks; an accepted benchmark for measuring embedding quality is the MTEB leaderboard. The Massive Text Embedding Benchmark (MTEB) evaluates text embedding models across a wide range of tasks and datasets. MTEB encompasses 8 different embedding tasks, covering a total of 58 datasets and 112 languages. In this benchmark, 33 different text embedding models were evaluated on the MTEB tasks. A key finding from the benchmark was that no single text embedding method emerged as the clear leader across all tasks and datasets. Each model exhibited strengths and weaknesses depending on the specific embedding task and data characteristics. This highlights the need for continued research into developing more versatile and robust text embedding techniques that can perform well across diverse use cases and language domains.

Although this is a useful benchmark, we caution our enterprise customers with the following considerations:

  • Although the MTEB leaderboard is widely recognized, it provides only a partial assessment by focusing solely on accuracy metrics and overlooking crucial practical factors like inference latency and model capabilities. The leaderboard rankings combine and compare embedding models across different vector dimensions, making direct and fair model comparisons challenging.
  • Additionally, the leaders on this accuracy-centric leaderboard change frequently as new models are continually introduced, providing a shifting and incomplete perspective on practical model performance trade-offs that real-world applications must consider beyond just accuracy numbers.
  • Lastly, costs need to be weighed against the expected benefits and performance improvements in the specific use case. A small gain in accuracy may not justify the significant overhead and opportunity costs of transitioning embeddings models, especially in large-scale, business-critical applications. Enterprises should perform a rigorous cost-benefit analysis to make sure the projected performance uplift from an updated embeddings model provides sufficient return on investment (ROI) to offset the migration costs and operational disruption.

In summary, start with evaluating the benchmark scores, but don’t decide until you have done your own due diligence.

Benchmark results

The Amazon Titan Text Embeddings V2 model can output embeddings of various sizes. Using a smaller size reduces your memory footprint, which translates directly into cost savings. The default output size is 1,024 dimensions, compared to the 1,536-dimension output of V1, an immediate size reduction of approximately 33%. That reduction matters because the vector database is a major cost component of a RAG solution. In our internal testing, we found that using the 256-dimension output resulted in only about 3.24% accuracy loss while delivering roughly a four times saving due to the size reduction. Running our evaluation on MTEB datasets, we found Amazon Titan Text Embeddings V2 to perform competitively, with scores such as 57.5 on reranking tasks. With the model trained on over 100 languages, it’s no surprise that it achieves a score of 55 on the MIRACL multilingual dataset and an overall weighted average MTEB score of 60.37. Full MTEB scores are available on the MTEB leaderboard.

However, we strongly encourage you to run your own benchmarks with your own dataset to understand the operational metrics. A sample notebook showing how to run the benchmarks against the MTEB datasets is hosted here. The key steps involved are:

  1. Choose a representative set of data to embed and keywords to search.
  2. Use the Amazon Titan Text Embeddings V2 model to embed your data and keywords, adjusting the chunk size and overlap as needed.
  3. Carry out a similarity search using your preferred vector comparison method (such as Euclidean distance or cosine similarity); a minimal sketch follows this list.
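
For step 3, the comparison itself is a few lines of vector math. The following is a minimal sketch, assuming you already have two embedding vectors returned by the model; the helper functions are illustrative and not taken from the sample notebook:

import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# With normalize=True (unit vectors), cosine similarity reduces to a plain dot
# product, and ranking by cosine similarity or Euclidean distance is equivalent.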

Use Amazon Titan Text Embeddings V2 on Amazon Bedrock

The new Amazon Titan Text Embeddings V2 model is available through the fully managed, serverless experience on Amazon Bedrock. You can use the model through either the Amazon Bedrock REST API or the AWS SDK. The required parameters are the text for which you want to generate the embeddings and the modelId parameter, which represents the name of the Amazon Titan Text Embeddings model. Furthermore, you can now specify the output size of the vector, which is a significant new feature of the V2 model.

Throughput has been a key requirement for running large ingestion workloads, and the Amazon Titan Text Embeddings model supports batching via Bedrock Batch to increase the throughput for your workloads. The following code is an example using the AWS SDK for Python (Boto3):

import boto3
import json

# Create the connection to Bedrock
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
)

# Define prompt and model parameters
prompt_data = """Priority should be funding retirement through ROTH/IRA/401K over HSA extra.  You need to fund your HSA for reasonable and expected medical expenses. """
modelId = "amazon.titan-embed-text-v2:0"
accept = "application/json"
contentType = "application/json"

sample_model_input = {
    "inputText": prompt_data,
    "dimensions": 256,
    "normalize": True
}

body = json.dumps(sample_model_input)

# Invoke the model
response = bedrock_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)

response_body = json.loads(response.get('body').read())
embedding = response_body.get("embedding")

# Print the length of the embedding vector and its first and last few values
print(f"The embedding vector has {len(embedding)} values\n{embedding[0:3]+['...']+embedding[-3:]}")

The full notebook is available on the GitHub repo.

With Amazon Titan Text Embeddings, you can input up to 8,192 tokens, allowing you to work with phrases or entire documents based on your use case. The model returns output vectors with dimensions ranging from 256 to 1,024 without sacrificing accuracy, while also optimizing for storage cost and low latency. Typically, models with larger context windows are tuned for accuracy at the expense of latency because they’re used in asynchronous workloads. However, even with its larger context window, Amazon Titan Text Embeddings achieves low latency, and with batching, it provides higher throughput for your workloads.

Run your own benchmarking

We always encourage our customers to perform their own benchmarking using their documents or the standard MTEB datasets and evaluation. For a sample of how to use the MTEB, see the GitHub repo. This notebook shows you how to load the dataset and set up evaluation for your specific use case (task) and run the benchmarking. If you run the benchmarking with your dataset, the typical steps involved are:

  1. Use the Amazon Titan Text Embeddings V2 model to embed your data and keywords, adjusting the chunk size and overlap as needed.
  2. Run similarity searches using your preferred distance metrics based on your choice of vector database.

A sample notebook showing how to use an in-memory database is available in the GitHub repo. This is a sample setup and should not be used for production workloads, where you would connect to robust vector database offerings such as Amazon OpenSearch Serverless.
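
As a rough illustration of such an in-memory setup, the following sketch stores precomputed Amazon Titan Text Embeddings V2 vectors in a NumPy array and retrieves the top-k most similar chunks by cosine similarity. The variable names (doc_vectors, doc_texts) are assumptions for illustration and this is not the notebook’s code:

import numpy as np

# Assume doc_vectors is an (N, d) array of Titan V2 embeddings for N chunks
# and doc_texts is the matching list of chunk texts.
doc_matrix = np.asarray(doc_vectors, dtype=np.float32)
doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

def top_k(query_vector, k=5):
    """Return the k chunks most similar to the query embedding."""
    q = np.asarray(query_vector, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = doc_norms @ q                  # cosine similarity against every chunk
    best = np.argsort(scores)[::-1][:k]     # indices of the top-k scores
    return [(doc_texts[i], float(scores[i])) for i in best]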

Migrate to Amazon Titan Text Embeddings V2

The cost and performance advantages provided by the V2 model are compelling reasons to consider reindexing your existing vector embeddings using V2. Let’s explore a few examples to illustrate the potential benefits, focusing solely on embedding costs.

Use case 1: High volume of searches

This first use case pertains to customers with a high volume of searches. The details are as follows:

  • Scenario:
    • 1 million documents, 100 million chunks, 1,000 average tokens per chunk
    • 100,000 searches per day, 1,000 token size for search
  • One-time cost:
    • Number of tokens: 100,000 million
    • Price per million tokens: $0.02
    • Reindexing cost: 100,000 * $0.02 = $2,000
  • Ongoing monthly savings (compared to V1):
    • Tokens embedded per month: 30 * 100,000 * 1,000 = 3,000 million
    • Savings per month (when migrating from V1 to V2): 3,000 * ($0.1 – $0.02) = $240

For this use case, the one-time reindexing cost of $2,000 will likely break even within 8–9 months through the ongoing monthly savings.
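
The break-even figure above is straightforward arithmetic; the following sketch reproduces it using the scenario’s assumed volumes (these are illustrative inputs, not measured values):

# Use case 1: one-time reindexing cost vs. ongoing savings when moving from V1 to V2
chunks = 100_000_000            # 100 million chunks
tokens_per_chunk = 1_000
searches_per_day = 100_000
tokens_per_search = 1_000

price_v1 = 0.10 / 1_000_000     # $ per token
price_v2 = 0.02 / 1_000_000     # $ per token

one_time_reindex = chunks * tokens_per_chunk * price_v2
monthly_search_tokens = 30 * searches_per_day * tokens_per_search
monthly_savings = monthly_search_tokens * (price_v1 - price_v2)

print(f"Reindexing cost: ${one_time_reindex:,.0f}")                      # $2,000
print(f"Monthly savings: ${monthly_savings:,.0f}")                       # $240
print(f"Break-even: {one_time_reindex / monthly_savings:.1f} months")    # ~8.3 months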

Use case 2: Ongoing indexing

This use case is for customers with ongoing indexing. The details are as follows:

  • Scenario:
    • 500,000 documents, 50 million chunks, average 1,000 tokens per chunk
    • 10,000 (2%) new documents added per month
    • 1,000 searches per day, 1,000 token size for search
  • One-time cost:
    • Number of tokens: 50,000 million
    • Price per million tokens: $0.02
    • Reindexing cost: 50,000 * $0.02 = $1,000
  • Ongoing monthly savings (compared to V1):
    • Tokens embedded per month for storage: 1,000 * 1,000 * 1,000 = 1,000 million
    • Tokens embedded per month for search: 30 * 1,000 * 1,000 = 30 million
    • Savings per month (vs. V1): 1,030 * ($0.1 – $0.02) = $82.4

For this use case, the one-time reindexing cost of $1,000 nets an estimated monthly savings of $82.4.

These calculations do not account for the additional savings due to the reduced storage size (up to four times) with V2. This could translate into further cost savings in terms of your vector database storage requirements. The extent of these savings will vary depending on your specific data storage needs.

Conclusion

In this post, we introduced the new Amazon Titan Text Embeddings V2 model, with superior performance across various use cases like retrieval, reranking, and multilingual tasks. You can potentially realize substantial cost savings and performance improvements by reindexing your vector embeddings using the V2 model. The specific benefits will vary based on factors such as the volume of data, search traffic, and storage requirements, but the examples discussed in this post illustrate the potential value proposition. Amazon Titan Text Embeddings V2 is available today in the us-east-1 and us-west-2 AWS Regions.


About the authors

Shreyas Subramanian is a Principal AI/ML specialist Solutions Architect who helps customers solve their business challenges using Machine Learning on the AWS platform. Shreyas has a background in large-scale optimization and Machine Learning, and in the use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pradeep Sridharan is a Senior Solutions Architect at AWS. He has years of experience in digital business transformation—designing and implementing solutions to drive market competitiveness and revenue growth across multiple sectors. He specializes in AI/ML, Data Analytics and Application Modernization and Migration. Pradeep is based in Arizona (US).

Anuradha Durfee is a Senior Product Manager at AWS working on generative AI. She has spent the last five years working on natural language understanding and is motivated by enabling life-like conversations between humans and technology. Anuradha is based in Boston, MA.

Read More

NVIDIA AI Microservices for Drug Discovery, Digital Health Now Integrated With AWS

Harnessing optimized AI models for healthcare is easier than ever as NVIDIA NIM, a collection of cloud-native microservices, integrates with Amazon Web Services.

NIM, part of the NVIDIA AI Enterprise software platform available on AWS Marketplace, enables developers to access a growing library of AI models through industry-standard application programming interfaces, or APIs. The library includes foundation models for drug discovery, medical imaging and genomics, backed by enterprise-grade security and support.

NIM is now available via Amazon SageMaker — a fully managed service to prepare data and build, train and deploy machine learning models — and AWS ParallelCluster, an open-source tool to deploy and manage high performance computing clusters on AWS. NIMs can also be orchestrated using AWS HealthOmics, a purpose-built service for biological data analysis.

Easy access to NIM will enable the thousands of healthcare and life sciences companies already using AWS to deploy generative AI more quickly, without the complexities of model development and packaging for production. It’ll also help developers build workflows that combine AI models across different modalities, such as amino acid sequences, MRI images and plain-text patient health records.

Presented today at the AWS Life Sciences Leader Symposium in Boston, this initiative extends the availability of NVIDIA Clara accelerated healthcare software and services on AWS — which include fast and easy-to-deploy NIMs from NVIDIA BioNeMo for drug discovery, NVIDIA MONAI for medical imaging workflows and NVIDIA Parabricks for accelerated genomics.

Pharma and Biotech Companies Adopt NVIDIA AI on AWS

BioNeMo is a generative AI platform of foundation models, training frameworks, domain-specific data loaders and optimized training recipes that support the training and fine-tuning of biology and chemistry models on proprietary data. It’s used by more than 100 organizations globally.

Amgen, one of the world’s leading biotechnology companies, has used the BioNeMo framework to train generative models for protein design, and is exploring the potential use of BioNeMo with AWS.

BioNeMo models for protein structure prediction, generative chemistry and molecular docking prediction are available as NIM microservices, pretrained and optimized to run on any NVIDIA GPU or cluster of GPUs. These models can be combined to support a holistic, AI-accelerated drug discovery workflow.

Biotechnology company A-Alpha Bio harnesses synthetic biology and AI to measure, predict and engineer protein-to-protein interactions. When its researchers moved from a generic version of the ESM-2 protein language model to a version optimized by NVIDIA running on NVIDIA H100 Tensor Core GPUs on AWS, they immediately saw a speedup of more than 10x. This lets the team sample a much more extensive field of protein candidates than they would have otherwise.

For organizations that want to augment these models with their own experimental data, NIM enables developers to enhance a model with retrieval-augmented generation, or RAG — known as a lab-in-the-loop design.

Parabricks Enables Accelerated Genomics Pipelines

NVIDIA NIM includes genomics models from NVIDIA Parabricks, which are also available on AWS HealthOmics as Ready2Run workflows that enable customers to deploy pre-built pipelines.

Life sciences company Agilent used Parabricks genomics analysis tools running on NVIDIA GPU-powered Amazon Elastic Compute Cloud (EC2) instances to significantly improve processing speeds for variant calling workflows on the company’s cloud-native Alissa Reporter software. Integrating Parabricks with Alissa secondary analysis pipelines enables researchers to access rapid data analysis in a secure cloud environment.

Conversational AI Technology Supports Digital Health

In addition to models that can decode proteins and genomic sequences, NIM microservices offer optimized large language models for conversational AI and visual generative AI models for avatars and digital humans.

AI-powered digital assistants can enhance healthcare by answering patient questions and supporting clinicians with logistics. Trained on healthcare organization-specific data using RAG, they could connect to relevant internal data sources to synthesize research, surface insights and improve productivity.

Generative AI startup Hippocratic AI is in the final stages of testing AI-powered healthcare agents that focus on a wide range of tasks including wellness coaching, preoperative outreach and post-discharge follow-up.

The company, which uses NVIDIA GPUs through AWS, is adopting NVIDIA NIM and NVIDIA ACE microservices to power a generative AI agent for digital health.

The team used NVIDIA Audio2Face facial animation technology, NVIDIA Riva automatic speech recognition and text-to-speech capabilities, and more to power a healthcare assistant avatar’s conversation.

Experiment with NVIDIA NIMs for healthcare and get started with NVIDIA Clara on AWS.

Subscribe to NVIDIA healthcare news.

Read More

GeForce NOW Delivers 24 A-May-zing Games This Month

GeForce NOW brings 24 new games for members this month.

Ninja Theory’s highly anticipated Senua’s Saga: Hellblade II will be coming to the cloud soon — get ready by streaming the first in the series, Hellblade: Senua’s Sacrifice, part of the seven new games joining the GeForce NOW library this week.

Plus, game across more devices than ever as GeForce NOW adds improved support on Steam Deck this GFN Thursday.

Journey into Viking Hell

Hellblade 1 on GeForce NOW
No sacrificing frame rates, streaming from the cloud.

Experience exceptional storytelling in Ninja Theory’s award-winning game Hellblade: Senua’s Sacrifice, available to stream from the cloud this week.

Set in a dark fantasy world inspired by Norse mythology and Celtic culture, the game follows the journey of Senua, a Pict warrior. Her quest is to reach Helheim, the realm of the dead, to rescue her deceased lover’s soul from the goddess Hela.

Solve intricate puzzles with observation, engage in melee combat and get pulled deep into Senua’s mind as she grapples with inner demons. Journey through the hauntingly beautiful landscapes of Helheim with ray tracing and high-dynamic range using an Ultimate or Priority membership for the most immersive and stunning visual fidelity.

Decked Out

Thanks to GeForce NOW, nearly any device can perform like a GeForce-powered PC gaming rig. Members who want to stream their favorite PC games to Valve’s Steam Deck now have an easier way to get started.

Members can use a new beta installation method to automatically configure GeForce NOW’s browser in Steam Deck’s Gaming Mode. The installation script automatically installs Google Chrome to the device, then adds all the settings needed to help members log into GeForce NOW and stream their favorite games.

The latest GeForce NOW update, released last week, also allows members to navigate GeForce NOW on a browser using a gamepad, including on the Steam Deck. That means it’s even easier to find and play supported titles with NVIDIA DLSS technology or real-time ray tracing, regardless of the handheld’s system specs.

Plus, members can easily play non-Steam games on the Steam Deck thanks to the cloud. This includes games from Battle.net, Epic Games Store, Ubisoft Connect, GOG.com and Xbox, as well as supported PC Game Pass titles. No more worrying about downloads or backend configurations. And with more than 1,900 games supported, there’s always something new to stream.

Steam Deck is just one of many popular handheld PC devices with support for GeForce NOW. Others include ASUS ROG Ally, Logitech G Cloud, Lenovo Legion Go, MSI Claw and Razer Edge. Get started now. Learn more about how to configure GeForce NOW on Steam Deck.

May New Games Be With You

Foundry on GeForce NOW
If you build it, they will come.

Get complete creative freedom in Paradox Interactive’s FOUNDRY, an exciting first-person factory-building sandbox game set in an infinite voxel world for expansive, ever-changing landscapes. Build a factory optimized to perfection or an artistic masterpiece, harvest resources, automate ever-growing production lines and manage complex systems to achieve mechanical mastery in FOUNDRY.

Check out the list of new games this week:

  • Hellblade: Senua’s Sacrifice (Steam and Xbox, available on PC Game Pass)
  • Stormgate Closed Beta (New release on Steam, April 30, sign up for access)
  • Gray Zone Warfare (New release on Steam, April 30)
  • MotoGP24 (New release on Steam, May 2)
  • FOUNDRY (New release on Steam, May 2)
  • INDIKA (New release on Steam, May 2)
  • Orcs Must Die! 3 (New release on Epic Games Store, May 2)

And members can look for the following throughout the rest of the month:

  • Little Kitty, Big City (New release on Steam and Xbox, available on PC Game Pass, May 9)
  • Ships at Sea (New release on Steam, May 9)
  • The Rogue Prince of Persia  (New release on Steam, May 14)
  • Men of War II (New release on Steam, May 15)
  • Die by the Blade (New release on Steam, May 16)
  • Norland (New release on Steam, May 16)
  • Gestalt: Steam & Cinder (New release on Steam, May 21)
  • Synergy (New release on Steam, May 21)
  • SunnySide (New release on Steam, May 21)
  • Crown Wars: The Black Prince (New release on Steam, May 23)
  • Capes (New release on Steam, May 29)
  • Colony Survival (Steam)
  • Exo One (Steam)
  • Farmer’s Life (Steam)
  • Honkai: Star Rail (Epic Games Store)
  • Phantom Brigade (Steam)
  • Supermarket Simulator (Steam)

April Showers Brought More Titles

In addition to the 19 games announced last month, 20 more joined the GeForce NOW library:

  • Gigantic: Rampage Edition (New release on Steam, April 9)
  • Inkbound 1.0 (New release, on Steam, April 9)
  • Broken Roads (New release on Steam, April 10)
  • Infection Free Zone (New release on Steam, April 11)
  • Shadow of the Tomb Raider: Definitive Edition (New release on Xbox and available on PC Game Pass, April 11)
  • Ghostrunner (Epic Games Store, free April 11-18)
  • Kill It With Fire 2 (New release on Steam, April 16)
  • The Crew Motorfest (New release on Steam, April 18)
  • No Rest for the Wicked (New release on Steam, April 18)
  • Bellwright (New release on Steam, April 23)
  • Age of Water (New release on Steam, April 25)
  • Diablo II: Resurrected (Battle.net)
  • Diablo III (Battle.net)
  • Fallout 4 (Steam)
  • Fallout 76 (Steam and Xbox, available on PC Game Pass)
  • StarCraft Remastered (Battle.net)
  • StarCraft II (Battle.net)
  • Stargate: Timekeepers (Steam)
  • Terra Invicta (Xbox, available on PC Game Pass)
  • Tomb Raider I-III Remastered (Steam)

Riot Games has rolled out the League of Legends 14.9 update globally, which adds the Vanguard security software to the game. Since Vanguard doesn’t support virtual machines like GeForce NOW, the game is under maintenance and no longer playable on the cloud gaming platform for the foreseeable future. Work is underway to find a solution for GeForce NOW members.

Lightyear Frontier (Xbox) didn’t make it in April due to technical issues. Stay tuned to GFN Thursday for updates.

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Announcing PyTorch Docathon June, 2024

We are thrilled to announce the upcoming PyTorch Docathon in June! The Docathon, akin to a hackathon, is an event dedicated to enhancing the quality of the PyTorch documentation with the invaluable assistance of our community. Documentation is a vital component of any technology. By refining it, we can simplify the process for new users to get started with PyTorch, guide them in effectively utilizing its features, and ultimately expedite the transition from research to production in machine learning. See our previous events here and here.

Why Participate

The Docathon is an inclusive event designed to be accessible to newcomers, requiring only a basic understanding of Python, PyTorch, and Machine Learning, with some tasks not even requiring these skills. It offers a rewarding experience as participants can see the direct impact of their contributions on the project’s usability and accessibility. The Docathon promotes a collaborative environment, allowing participants to work with other contributors and PyTorch maintainers, fostering the exchange of ideas and networking. It also provides a rich learning experience, offering the opportunity to explore PyTorch modules, update docstrings, and test tutorials.

Event Details

  • June 4: Kick-off
  • June 4-June 16: Submissions and Feedback
  • June 17-18: Final Reviews
  • June 20: Winner Announcements

Further details for the Docathon will be announced at the Kick-off call on June 4.

Please register to join this year’s event.

Read More

A Hitchhiker’s Guide to Speculative Decoding

Speculative decoding is an optimization technique for inference that makes educated guesses about future tokens while generating the current token, all within a single forward pass. It incorporates a verification mechanism to ensure the correctness of these speculated tokens, thereby guaranteeing that the overall output of speculative decoding is identical to that of vanilla decoding. Optimizing the cost of inference of large language models (LLMs) is arguably one of the most critical factors in reducing the cost of generative AI and increasing its adoption. Towards this goal, various inference optimization techniques are available, including custom kernels, dynamic batching of input requests, and quantization of large models.

In this blog post, we provide a guide to speculative decoding and demonstrate how it can coexist with other optimizations. We are proud to open source the following, which includes the first speculator for Llama3 models:

  1. Speculator models for Meta Llama3 8B, IBM Granite 7B lab, Meta Llama2 13B, and Meta Code Llama2 13B.
  2. The code for inference via IBM’s fork of HF TGI.
  3. The code for training your own speculators and corresponding recipes.

We have deployed these speculators in an internal production-grade environment with thousands of daily users and observed a 2x speedup on language models (Llama3 8B, Llama2 13B, and IBM Granite 7B) and a 3x speedup on IBM’s Granite 20B code models. We provide a detailed explanation of our approach in this technical report and are planning an in-depth analysis in an upcoming ArXiv paper.

Speculative decoding: Inference

We run IBM TGIS in our internal production environment that has optimizations such as continuous batching, fused kernels, and quantization kernels. To enable speculative decoding in TGIS, we modified the paged attention kernel from vLLM. In what follows, we will describe the key changes to the inference engine to enable speculative decoding.

Speculative decoding is based on the premise that the model is powerful enough to predict multiple tokens in a single forward pass. However, current inference servers are optimized to predict only a single token at a time. In our approach, we attach multiple speculative heads (in addition to the usual one) to the LLM to predict the (N+1)th, (N+2)th, (N+3)th, and subsequent tokens; for example, 3 heads will predict 3 additional tokens. Details of the speculator architecture are explained in a later part of this blog. There are two challenges to achieving efficiency and correctness during inference: one is to predict without replicating the KV-cache, and the other is to verify that the predictions match the original model’s outcomes.

In a typical generation loop, after the prompt is processed in a single forward step, a sequence length of 1 (the next predicted token) is fed into the forward pass of the model along with the kv-cache. In a naive speculative decoding implementation, each speculative head would have its own kv-cache; instead, we modify the paged attention kernel developed in the vLLM project to enable efficient kv-cache maintenance. This ensures that throughput does not reduce at larger batch sizes. Further, we modify the attention masks to enable verification of the (N+1)th token and thus enable speculative decoding without deviating from the original model’s output. The details of this implementation are captured here.
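
To make the verification step concrete, the following is a highly simplified sketch of greedy verification in PyTorch. It is not the TGIS kernel-level implementation; in the real system this check is folded into the attention masks of the same forward pass, and the function and tensor names here are illustrative only:

import torch

def verify_speculated_tokens(base_logits: torch.Tensor, speculated: torch.Tensor) -> torch.Tensor:
    """Accept the longest prefix of speculated tokens that agrees with the base
    model's greedy choice at each position (greedy decoding only).

    base_logits: [n_speculated, vocab_size] logits produced by the base model
                 for the speculated positions in a single forward pass.
    speculated:  [n_speculated] token ids proposed by the speculator heads.
    """
    greedy = base_logits.argmax(dim=-1)            # tokens the base model would emit
    mismatches = (greedy != speculated).nonzero()
    n_accept = int(mismatches[0]) if mismatches.numel() > 0 else speculated.numel()
    return speculated[:n_accept]                   # output stays identical to vanilla decoding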

Results

We illustrate the speedup obtained with Meta’s chat version of Llama2 13B using a simple prompt.

Figure 2: Visual illustration of the non-speculative generation (left) compared to speculative generation (right)

We deployed the above solution in an internal production environment. The figure below reports two metrics – time to first token (TTFT) and inter-token latency (ITL) with different numbers of concurrent users (which is captured in the numbers on the graph lines). We observe that the speculative decoding version is nearly twice as fast for the Llama2 13B chat model and nearly thrice as fast for the Granite 20B code model compared to the non-speculative version for all batch sizes. We observe similar behavior for the smaller models – IBM’s Granite 7B and Meta Llama3 8B models.

Figure 3: Time to first token (TTFT – left) and Inter-token latency (ITL – right) for Llama 13B with number of concurrent users indicated on the graph

Figure 4: Time to first token (TTFT – left) and Inter-token latency (ITL – right) for Granite 20B Code with number of concurrent users indicated on the graph

Note on efficiency

We performed numerous experiments to determine the right configuration for speculator training. These are:

  1. Speculator architecture: The current approach allows the number of heads to be modified, which maps to the number of tokens that we can look ahead. Increasing the number of heads also increases the amount of extra compute needed and the complexity of training. In practice, we find that 3-4 heads work well for language models, whereas code models can reap benefits from 6-8 heads.
  2. Compute: Increasing the number of heads increases compute in two ways: the latency of a single forward pass grows, and additional compute is needed for the extra tokens. If the speculator is not accurate with more heads, the wasted compute increases latency and reduces throughput.
  3. Memory: The increased compute is offset by the round trips to HBM that are saved when speculated tokens are accepted. Note that if we get a 3-token lookahead correct, we have saved three round-trip times on HBM.

We settled on 3-4 heads for the language models and 6-8 heads for the code models and across different model sizes ranging from 7B to 20B, we observed significant latency improvements without throughput loss compared to non-speculative decoding. We begin to observe throughput reduction beyond a batch size of 64, which happens rarely in practice.

Speculative decoding: Training

There are two broad approaches for speculative decoding, one is to leverage a smaller model (e.g., Llama 7B as a speculator for Llama 70B) and the other is to attach speculator heads (and train them). In our experiments, we find the approach of attaching speculator heads to be more effective both in model quality and latency gains.

Speculator architecture

Medusa made speculative decoding popular; their approach is to add a head to the existing model which is then trained to do speculation. We modify the Medusa architecture by making the “heads” hierarchical, where each head stage predicts a single token and then feeds it to the next head stage. These multi-stage heads are depicted in the below figure. We are exploring ways of minimizing the embeddings table by sharing these across the multiple stages and base model.

Figure 4: A simple architecture diagram for a 3-headed multi-stage speculator. Z is the state from the base model.
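
As a rough, hypothetical illustration of the multi-stage idea (not the IBM FMS implementation), each stage in the sketch below consumes the base-model state plus the embedding of the previously predicted token and emits logits for the next speculated position:

import torch
import torch.nn as nn

class MultiStageSpeculator(nn.Module):
    """Conceptual sketch of hierarchical speculator heads."""
    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.stages = nn.ModuleList(
            [nn.Linear(2 * hidden_size, hidden_size) for _ in range(n_heads)]
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, z: torch.Tensor, last_token: torch.Tensor) -> torch.Tensor:
        # z: [batch, hidden] state from the base model; last_token: [batch] token ids
        state, tok_emb, all_logits = z, self.embed(last_token), []
        for stage in self.stages:
            state = torch.tanh(stage(torch.cat([state, tok_emb], dim=-1)))
            logits = self.lm_head(state)                 # logits for this speculated position
            all_logits.append(logits)
            tok_emb = self.embed(logits.argmax(dim=-1))  # feed this stage's prediction to the next stage
        return torch.stack(all_logits, dim=1)            # [batch, n_heads, vocab_size]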

Speculator training

We have a two-phase approach to training a speculator for efficiency reasons. In the first phase, we train on small batches with long sequence lengths (4k tokens) and use the standard causal LM approach for training. In phase 2, we use large batches with short sequence lengths (256 tokens) generated from the base model. In this training phase, we tune the heads to match the output of the base model. Through numerous experiments, we find that a 5:2 ratio of steps for phase 1 vs phase 2 works well. We depict the progress of these phases in the below figure. We use PyTorch FSDP and IBM FMS for the training of speculators.

Figure 5: Per-head training loss curves for Llama2-13B speculator training, phase 1 and 2

Conclusion and Future Work

Through this blog, we are releasing a new approach for speculative decoding and the following assets:

  1. Models for improving the inter-token latencies for a range of models – Llama3 8B, Llama2 13B, Granite 7B, and CodeLlama 13B
  2. Production quality code for inference
  3. Recipes for training speculators

We are working on training speculators for Llama3 70B and Mistral models and invite the community to contribute as well as help improve on our framework. We would also love to work with major open source serving frameworks such as vLLM and TGI to contribute back our speculative decoding approach to benefit the community.

Acknowledgements

There are several teams that helped us get to these latency improvements for inference. We would like to thank the vLLM team for creating the paged attention kernel in a clean and reusable manner. We extend our gratitude to the Team PyTorch at Meta that helped provide feedback on this blog as well as continued efforts on optimal usage of PyTorch. Special thanks to our internal production teams at IBM Research who took this prototype to production and hardened it. A shout out to Stas Bekman for providing insightful comments on the blog resulting in an improved explanation of the tradeoffs between compute, memory, and speculator effectiveness.

The paged attention kernel was integrated into IBM FMS by Josh Rosenkranz and Antoni Viros i Martin. The speculator architecture and training was done by Davis Wertheimer, Pavithra Ranganathan, and Sahil Suneja. The integration of the modeling code with the inference server was done by Thomas Parnell, Nick Hill, and Prashant Gupta.

Read More

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing…Apple Machine Learning Research

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer…Apple Machine Learning Research

Pseudo-Generalized Dynamic View Synthesis from a Video

Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to our best knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. To explore whether generalized dynamic novel view synthesis from monocular…Apple Machine Learning Research

Explainable AI: Insights from Arthur’s Adam Wenchel

Arthur.ai enhances the performance of AI systems across various metrics like accuracy, explainability and fairness. In this episode of the NVIDIA AI Podcast, recorded live at GTC 2024, host Noah Kravitz sits down with Adam Wenchel, cofounder and CEO of Arthur, to discuss the challenges and opportunities of deploying generative AI. Their conversation spans a range of topics, including AI bias, the observability of AI systems and the practical implications of AI in business. For more on Arthur, visit arthur.ai.

Time Stamps:

  • 00:11: Introduction and background on Adam Wenchel and Arthur.ai.
  • 01:31: Discussion on the mission and services of Arthur.
  • 02:31: Real-world use cases of LLMs and generative AI in enterprises.
  • 06:22: Challenges in deploying AI systems internally within companies.
  • 08:23: The process of adapting AI models for specific business needs.
  • 09:26: Exploring AI observability and the importance of real-time monitoring.
  • 11:36: Addressing bias in AI systems and its implications.
  • 15:21: Wenchel’s journey from cybersecurity to AI and founding Arthur.
  • 20:38: Cybersecurity concerns with generative AI and large language models.
  • 21:37: Future of work and AI’s role in enhancing job performance.
  • 24:27: Future directions for Arthur and ongoing projects.

You Might Also Like…

ITIF’s Daniel Castro on Energy-Efficient AI and Climate Change – Ep. 215

AI-driven change is in the air, as are concerns about the technology’s environmental impact. In this episode of NVIDIA’s AI Podcast, Daniel Castro, vice president of the Information Technology and Innovation Foundation and director of its Center for Data Innovation, speaks with host Noah Kravitz about the motivation behind his AI energy use report, which addresses misconceptions about the technology’s energy consumption.

DigitalPath’s Ethan Higgins on Using AI to Fight Wildfires – Ep. 211

DigitalPath is igniting change in the golden state — using computer vision, generative adversarial networks and a network of thousands of cameras to detect signs of fire in real-time. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with DigitalPath system architect Ethan Higgins about the company’s role in the ALERTCalifornia initiative, a collaboration between California’s wildfire fighting agency CAL FIRE and the University of California, San Diego.

Anima Anandkumar on Using Generative AI to Tackle Global Challenges – Ep. 203

Generative AI-based models can not only learn and understand natural languages — they can learn the very language of nature itself, presenting new possibilities for scientific research. On the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Anandkumar on generative AI’s potential to make splashes in the scientific community.

How Alex Fielding and Privateer Space Are Taking on Space Debris – Ep. 196

In this episode of the NVIDIA AI Podcast, host Noah Kravitz dives into an illuminating conversation with Alex Fielding, co-founder and CEO of Privateer Space. Privateer Space, Fielding’s latest venture, aims to address one of the most daunting challenges facing our world today: space debris.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Amazon Music, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better: Have a few minutes to spare? Fill out this listener survey.

Read More

Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

Large language models (LLMs) are making a significant impact in the realm of artificial intelligence (AI). Their impressive generative abilities have led to widespread adoption across various sectors and use cases, including content generation, sentiment analysis, chatbot development, and virtual assistant technology. Llama2 by Meta is an example of an LLM offered by AWS. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture and is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart.

Many practitioners fine-tune or pre-train these Llama 2 models with their own text data to improve accuracy for their specific use case. However, in some cases, a challenge arises for practitioners: the high cost of fine-tuning and training. As organizations strive to push the boundaries of what LLMs can achieve, the demand for cost-effective training solutions has never been more pressing. In this post, we explore how you can use the Neuron distributed training library to fine-tune, continuously pre-train, and reduce the cost of training LLMs such as Llama 2 with AWS Trainium instances on Amazon SageMaker.

AWS Trainium instances for training workloads

SageMaker ml.trn1 and ml.trn1n instances, powered by Trainium accelerators, are purpose-built for high-performance deep learning training and offer up to 50% cost-to-train savings over comparable training optimized Amazon Elastic Compute Cloud (Amazon EC2) instances. This post implements a solution with the ml.trn1.32xlarge Trainium instance type, typically used for training large-scale models. However, there are also comparable ml.trn1n instances that offer twice as much networking throughput (1,600 Gbps) via Amazon Elastic Fabric Adapter (EFAv2). SageMaker Training supports the availability of ml.trn1 and ml.trn1n instances in the US East (N. Virginia) and US West (Oregon) AWS Regions, and most recently announced general availability in the US East (Ohio) Region. These instances are available in the listed Regions with On-Demand, Reserved, and Spot Instances, or additionally as part of a Savings Plan.

For more information on Trainium Accelerator chips, refer to Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker. Additionally, check out AWS Trainium Customers to learn more about customer testimonials, or see Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available to dive into the accelerator highlights and specifications.

Using the Neuron Distributed library with SageMaker

SageMaker is a fully managed service that provides developers, data scientists, and practitioners the ability to build, train, and deploy machine learning (ML) models at scale. SageMaker Training includes features that improve and simplify the ML training experience, including managed infrastructure and images for deep learning, automatic model tuning with hyperparameter optimization, and a pay-for-what-you-use billing structure. This section highlights the advantages of using SageMaker for distributed training with the Neuron Distributed library, specifically the managed infrastructure, time-to-train, and cost-to-train benefits of its associated resiliency and recovery features. The Neuron Distributed library is part of the AWS Neuron SDK, which is used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances.

In high performance computing (HPC) clusters, such as those used for deep learning model training, hardware resiliency issues can be a potential obstacle. Although hardware failures while training on a single instance may be rare, issues resulting in stalled training become more prevalent as a cluster grows to tens or hundreds of instances. Regular checkpointing helps mitigate wasted compute time, but engineering teams managing their own infrastructure must still closely monitor their workloads and be prepared to remediate a failure at all hours to minimize training downtime. The managed infrastructure of SageMaker Training includes several resiliency features that make this monitoring and recovery process streamlined:

  • Cluster health checks – Before a training job starts, SageMaker runs health checks and verifies communication on the provisioned instances. It then replaces any faulty instances, if necessary, to make sure the training script starts running on a healthy cluster of instances. Health checks are currently enabled for the TRN1 instance family as well as P* and G* GPU-based instance types.
  • Automatic checkpointing – Checkpoints from a local path (/opt/ml/checkpoints by default) are automatically copied to an Amazon Simple Storage Service (Amazon S3) location specified by the user. When training is restarted, SageMaker automatically copies the previously saved checkpoints from the S3 location back to the local checkpoint directory to make sure the training script can load and resume the last saved checkpoint.
  • Monitoring and tracking training – In the case of a node failure, it’s important to have the visibility of where the failure occurs. Using PyTorch Neuron gives data scientists the ability to track training progress in a TensorBoard. This allows you to capture the loss of the training job to determine when the training job should be stopped to identify the convergence of the model for optimal training.
  • Built-in retries and cluster repair – You can configure SageMaker to automatically retry training jobs that fail with a SageMaker internal server error (ISE). As part of retrying a job, SageMaker replaces any instances that encountered unrecoverable errors with fresh instances, reboots all healthy instances, and starts the job again. This results in faster restarts and workload completion. Cluster update is currently enabled for the TRN1 instance family as well as P and G GPU-based instance types. Practitioners can add their own applicative retry mechanism around the client code that submits the job, to handle other types of launch errors, such as exceeding your account quota. A configuration sketch for checkpointing and retries follows this list.
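
The checkpointing and retry behavior above is configured on the training job itself. The following is a minimal, hypothetical sketch using the SageMaker Python SDK; the entry point, role ARN, bucket, and image URI are placeholders, and parameter availability (for example, max_retry_attempts) depends on your SDK version:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                    # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",       # placeholder execution role
    instance_type="ml.trn1.32xlarge",
    instance_count=8,
    image_uri="<neuron-training-image-uri>",                   # placeholder; a Neuron image is defined later in this post
    checkpoint_s3_uri="s3://your-bucket/llama2-checkpoints/",  # checkpoints are synced here and restored on restart
    checkpoint_local_path="/opt/ml/checkpoints",               # local path that SageMaker syncs to S3
    max_retry_attempts=3,                                      # retries on SageMaker internal server errors
    max_run=2 * 24 * 60 * 60,                                  # hard timeout in seconds
)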

For customers working with large clusters of hundreds of instances for a training job, the resiliency and recovery features of SageMaker Training can reduce total time for a model to converge by up to 20% via fewer failures and faster recovery. This also enables engineering teams to monitor and react to failures at all hours. Although SageMaker training jobs are suitable for general-purpose training use cases with customizable configurations and integration with the broader AWS ecosystem, Amazon SageMaker HyperPod is specifically optimized for efficient and resilient training of foundation models at scale. For more information on SageMaker HyperPod use cases, refer to the SageMaker HyperPod developer guide.

In this post, we use the Neuron Distributed library to continuously pre-train a Llama 2 model using tensor and pipeline parallelism using SageMaker training jobs. To learn more about the resiliency and recovery features of SageMaker Training, refer to Training large language models on Amazon SageMaker: Best practices.

Solution overview

In this solution, we use an ml.t3.medium instance type on a SageMaker Jupyter notebook to run the provided cells. We will continuously pre-train our llama2-70b model using the trn1.32xlarge Trainium instance. First, let’s familiarize ourselves with the techniques we use to distribute the training job created in our solution to continuously pre-train the llama2-70b model using the Neuron Distributed training library.

The techniques used to convert the pre-trained weights in the convert_pretrained_weights.ipynb notebook into a .pt (PyTorch) weights file are called pipeline parallelism and tensor parallelism:

  • Pipeline parallelism splits the model’s layers into sequential stages across devices and divides each batch into microbatches, allowing each stage worker to process one microbatch at a time.
  • Tensor parallelism splits individual tensors of a neural network across multiple devices. This technique makes it possible to train models whose tensors can’t fit into the memory of a single device. A conceptual sketch of tensor parallelism follows this list.
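
The following toy example illustrates the intuition behind tensor (column) parallelism on a single machine. It is purely conceptual; on Trainium, the Neuron Distributed library performs the sharding and collective communication across Neuron cores for you:

import torch

hidden, out_features = 8, 6
x = torch.randn(4, hidden)                  # a batch of activations
w = torch.randn(hidden, out_features)       # weight matrix of a linear layer

w_shard_0, w_shard_1 = w.chunk(2, dim=1)    # each "device" holds half of the columns
y_0 = x @ w_shard_0                         # partial output computed on device 0
y_1 = x @ w_shard_1                         # partial output computed on device 1
y = torch.cat([y_0, y_1], dim=1)            # gather the partial outputs

assert torch.allclose(y, x @ w)             # identical to the unsharded computation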

After we convert our pre-trained weights with the preceding techniques in our first notebook, we follow two separate notebooks in the same sagemaker-trainium-examples folder. The second notebook is Training_llama2_70b.ipynb, which walks through the continuous pre-training process by saving our checkpoint of converted model weights in the first notebook and prepping it for inference. When this step is complete, we can run the Convert_Nxd_to_hf.ipynb notebook, which takes our pre-trained weights using the NeuronX library and converts it into a readable format in Hugging Face to serve inference.

Prerequisites

You need to complete some prerequisites before you can run the first notebook.

First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer to be used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request a minimum of 8 Trn1 instances ranging to a maximum of 32 Trn1 instances (depending on time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

  • Trainium instances (ml.trn1.32xlarge) for training job usage: 8–32
  • ml.trn1.32xlarge for training warm pool usage: 8–32
  • Maximum number of instances per training job: 8–32

It may take up to 24 hours for the quota increase to get approved. However, after submitting the quota increase, you can go to the sagemaker-trainium-examples GitHub repo and locate the convert_pretrained_weights.ipynb file. This is the file that you use to begin the continual pre-training process.

Now that you’re ready to begin the process to continuously pre-train the llama2-70b model, you can convert the pre-trained weights in the next section to prep the model and create the checkpoint.

Getting started

Complete the following steps:

  1. Install all the required packages and libraries: SageMaker, Boto3, transformers, and datasets.

These packages make sure that you can set up your environment to access your pre-trained Llama 2 model, download your tokenizer, and get your pre-training dataset.

!pip install -U sagemaker boto3 --quiet
!pip install transformers datasets[s3] --quiet
  2. After the packages are installed, retrieve your Hugging Face access token, and download and define your tokenizer.

The tokenizer meta-llama/Llama-2-70b-hf is a specialized tokenizer that breaks down text into smaller units for natural language processing. This tokenized data will later be uploaded into Amazon S3 to allow for running your training job.

from huggingface_hub.hf_api import HfFolder
# Update the access token to download the tokenizer
access_token = "hf_insert-key-here"
HfFolder.save_token(access_token)

from transformers import AutoTokenizer
tokenizer_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
block_size = 4096
  3. After following the preceding cells, download the wikicorpus dataset from the Hugging Face Hub.
  4. Tokenize the dataset with the Llama 2 tokenizer that you just initialized.

By tokenizing the data, you’re preparing to pre-train your Llama 2 model, exposing it to the trilingual (Catalan, English, Spanish) text data in the wikicorpus dataset so it can learn intricate patterns and relationships in the data.
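
The following is a minimal sketch of what the download and tokenization steps might look like. The dataset configuration name and the chunking logic are assumptions for illustration; follow the notebook for the exact code:

from datasets import load_dataset

# Download the wikicorpus dataset (the "raw_en" configuration is an assumption;
# the notebook may use a different configuration or subset)
raw_dataset = load_dataset("wikicorpus", "raw_en", split="train")

# Tokenize the raw text with the Llama 2 tokenizer defined earlier
def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)

# Pack the token ids into fixed-length blocks of block_size (4096) for causal LM pre-training
def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }

train_dataset = tokenized.map(group_texts, batched=True)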

After the data is tokenized, run the following cell to store the training dataset to s3:

# save training dataset to s3
training_input_path = f's3://{sess.default_bucket()}/neuronx_distributed/data'
print(f"uploading training dataset to: {training_input_path}")
train_dataset.save_to_disk(training_input_path)

print(f"uploaded data to: {training_input_path}")

The cell above makes sure that you define the training_input_path and have uploaded the data to your S3 bucket. You’re now ready to begin the training job process.

Run the training job

For the training job, we use trn1.32xlarge instances, each with 32 Neuron cores. We use tensor parallelism and pipeline parallelism, which allow you to shard the model across Neuron cores for training.

The following code is the configuration for pretraining llama2-70b with trn1:

#Number of processes per node
PROCESSES_PER_NODE = 32
# Number of instances within the cluster, change this if you want to tweak the instance_count parameter
WORLD_SIZE = 32
# Global batch size
GBS = 512
# Input sequence length
SEQ_LEN = 4096
# Pipeline parallel degree
PP_DEGREE = 8
# Tensor parallel degree
TP_DEGREE = 8
# Data parallel size
DP = (PROCESSES_PER_NODE * WORLD_SIZE) / (TP_DEGREE * PP_DEGREE)
# Batch size per model replica
BS = GBS / DP
# Number of microbatches for pipeline execution. Set equal to BS so each microbatch contains a single data sample
NUM_MICROBATCHES = BS
# Number of total steps for which to train model. This number should be adjusted to the step number when the loss function is approaching convergence.
MAX_STEPS = 1500
# Timeout in seconds for training. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
MAX_RUN = 2 * (24 * 60 * 60)
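
With these values, DP = (32 × 32) / (8 × 8) = 16 model replicas, so each replica processes BS = 512 / 16 = 32 samples per step, executed as 32 microbatches of one sample each.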

Now you can define the hyperparameters for training. Note that adjusting these parameters based on hardware capabilities, dataset characteristics, and convergence requirements can significantly impact training performance and efficiency.

The following is the code for the hyperparameters:

hyperparameters = {}
hyperparameters["train_batch_size"] = int(BS)
hyperparameters["use_meta_device_init"] = 1
hyperparameters["training_dir"] = "/opt/ml/input/data/train" # path where sagemaker uploads the training data
hyperparameters["training_config"] = "config.json" # config file containing llama 70b configuration , change this for tweaking the number of parameters.

hyperparameters["max_steps"] = MAX_STEPS
hyperparameters["seq_len"] = SEQ_LEN
hyperparameters["pipeline_parallel_size"] = PP_DEGREE
hyperparameters["tensor_parallel_size"] = TP_DEGREE
hyperparameters["num_microbatches"] = int(NUM_MICROBATCHES)
hyperparameters["lr"] = 0.00015
hyperparameters["min_lr"] = 1e-05
hyperparameters["beta1"] = 0.9
hyperparameters["beta2"] = 0.95
hyperparameters["weight_decay"] = 0.1
hyperparameters["warmup_steps"] = 2000
hyperparameters["constant_steps"] = 0
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["tb_dir"] = "/opt/ml/checkpoints/tensorboard" # The tensorboard logs will be stored here and eventually pushed to S3.

Now you specify the Docker image that will be used to train the model on Trainium:

docker_image = f"763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.0-ubuntu20.04"

The image we defined is designed for PyTorch training with Neuron optimizations. It is configured to work with PyTorch using Neuron SDK version 2.18.0 for enhanced performance and efficiency on Trn1 instances equipped with AWS Trainium chips. The image is also compatible with Python 3.10 (indicated by py310) and is based on Ubuntu 20.04.
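The region_name variable used in the image URI, along with the sess, sagemaker_session_bucket, and role values referenced elsewhere in the notebook, are assumed to be defined during the notebook's setup. A minimal sketch of that setup is the following:

import sagemaker

# Create a SageMaker session and derive the values used throughout the notebook.
sess = sagemaker.Session()
region_name = sess.boto_region_name               # Region used in the ECR image URI
sagemaker_session_bucket = sess.default_bucket()  # default bucket for data and checkpoints
role = sagemaker.get_execution_role()             # IAM role for the training job; set a role ARN manually if running outside SageMaker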

Prior to starting your training job, you need to configure it by defining all necessary variables. You do so by defining the training job name, checkpoint directory, and cache directory:

import time
# Define Training Job Name
job_name = f'llama-neuron-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# Define checkpoint directory that contains the weights and other relevant data for the trained model
checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/neuron_llama_experiment"
checkpoint_dir = '/opt/ml/checkpoints'</p><p>
In [ ]:
# Define neuron chache directory
cache_dir = "/opt/ml/checkpoints/neuron_cache"

These parameters and settings enable you to do the following:

  • The training job name lets you identify and track individual training jobs based on timestamps
  • The checkpoint directory specifies the S3 URI where the checkpoint data, weights, and other information for the trained model are stored
  • The cache directory helps optimize the training process by storing and reusing previously compiled artifacts from the checkpoint directory, reducing redundancy and improving efficiency
  • The environment variables (a representative sketch follows this list) make sure the training job is optimally configured, with settings that enable efficient and effective training using features like RDMA, optimized memory allocation, fused operations, and Neuron-specific device optimizations
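The env dictionary passed to the estimator in the next step is assumed to be defined earlier in the notebook. The following is a minimal sketch with representative values only; the exact variables and settings in the notebook may differ, so verify them against the sagemaker-trainium-examples repo.

# Representative sketch only: verify the exact variables against the notebook before using them.
env = {
    "FI_PROVIDER": "efa",                     # use EFA networking between instances
    "FI_EFA_USE_DEVICE_RDMA": "1",            # enable RDMA over EFA
    "FI_EFA_FORK_SAFE": "1",                  # fork-safe behavior for the EFA provider
    "MALLOC_ARENA_MAX": "128",                # cap glibc malloc arenas to reduce memory fragmentation
    "XLA_USE_BF16": "1",                      # train in bfloat16 on Trainium
    "NEURON_FUSE_SOFTMAX": "1",               # enable fused softmax in the Neuron compiler
    "NEURON_RT_STOCHASTIC_ROUNDING_EN": "1",  # stochastic rounding for bf16 training
}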

After you have defined your training job and configured all directories and environment variables for an optimal training pipeline, you now set up your PyTorch estimator to begin the training job on SageMaker. A SageMaker estimator is a high-level interface that handles the end-to-end SageMaker training and deployment tasks.

The entry_point is specified as the Python script run_llama_nxd.py. We use the instance type ml.trn1.32xlarge, an instance count of 32 (previously defined by the WORLD_SIZE variable in the configuration code), and input_mode set to FastFile. Fast File mode in SageMaker streams data from Amazon S3 on demand, which optimizes data loading performance by fetching data as needed and reduces overall resource consumption. For more information on input modes, refer to Access Training Data.

from sagemaker.pytorch import PyTorch

# Handle end-to-end Amazon SageMaker training and deployment tasks.
pt_estimator = PyTorch(
    entry_point='run_llama_nxd.py',
    source_dir='./scripts',
    instance_type="ml.trn1.32xlarge",
    image_uri=docker_image,
    instance_count=WORLD_SIZE,
    max_run=MAX_RUN,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name=job_name,
    environment=env,
    input_mode="FastFile",
    disable_output_compression=True,
    keep_alive_period_in_seconds=600,  # this is added to enable warm pool capability
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=checkpoint_dir,
    distribution={"torch_distributed": {"enabled": True}}  # enable torchrun
)

Finally, you can start the training job with the SageMaker fit() method, which trains the model based on the defined hyperparameters:

# Start training job
pt_estimator.fit({"train": training_input_path})

You have successfully started the process to continuously pre-train a llama2-70b model: you converted the pre-trained weights and launched a SageMaker training job on Trainium instances using your tokenized data.

Continuous pre-training

After completing the prerequisites, working through the first notebook, and converting the pre-trained weights into a checkpoint, you can now begin the continuous pre-training process, using that checkpoint as the starting point for pre-training the llama2-70b model. The techniques used in the convert_pretrained_weights.ipynb notebook to convert the pre-trained weights into .pt (PyTorch) weight files are pipeline parallelism and tensor parallelism.

To begin the continuous pre-training process, follow the Training_llama2_70b.ipynb file in the sagemaker-trainium-examples repo.

Given the large size of the llama2-70b model, you need to convert the pre-trained weights into a more efficient and usable format (.pt). You can do so by defining the hyperparameters in your configuration to store the converted weights and checkpoints. The following are the hyperparameters:

# Use the sagemaker s3 checkpoints mechanism since we need read/write access to the paths.
hyperparameters["output_dir"] = "/opt/ml/checkpoints/llama70b_weights"
hyperparameters["checkpoint-dir"] = '/opt/ml/checkpoints'<br />hyperparameters["n_layers"] = 80
hyperparameters["convert_from_full_model"] = ""

In these hyperparameters, output_dir is the location the continuous pre-training job later reads the converted weights from. If you are at this cell, you should have already followed the Training_llama2_70b.ipynb notebook, gone through the process of setting up your SageMaker client and Docker image, and prepared the pre-trained weights for pre-training. You're now ready to perform the continuous pre-training process on the llama2-70b model.

We use the following parameters to take the pre-trained weights stored in output_dir in the convert_pretrained_weights.ipynb file to be reused continuously for pre-training:

hyperparameters["checkpoint_dir"] = "/opt/ml/checkpoints/checkpts"
hyperparameters["checkpoint_freq"] = 10
hyperparameters["num_kept_checkpoint"] = 1
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["save_load_xser"] = 0
hyperparameters["pretrained_weight_dir"] = "/opt/ml/checkpoints/llama70b_weights"

After these hyperparameters are implemented, you can run the rest of the notebook cells to complete the continuous pre-training process. After the SageMaker estimator has completed the training job, you can locate the new checkpoint in the S3 checkpoint directory containing the weights. You can now locate the convert_Nxd_to_hf.ipynb file to get the checkpoint ready for inferencing.
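If you want to confirm that the new checkpoint landed in Amazon S3 before moving on, one quick option is to list the objects under your checkpoint prefix. The bucket and prefix below are placeholders for the values that make up your checkpoint_s3_uri.

import boto3

# Placeholders: replace with the bucket and prefix from your checkpoint_s3_uri.
bucket = "<your-sagemaker-session-bucket>"
prefix = "neuron_llama_experiment/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"])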

Convert the Neuron Distributed checkpoint for inferencing

Checkpoints play a vital role in distributed training with the NeuronX library because its checkpoints can be made compatible with Hugging Face Transformers. You can get the training job output ready for inferencing by taking the output that was saved as a NeuronX distributed checkpoint and converting the weights into .pt weight files.

To convert the checkpoints to Hugging Face format using NeuronX, you first need to save the S3 nxd_checkpoint_path directory:

# S3 checkpoint directory that contains the weights and other relevant data from the continuous pre-trained model
checkpoint_s3_uri = "&lt;pre-training-checkpoint-s3-uri&gt;"
nxd_checkpoint_path = f"s3://{checkpoint_s3_uri}/neuronx_llama_experiment/checkpts/step10/model/"
# Checkpoint is saved as part of Notebook 2

After you save the checkpoint path in nxd_checkpoint_path, you define your hyperparameters and configure your SageMaker estimator for the conversion job. You then run the estimator's fit() method to convert the pre-trained weights into a checkpoint ready for inferencing.
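The conversion notebook defines its own estimator before calling fit(). The following is a minimal sketch that reuses the PyTorch estimator pattern from the training step; the entry point name here is hypothetical, so substitute the conversion script referenced in the Convert_Nxd_to_hf.ipynb notebook.

from sagemaker.pytorch import PyTorch

# Sketch only: the entry point and hyperparameters must match the conversion notebook.
estimator = PyTorch(
    entry_point="convert_checkpoints.py",  # hypothetical name; use the script from the notebook
    source_dir="./scripts",
    instance_type="ml.trn1.32xlarge",
    instance_count=1,
    image_uri=docker_image,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name=job_name,
    input_mode="FastFile",
    distribution={"torch_distributed": {"enabled": True}},
)

With the estimator configured, launch the conversion job: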

# Start SageMaker job
estimator.fit({"checkpoint": nxd_checkpoint_path})

Summary

You have successfully performed continuous pre-training on a llama2-70b model and converted the resulting pre-trained weights and checkpoint so they can be used to serve inference with the Neuron SDK on Trainium instances. By following the solution in this post, you should now know how to configure a pipeline for continuous pre-training of an LLM using SageMaker and Trainium accelerator chips.

For more information on how to use Trainium for your workloads, refer to the Neuron SDK documentation or reach out directly to the team. We value customer feedback and are always looking to engage with ML practitioners and builders. Feel free to leave comments or questions in the comments section.


About the authors

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions and conducting research to help customers hyperscale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA and enjoys writing, reading, exercising, and building applications in his free time.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and Data Analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he's not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, train, and migrate ML production workloads to SageMaker at scale. He specializes in deep learning, especially in the area of NLP and CV. Outside of work, he enjoys running and hiking.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect, and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.
