Accelerate software development and leverage your business data with generative AI assistance from Amazon Q
We believe generative artificial intelligence (AI) has the potential to transform virtually every customer experience. To make this possible, we’re rapidly innovating to provide the most comprehensive set of capabilities across the three layers of the generative AI stack. This includes the bottom layer with infrastructure to train Large Language Models (LLMs) and other Foundation Models (FMs) and produce inferences or predictions, the middle layer with tools to easily and rapidly build generative AI applications, and the top layer, where we’re investing in game-changing applications. While all of these layers are important for the advancement of generative AI, I’m excited today to share more on our investments in the top application layer.
With the assistance of generative AI, everyone from developers and business analysts to employees in specialized areas like customer service or supply chain operations can be more productive, creative, and data-driven than ever before. But for generative AI apps and assistants to be truly useful at work, they must know an organization’s data, its customers, its operations, and its business. Many of today’s assistants can’t be easily personalized, and they weren’t designed to meet the data privacy and security requirements companies need.
That’s why we invented Amazon Q, the most capable generative AI-powered assistant for accelerating software development and leveraging business data. I’m excited to share that Amazon Q Developer, Amazon Q Business, and Amazon Q in QuickSight are available today along with several new features. You can see a quick breakdown of today’s announcement and demos in this video from Dr. Matt Wood.
Amazon Q is the most capable work assistant available today, and we built it with security and privacy in mind from the start: if an employee can’t access a data source normally, they can’t access it through Q either.
We are bringing you a ton of great capabilities and experiences today, and I thought I’d call out just a few.
Amazon Q Developer – your assistant for the entire software development lifecycle
Amazon Q Developer helps developers and IT professionals (IT pros) with all of their tasks—from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. Customers are already seeing the value in Amazon Q Developer: Eviden, a digital transformation services company, is experiencing 20-40% in productivity gains; Switchboard MD, a healthcare company, has reduced their time to deploy new features for products by 25%; and Datapel Systems, a warehouse management and inventory stock solutions company, is achieving a remarkable efficiency improvement of at least 70%.
Amazon Q Developer helps developers build faster and more securely by generating code suggestions and recommendations in near real time. In fact, Amazon Q Developer has the highest reported code acceptance rates in the industry for assistants that perform multi-line code suggestions, with BT Group recently reporting they accepted 37% of Q’s code suggestions and National Australia Bank reporting a 50% acceptance rate. Previously, some of these coding assistant capabilities were provided by Amazon CodeWhisperer, and are now part of Amazon Q Developer.
Amazon Q Developer agent capabilities can autonomously perform a range of tasks–everything from implementing features, documenting, and refactoring code, to performing software upgrades. You can ask Amazon Q Developer to add a new checkout feature to your e-commerce app, and it will analyze your existing codebase, map out the implementation plan spanning multiple files, and upon your approval, execute all the required code changes and tests in minutes. Check out this feature in action, Implement an API with Amazon Q Developer Agent for Software Development. Carrying out these tasks, the agent for software development achieved the highest scores of 13.4% on the SWE-bench leaderboard and 20.5% on the SWE-bench (Lite) leaderboard, a dataset that benchmarks coding capabilities. These updates will be available to customers in the coming days.
Q can also automate app upgrades, reducing days of work to minutes. Recently, a five-person Amazon team used the Amazon Q Code Transformation agent to upgrade over 1,000 production applications from Java 8 to Java 17 in just two days (the average time per application was less than 10 minutes), saving months of time and improving application performance. Today, Amazon Q performs Java language upgrades, and cross-platform .NET upgrades are coming soon to accelerate migrations to Linux, saving customers millions in licensing fees. Check out the code transformation agent in action, Upgrade a Java App with Amazon Q Developer Agent for Code Transformation.
Starting today, you can also ask Q Developer questions about your AWS account like “What instances are currently running in US East 1?” or “What’s my S3 bucket encryption?” or “What were my EC2 costs by region last month?” and Amazon Q Developer will list the resources and details, in a summarized answer with links to learn more.
Learn more about Amazon Q Developer at the AWS News Blog.
Amazon Q Business empowers employees to be more creative, data-driven, efficient, prepared, and productive
Our vision for Amazon Q Business is to make the power of generative AI accessible to every business to get insights from all their data (unstructured and structured), take actions, and build applications.
Most companies have troves of valuable data that is hard to access and parse through. With Amazon Q Business, employees can get answers to questions across business data such as company policies, product information, business results, code base, people, and many other topics by connecting to enterprise data repositories to summarize the data logically, analyze trends, and engage in dialog about the data. To make this possible, Amazon Q Business has more built-in, managed, and secure data connectors to connect your enterprise data than any other generative AI assistant. This includes commonly used business tools, such as wikis, intranets, Atlassian, Gmail, Microsoft Exchange, Salesforce, ServiceNow, Slack, and Amazon Simple Storage Service (Amazon S3). You can even build custom plugins enabling Amazon Q Business to directly take actions like submitting time-off requests. Need the latest sales figures summarized? Looking for competitive intel on a prospective client? Amazon Q Business’s advanced language capabilities will quickly synthesize relevant info from scattered documents, databases, and chat logs into a coherent response.
One of the most exciting announcements today is a feature of Amazon Q Business called Amazon Q Apps. Amazon Q Apps helps every employee go from conversation to a generative AI-powered app in seconds, making it much easier to streamline and automate daily tasks. Creating an application with Amazon Q Apps is straightforward—employees can describe the type of app they want in natural language, or just tell Amazon Q Apps to do it from a conversation where Amazon Q helped solve a problem. For instance, a marketer could ask Q Apps to create an app that generates compelling customer stories by just inputting the customer’s name, products used, business challenge, and business impact. In seconds, Q creates the app, which can then be shared with other marketers throughout the organization. Check out more examples at Introducing Amazon Q Apps (Preview).
Recently, I sat down with Praerit Garg, President of Product & Innovation at Smartsheet, to discuss how Amazon Q Business is helping their employees get answers faster and ultimately improve productivity, as well as onboard new hires more quickly.
Learn more about Amazon Q Business at the Amazon Machine Learning Blog.
Generative BI allows analysts to build detailed dashboards in minutes and business users to get insights fast
Historically, businesses have stored large amounts of their valuable structured data in their databases and data warehouses, which are usually accessible only through business intelligence (BI) tools. When business executives needed to extract information from the data, they had to rely on over-taxed business analysts to build dashboards, which could often take days or weeks. Even when dashboards were created, it was difficult to extract and share important insights from them. Now, Amazon Q brings its advanced generative AI technology to Amazon QuickSight, AWS’s unified BI service built for the cloud. With Amazon Q in QuickSight, customers get a generative BI assistant that allows business analysts to use natural language to reduce the time to build a BI dashboard from hours to minutes. And it helps everyone in an organization become more data-driven by making data more accessible. It is the only BI product where business users can get AI-driven executive summaries of dashboards, ask questions of their data beyond what is presented in the dashboards for instant answers, and create detailed and customizable data stories highlighting key insights, trends, and drivers. All you have to do is say what you want in natural language.
Showpad, a leading provider of sales enablement solutions, is able to give customers the ability to query data without the need for a complex user interface or the need to know SQL. Integration took just a little over a week, and Showpad was able to customize the experience so it blends seamlessly into their own experience.
Clinigence Health is leveraging Amazon Q in QuickSight to identify insights and trends within data in minutes, a process that previously took hours.
See more about Amazon Q in QuickSight at the Business Intelligence Blog.
New pricing including Amazon Q Developer Free Tier
Along with these new features, we’re also announcing new pricing tiers that make it even easier to use Amazon Q. We are offering an Amazon Q Developer Free Tier, which provides individuals with free coding assistance in the IDE and command line, as well as free limited usage of most advanced capabilities, like Amazon Q Developer Agents. For customers who want organizational license management, the ability to customize Amazon Q Developer to their codebase for more relevant coding suggestions, as well as higher limits on advanced capabilities, we are offering the Amazon Q Developer Pro tier for $19 per user per month. The Amazon Q Business Pro subscription at $20 per user per month provides users access to the full suite of Amazon Q Business capabilities, including access to Amazon Q Apps and Amazon Q in QuickSight (Reader Pro). Finally, you can access a free trial of Amazon Q Developer Pro and Amazon Q Business Pro until June 30, 2024.
We’re excited about bringing you these new capabilities across Amazon Q Developer, Amazon Q Business, and Amazon Q in QuickSight. We are just getting started in helping you be more productive, better leverage your organizational data, and create new ways of working.
Check out the following resources to learn more about this announcement:
- Visit community.aws to find deep-dive technical content and to discover how our builder communities are using Amazon Q in their solutions
- Learn more about Generative AI on AWS
- Amazon Q Introduction (15-minute course)
- Amazon Q Business Getting Started (one-hour course that introduces developers and technical audiences to Amazon Q Business’s features and use cases, and teaches how to build a chatbot using Amazon Q)
About the author
Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.
Amazon Q Business and Amazon Q in QuickSight empower employees to be more data-driven and make better, faster decisions using company knowledge
Today, we announced the General Availability of Amazon Q, the most capable generative AI-powered assistant for accelerating software development and leveraging companies’ internal data. “During the preview, early indications signaled Amazon Q could help our customers’ employees become more than 80% more productive at their jobs; and with the new features we’re planning on introducing in the future, we think this will only continue to grow,” shared Dr. Swami Sivasubramanian, vice president of Artificial Intelligence and Data at AWS. Employees across every organization collectively spend hours every week searching internal sources for information, piecing together analyses, writing reports, building presentations, creating and searching for insights in dashboards, or adapting content for different customers or audiences. We built Amazon Q Business and Amazon Q in QuickSight to make this much simpler.
Amazon Q Business is a generative AI–powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.
Amazon Q Business unites more data sources than any other generative AI assistant available today
Amazon Q Business easily and securely connects to 40+ commonly used business tools, such as wikis, intranets, Atlassian, Gmail, Microsoft Exchange, Salesforce, ServiceNow, Slack, and Amazon Simple Storage Service (Amazon S3)–more than any other generative AI assistant available today. Simply point Q at your enterprise data repositories, and it will search all of your data, summarize it logically, analyze trends, and engage in dialog with end users about the data. This helps business users access all of their data, no matter where it resides in their organization. Watch use cases of Amazon Q Business through its simple web-based interface.
Built from the ground up with security and privacy in mind
Amazon Q Business is built to be secure and private by design. It seamlessly integrates with a customer’s existing identities, roles, and access permissions to personalize the interactions for each individual user, while maintaining the highest levels of security. It generates accurate responses based on enterprise information, and customers can restrict sensitive topics, block keywords, and filter out inappropriate content. Amazon Q Business does not use customer content to train the underlying models for anyone else. If you want to learn more about how to set up and administer Q Business, check out the News Blog: Amazon Q Business.
Generative BI allows analysts and business users to build detailed dashboards in minutes
Amazon QuickSight is AWS’s unified Business Intelligence (BI) service built for the cloud. With Amazon Q in QuickSight, customers get a Generative BI assistant that allows business analysts to use natural language to build BI dashboards in minutes and easily create visualizations and complex calculations. It is also the only BI product where business users can get AI-driven executive summaries of dashboards, ask questions of their data beyond what is presented in the dashboards for instant answers, and create detailed and customizable data stories highlighting key insights, trends, and drivers. Business users can ask to “build a story about how the business has changed over the last month for a business review with leadership” and in seconds Amazon Q creates a story in multiple parts explaining different aspects of their data with specific insights and supporting visuals, including specific ideas of how to improve the business. Users can choose to lay out content in an easy-to-share document or presentation where they can customize text, images, and themes, and use Amazon Q to rewrite and improve the text. You can read more about all the updates to Amazon QuickSight at the AWS Business Intelligence Blog, and watch Unlock the power of Generative BI with Amazon Q in QuickSight to learn about creating and sharing content based on your own data.
First-of-its-kind capability that helps every employee go from conversation to generative AI-powered app in seconds
Today, we announced a new capability of Amazon Q Business, called Amazon Q Apps (in preview), that allows employees to easily and quickly create generative AI-powered apps based on their company data, without requiring any prior coding experience. With Amazon Q Apps, employees simply describe the app they want in natural language, or they can take an existing conversation where Amazon Q Business helped them solve a problem and, with one click, Amazon Q will instantly generate an app that accomplishes their desired task and can be easily shared across their organization.
For example, generating employee onboarding plans for new recruits can be a long and laborious process. It requires many hours of searching through different data stores and documents to find the appropriate content for the new employee, and oftentimes the content is out of date or not specific enough to their role. With Amazon Q, an HR professional can simply describe an app that pulls together a personalized onboarding plan for a new employee, just by inputting their name and employee ID. In a matter of seconds, Amazon Q Apps will build an app that can automatically generate a personalized onboarding plan tailored to the employee, their role, and the department using the latest data. The HR professional can then share the app with hiring managers across the company to instantly build personalized onboarding plans for their own teams. Now, with Amazon Q Apps, business users can easily, quickly, and securely build an app based on enterprise information to improve their work productivity. Watch Introducing Amazon Q Apps to see how easy it is to implement.
Bani Bedi, senior vice president, Corporate Development and Strategy at Smartsheet, said:
“Amazon Q Business is streamlining knowledge management and accelerating employee productivity at Smartsheet. Previously, it was too difficult for our 3,300 employees to find the information they needed across public help documents, training courses, and hundreds of all-employee Slack help channels. We have consolidated our organizational knowledge into a single AI engine to give our workforce immediate answers, significantly boosting employee productivity.”
You can hear more in the interview AWS Fireside Chat with Smartsheet.
We’re really excited to share Amazon Q Business and Amazon Q in QuickSight with you. If you want more information on generative AI at AWS, you can find it at AWS Generative AI.
About the Authors
Mukesh Karki is GM of Amazon Q Business.
Tracy Daugherty is GM of Amazon QuickSight.
Develop and train large models cost-efficiently with Metaflow and AWS Trainium
This is a guest post co-authored with Ville Tuulos (Co-founder and CEO) and Eddie Mattia (Data Scientist) of Outerbounds.
To build a production-grade AI system today (for example, to do multilingual sentiment analysis of customer support conversations), what are the primary technical challenges? Historically, natural language processing (NLP) would be a primary research and development expense. In 2024, however, organizations are using large language models (LLMs), which require relatively little focus on NLP, shifting research and development from modeling to the infrastructure needed to support LLM workflows.
For AWS and Outerbounds customers, the goal is to build a differentiated machine learning and artificial intelligence (ML/AI) system and reliably improve it over time. This often means that relying on a third-party LLM API won’t do for security, control, and scale reasons. Owning the infrastructural control and know-how to run workflows that power AI systems is a requirement.
Returning to the original question, three MLOps challenges may arise:
- You need high-quality data to train and fine-tune models
- You need a diverse cloud infrastructure for experimentation, training, tracking, and orchestrating the production system
- You need a significant amount of compute to power the system
In this post, we highlight a collaboration between Outerbounds and AWS that takes a step towards addressing the last two challenges. First, the AWS Trainium accelerator provides a high-performance, cost-effective, and readily available solution for training and fine-tuning large models. Second, open source Metaflow provides the necessary software infrastructure to build production-grade ML/AI systems in a developer-friendly manner. It provides an approachable, robust Python API for the full infrastructure stack of ML/AI, from data and compute to workflows and observability.
In the following sections, we first introduce Metaflow and the Trainium integration. We then show how to set up the infrastructure stack you need to take your own data assets and pre-train or fine-tune a state-of-the-art Llama2 model on Trainium hardware.
Metaflow overview
Metaflow was originally developed at Netflix to enable data scientists and ML engineers to build ML/AI systems quickly and deploy them on production-grade infrastructure. Netflix open sourced the framework in 2019 with integrations to AWS services like AWS Batch, AWS Step Functions (see Unbundling Data Science Workflows with Metaflow and AWS Step Functions), Kubernetes, and throughput-optimized Amazon Simple Storage Service (Amazon S3), so you can build your own Netflix-scale ML/AI environment in your AWS account.
The key motivation of Metaflow is to address the typical needs of all ML/AI projects with a straightforward, human-centric API, from prototype to production (and back). The following figure illustrates this workflow.
Metaflow’s coherent APIs simplify the process of building real-world ML/AI systems in teams. Metaflow helps scientists and engineers access, move, and manipulate data efficiently; track and version experiments and models; orchestrate and integrate workflows to surrounding systems; and scale compute to the cloud easily. Moreover, it has first-class support for teams, such as namespacing and deploying workflows in versioned production branches.
Now, with today’s announcement, you have another straightforward compute option for workflows that need to train or fine-tune demanding deep learning models: running them on Trainium.
How Metaflow integrates with Trainium
From a Metaflow developer perspective, using Trainium is similar to other accelerators. After a Metaflow deployment is configured to access Trainium chips through the compute platform customers use with Metaflow (which we discuss later in this post), ML engineers and data scientists can operate autonomously in the land of deep learning code. Scientists can write PyTorch and Hugging Face code, and use the AWS Neuron SDK along with the NeuronX Distributed library to optimize these frameworks for Trainium devices, while Metaflow integrates with the underlying AWS services to separate concerns about how to actually run the code at scale.
As illustrated by the following figure, you can declare the following in a few lines of Python code:
- How many nodes to launch
- How many Trainium devices to use per node
- How the nodes are interconnected (Elastic Fabric Adapter)
- How often to check the resource utilization
- What training script the torchrun process should run on each node
You can initialize the training process in the start step, which directs the next train step to run on two parallel instances (num_parallel=2). The decorators of the train step configure your desired training setup:
- @torchrun – Sets up PyTorch Distributed across two instances
- @batch – Configures the Trainium nodes, managed by AWS Batch
- @neuron_monitor – Activates the monitoring UI that allows you to monitor the utilization of the Trainium cores
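To make this concrete, the following is a minimal sketch of what such a flow can look like. It assumes the @torchrun and @neuron_monitor decorators provided by Metaflow’s torchrun and Trainium extensions are installed; the queue name, image URI, resource parameters, and training script name are placeholders rather than values from the example repository.

```python
from metaflow import FlowSpec, step, batch, torchrun, neuron_monitor, current

class TrainiumDemoFlow(FlowSpec):

    @step
    def start(self):
        # Direct the train step to run as two parallel worker nodes.
        self.next(self.train, num_parallel=2)

    @neuron_monitor   # assumed extension decorator: surfaces NeuronCore utilization in the UI
    @torchrun         # assumed extension decorator: wraps the step in torchrun (PyTorch Distributed)
    @batch(
        trainium=16,                  # assumed parameter: Trainium devices per node
        efa=8,                        # assumed parameter: EFA interfaces for the node interconnect
        queue="trainium-job-queue",   # placeholder AWS Batch job queue
        image="<account>.dkr.ecr.<region>.amazonaws.com/metaflow-trn:latest",  # placeholder image
    )
    @step
    def train(self):
        # torchrun launches the same training script on every node in the cluster.
        current.torch.run(entrypoint="train_llama2.py")   # placeholder script name
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainiumDemoFlow()
```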
Metaflow allows you to configure all this functionality in a few lines of code. However, the main benefit is that you can embed Trainium-based training code inside a larger production system, using the scaffolding provided by Metaflow.
Benefits of using Trainium with Metaflow
Trainium and Metaflow work together to solve problems like what we discussed earlier in this post. The Trainium devices and Neuron software stack make it straightforward for teams to access and effectively use the high-performance hardware needed for cutting-edge AI.
Trainium provides a few key benefits for building real-world AI systems:
- Trainium instances can help reduce generative AI model training and fine-tuning costs by up to 50% over comparable instances on AWS
- It is readily available in many AWS Regions, is often more available than GPU-based instance types, and scaling is available in the most popular Regions worldwide
- The hardware and software are mature and actively developed by AWS
If you have been struggling with GPU availability and cost, you’ll surely appreciate these benefits. Using Trainium effectively can require a bit of infrastructure effort and knowledge, which is a key motivation for this integration. Through Metaflow and the deployment scripts provided in this post, you should be able to get started with Trainium with ease.
Besides easy access, using Trainium with Metaflow brings a few additional benefits:
Infrastructure accessibility
Metaflow is known for its developer-friendly APIs that allow ML/AI developers to focus on developing models and applications, and not worry about infrastructure. Metaflow helps engineers manage the infrastructure, making sure it integrates with existing systems and policies effortlessly.
Data, model, and configuration management
Metaflow provides built-in, seamless artifact persistence, tracking, and versioning, which covers the full state of the workflows, making sure you’ll follow MLOps best practices. Thanks to Metaflow’s high-throughput S3 client, you can load and save datasets and model checkpoints very quickly, without having to worry about extra infrastructure such as shared file systems. You can use artifacts to manage configuration, so everything from hyperparameters to cluster sizing can be managed in a single file, tracked alongside the results.
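As a small illustration of this pattern (the names and values here are illustrative, not from the example repository), anything assigned to self becomes a versioned artifact that travels with the run:

```python
from metaflow import FlowSpec, step

class ConfigDemoFlow(FlowSpec):

    @step
    def start(self):
        # Hyperparameters and cluster sizing stored as artifacts, tracked alongside results.
        self.config = {"learning_rate": 3e-4, "num_nodes": 2, "seq_len": 4096}
        self.next(self.train)

    @step
    def train(self):
        # Artifacts assigned in earlier steps are available automatically in later steps.
        print("Training with", self.config)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ConfigDemoFlow()
```

After a run, the same artifacts are readable through the Metaflow Client API, so results and the configuration that produced them stay together.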
Observability
Metaflow comes with a convenient UI, which you can customize to observe metrics and data that matter to your use cases in real time. In the case of Trainium, we provide a custom visualization that allows you to monitor utilization of the NeuronCores inside Trainium instances, making sure that resources are used efficiently. The following screenshot shows an example of the visualization for core (top) and memory (bottom) utilization.
Multi-node compute
Finally, a huge benefit of Metaflow is that you can use it to manage advanced multi-instance training clusters, which would take a lot of involved engineering otherwise. For instance, you can train a large PyTorch model, sharded across Trainium instances, using Metaflow’s @torchrun and @batch decorators.
Behind the scenes, the decorators set up a training cluster using AWS Batch multi-node with a specified number of Trainium instances, configured to train a PyTorch model across the instances. By using the launch template we provide in this post, the setup can benefit from low-latency, high-throughput networking via Elastic Fabric Adapter (EFA) networking interfaces.
Solution overview
As a practical example, let’s set up the complete stack required to pre-train Llama2 for a few epochs on Trainium using Metaflow. The same recipe applies to the fine-tuning examples in the repository.
Deploy and configure Metaflow
If you already use a Metaflow deployment, you can skip to the next step to deploy the Trainium compute environment.
Deployment
To deploy a Metaflow stack using AWS CloudFormation, complete the following steps:
- Download the CloudFormation template.
- On the CloudFormation console, choose Stacks in the navigation pane.
- Choose Create new stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template file.
- Upload the template.
- Choose Next.
- If you are brand new to Metaflow, or are trying this recipe as a proof of concept, we suggest you change the APIBasicAuth parameter to false and leave all other default parameter settings.
- Complete the stack creation process.
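If you prefer to script the console steps above, the following is a hedged boto3 sketch; the template filename and stack name are placeholders, and APIBasicAuth is the parameter mentioned in the preceding step.

```python
import boto3

cfn = boto3.client("cloudformation")
with open("metaflow-cfn-template.yml") as f:      # placeholder: the template you downloaded
    template_body = f.read()

cfn.create_stack(
    StackName="metaflow",                         # placeholder stack name
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "APIBasicAuth", "ParameterValue": "false"}],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)
# Block until the stack finishes creating so the outputs are available.
cfn.get_waiter("stack_create_complete").wait(StackName="metaflow")
```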
After you create the CloudFormation stack and configure Metaflow to use the stack resources, there is no additional setup required. For more information about the Metaflow components that AWS CloudFormation deploys, see AWS Managed with CloudFormation.
Configuration
To use the stack you just deployed from your laptop or cloud workstation, complete the following steps:
- Prepare a Python environment and install Metaflow in it:
- Run metaflow configure aws in a terminal.
After the CloudFormation stack deployment is complete, the Outputs on the stack details page will contain a list of resource names and their values, which you can use in the Metaflow AWS configuration prompts.
Deploy a Trainium compute environment
The default Metaflow deployment from the previous step has an AWS Batch compute environment, but it will not be able to schedule jobs to run on Amazon Elastic Compute Cloud (Amazon EC2) instances with Trainium devices. To deploy an AWS Batch compute environment for use with Trainium accelerators, you can use the following CloudFormation template. Complete the following steps:
- Download the CloudFormation template.
- On the CloudFormation console, choose Stacks in the navigation pane.
- Choose Create new stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template file.
- Upload the template.
- Choose Next.
- Complete the stack creation process.
Take note of the name of the AWS Batch job queue that you created to use in a later step.
Prepare a base Docker image to run Metaflow tasks
Metaflow tasks run inside Docker containers when AWS Batch is used as a compute backend. To run Trainium jobs, developers need to build a custom image and specify it in the @batch decorator Metaflow developers use to declare task resources:
To make the image, complete the following steps:
- Create an Amazon Elastic Container Registry (Amazon ECR) registry to store your image in.
- Create and log in to an EC2 instance with sufficient memory. For this post, we used Ubuntu x86 OS on a c5.4xlarge instance.
- Install Docker.
- Copy the following Dockerfile to your instance.
- Authenticate with the upstream base image provider:
- Build the image:
- On the Amazon ECR console, navigate to the ECR registry you created, and you will find the commands needed to authenticate from the EC2 instance and push your image.
Clone the repository on your workstation
Now you’re ready to verify the infrastructure is working properly, after which you can run complex distributed training code like Llama2 training. To get started, clone the examples repository to the workstation where you configured Metaflow with AWS:
Verify the infrastructure with an allreduce example
To validate your infrastructure configuration, complete the following steps:
- Navigate to the allreduce example:
- Open the flow.py file and make sure to set the job queue and image to the name of the queue you deployed with AWS CloudFormation and the image you pushed to Amazon ECR, respectively.
- To run the allreduce code, run the following Metaflow command:
You can find the logs (truncated in the following code snippet for readability) in the Metaflow UI:
Configure and run any Neuron distributed code
If the allreduce test runs successfully, you are ready to move on to meaningful workloads. To complete this onboarding, complete the following steps:
- Navigate to the llama2-7b-pretrain-trn directory.
- Similar to the allreduce example, before using this code, you need to modify the config.py file so that it matches the AWS Batch job queue and ECR image that you created. Open the file, find these lines, and modify them to your values:
- After modifying these values, and any others you want to experiment with, run the following command:
- Then run the workflow to pre-train your own Llama2 model from scratch:
This will train the model on however many nodes you specify in the config.py file, and will push the trained model result to Amazon S3 storage, versioned by Metaflow’s data store using the flow name and run ID.
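As a hedged sketch of how you might fetch those results afterward with the Metaflow Client API (the flow and artifact names below are placeholders; use the names defined in the repository’s flow):

```python
from metaflow import Flow

run = Flow("Llama2PretrainFlow").latest_successful_run   # placeholder flow name
print("Run ID:", run.id)                                 # the run ID that versions the S3 outputs
print("Checkpoint:", run.data.checkpoint_path)           # placeholder artifact name
```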
Logs will appear like the following (truncated from a sample run of five steps for readability):
Clean up
To clean up resources, delete the CloudFormation stacks for your Metaflow deployment and Trainium compute environment:
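The following is a hedged teardown sketch with boto3; the stack names are placeholders for whatever you named your Metaflow and Trainium stacks.

```python
import boto3

cfn = boto3.client("cloudformation")
for stack in ["metaflow", "trainium-batch-compute"]:      # placeholder stack names
    cfn.delete_stack(StackName=stack)
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack)
```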
Conclusion
You can get started experimenting with the solution presented in this post in your environment today. Follow the instructions in the GitHub repository to pre-train a Llama2 model on Trainium devices. Additionally, we have prepared examples for fine-tuning Llama2 and BERT models, demonstrating how you can use the Optimum Neuron package to use the integration from this post with any Hugging Face model.
We are happy to help you get started. Join the Metaflow community Slack for support, to provide feedback, and share experiences!
About the authors
Ville Tuulos is a co-founder and CEO of Outerbounds, a developer-friendly ML/AI platform. He has been developing infrastructure for ML and AI for over two decades in academia and as a leader at a number of companies. At Netflix, he led the ML infrastructure team that created Metaflow, a popular open-source, human-centric foundation for ML/AI systems. He is also the author of a book, Effective Data Science Infrastructure, published by Manning.
Eddie Mattia is in scientific computing and more recently building machine learning developer tools. He has worked as a researcher in academia, in customer-facing and engineering roles at MLOps startups, and as a product manager at Intel. Currently, Eddie is working to improve the open-source Metaflow project and is building tools for AI researchers and MLOps developers at Outerbounds.
Vidyasagar specializes in high performance computing, numerical simulations, optimization techniques and software development across industrial and academic environments. At AWS, Vidyasagar is a Senior Solutions Architect developing predictive models, generative AI and simulation technologies. Vidyasagar has a PhD from the California Institute of Technology.
Diwakar Bansal is an AWS Senior Specialist focused on business development and go-to-market for GenAI and Machine Learning accelerated computing services. Diwakar has led product definition, global business development, and marketing of technology products in the fields of IOT, Edge Computing, and Autonomous Driving focusing on bringing AI and Machine leaning to these domains. Diwakar is passionate about public speaking and thought leadership in the Cloud and GenAI space.
Sadaf Rasool is a Machine Learning Engineer with the Annapurna ML Accelerator team at AWS. As an enthusiastic and optimistic AI/ML professional, he holds firm to the belief that the ethical and responsible application of AI has the potential to enhance society in the years to come, fostering both economic growth and social well-being.
Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.
Cohere Command R and R+ are now available in Amazon SageMaker JumpStart
This blog post is co-written with Pradeep Prabhakaran from Cohere.
Today, we are excited to announce that Cohere Command R and R+ foundation models are available through Amazon SageMaker JumpStart to deploy and run inference. Command R/R+ are the state-of-the-art retrieval augmented generation (RAG)-optimized models designed to tackle enterprise-grade workloads.
In this post, we walk through how to discover and deploy Cohere Command R/R+ via SageMaker JumpStart.
What are Cohere Command R and Command R+?
Cohere Command R is a family of highly scalable language models that balance high performance with strong accuracy. The Command R family, which includes the Command R and Command R+ models, is optimized for RAG-based workflows such as conversational interaction and long-context tasks, enabling companies to move beyond proof of concept and into production. These powerful models are designed to handle complex tasks with high performance and strong accuracy, making them suitable for real-world applications.
Command R boasts high precision on RAG and tool use tasks, low latency and high throughput, a long 128,000-token context length, and strong capabilities across 10 key languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese.
Command R+ is the newest model, optimized for extremely performant conversational interaction and long-context tasks. It is recommended for workflows that lean on complex RAG functionality and multi-step tool use (agents), while Cohere R is well-suited for simpler RAG and single-step tool use tasks, as well as applications where price is a major consideration.
What is SageMaker JumpStart?
With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances from a network-isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy Cohere Command R/R+ models with a few choices in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK. Doing so enables you to derive model performance and machine learning operations (MLOps) controls with SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs.
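As a hedged sketch of the programmatic path with the SageMaker Python SDK (the model ID below is a placeholder; look up the exact identifier in SageMaker Studio, and note that the Marketplace subscription described later still applies):

```python
from sagemaker.jumpstart.model import JumpStartModel

# Placeholder model ID; confirm the exact JumpStart identifier for Cohere Command R+.
model = JumpStartModel(model_id="cohere-command-r-plus")

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p5.48xlarge",   # H100-based instance referenced in this post
)
```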
The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping provide data security. Cohere Command R/R+ models are available today for deployment and inferencing in Amazon SageMaker Studio in us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-1 (N. California), us-west-2 (Oregon), Canada (Central), eu-central-1 (Frankfurt), eu-west-1 (Ireland), eu-west-2 (London), eu-west-3 (Paris), eu-north-1 (Stockholm), ap-southeast-1 (Singapore), ap-southeast-2 (Sydney), ap-northeast-1 (Tokyo), ap-northeast-2 (Seoul), ap-south-1 (Mumbai), and sa-east-1 (Sao Paulo).
Discover models
You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
From the SageMaker JumpStart landing page, you can easily discover various models by browsing through different hubs, which are named after model providers. The Cohere Command R and R+ models are available in the Cohere hub. If you don’t see these models, ensure you have the latest SageMaker Studio version by shutting down and restarting Studio Classic Apps.
To find the Command R and R+ models, search for “Command R” in the search box located at the top left of the SageMaker JumpStart landing page. Each model can be deployed on Amazon Elastic Compute Cloud (Amazon EC2) P5 instances powered by NVIDIA H100 Tensor Core GPUs (ml.p5.48xlarge) and Amazon EC2 P4de instances powered by NVIDIA A100 Tensor Core GPUs (ml.p4de.24xlarge).
Deploy a model
To illustrate model deployment, we’ll deploy Cohere Command R+ on NVIDIA H100. Choose the model card to open the corresponding model detail page.
When you choose Deploy, a window appears prompting you to subscribe to the model on AWS Marketplace. Choose Subscribe, which redirects you to the AWS Marketplace listing for Cohere Command R+ (H100). Follow the on-screen instructions to complete the subscription process.
Once subscribed, return to the model detail page and choose Deploy in the window. The deployment process initiates.
Alternatively, you can choose Notebooks on the model card and open the example notebook in JupyterLab. This notebook provides end-to-end guidance on deploying the model for inference and cleaning up resources. You can also find this example notebook in the Cohere SageMaker GitHub repository. To ensure the security of the endpoint, you can configure an AWS Key Management Service (AWS KMS) key for a SageMaker endpoint configuration.
If an endpoint has already been created, you can simply connect to it:
Real-time inference
Once your endpoint has been connected, you can perform real-time inference using the co.chat endpoint.
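The example notebooks use Cohere’s SDK (co.chat). As a hedged alternative sketch, the same endpoint can be invoked with the raw SageMaker runtime client; the endpoint name and the payload and response field names here are assumptions based on Cohere’s Chat API shape.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

payload = {
    "message": "Summarize the key differences between Command R and Command R+.",
    "temperature": 0.3,
}

response = runtime.invoke_endpoint(
    EndpointName="cohere-command-r-plus",   # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result["text"])   # response field name assumed
```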
Multilingual capabilities
Command R/R+ is optimized to perform well in 10 key languages, as listed in the introduction. Additionally, pre-training data has been included for the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.
The model has been trained to respond in the language of the user. Here’s an example in Spanish:
Here’s what the response might look like:
Command R/R+ can also perform cross-lingual tasks, such as translation or answering questions about content in other languages.
Chat with documents (RAG)
Command R/R+ can ground its generations. This means that it can generate responses based on a list of supplied document snippets, and it includes citations in its response indicating the source of the information.
For example, the code snippet that follows produces an answer to “How deep is the Mariana Trench?” along with inline citations based on the provided documents.
Request:
Response:
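As a hedged illustration, a grounded request payload can take the following shape; the document snippets are placeholders, and the field names follow Cohere’s publicly documented Chat API.

```python
# Grounded (RAG) chat payload: the model answers from the supplied snippets and cites them.
rag_payload = {
    "message": "How deep is the Mariana Trench?",
    "documents": [
        {
            "title": "Mariana Trench overview",
            "snippet": "The Mariana Trench is the deepest known part of the ocean, "
                       "reaching nearly 11,000 meters at the Challenger Deep.",
        },
        {
            "title": "Ocean trenches",
            "snippet": "Ocean trenches form at subduction zones and mark the deepest "
                       "sections of the seafloor.",
        },
    ],
}
# Send rag_payload with invoke_endpoint as in the earlier example; the response is expected
# to include the generated answer plus a citations field mapping spans back to the documents.
```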
Single-step and multi-step tool use
Command R/R+ comes with a Tool Use API that enables the language model to interact with user-defined tools to automate highly sophisticated tasks. In Tool Use mode, Command R/R+ creates API payloads (JSONs with specific parameters) based on user interactions and conversational history. These can be used to instruct any other application or tool.
For example, an application can be instructed to automatically categorize and route support tickets to the appropriate individual, change a status in customer relationship management (CRM) software, or retrieve relevant snippets from a vector database. Tool use comes in two variants, single-step and multi-step (see the sketch after the following list):
- Single-step tool use enables a richer set of behaviors by leveraging data stored in tools, taking actions through APIs, interacting with a vector database, querying a search engine, etc.
- Multi-step tool use is an extension of this basic idea and allows the model to call more than one tool in a sequence of steps, using the results from one tool call in a subsequent step. This process allows the language model to reason, perform dynamic actions, and quickly adapt based on information coming from external sources.
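The following is a hedged sketch of a single-step tool-use request; the tool name and parameters are illustrative, and the field shapes follow Cohere’s publicly documented Tool Use API.

```python
# Single-step tool use: the model decides which tool to call and with which parameters.
tool_use_payload = {
    "message": "Route ticket #4512 about a billing error to the right team.",
    "tools": [
        {
            "name": "route_support_ticket",
            "description": "Assigns a support ticket to a team.",
            "parameter_definitions": {
                "ticket_id": {"type": "str", "description": "Ticket identifier", "required": True},
                "team": {"type": "str", "description": "Destination team name", "required": True},
            },
        }
    ],
}
# The response is expected to contain the tool calls (name plus parameters) the model wants to
# invoke; your application executes them and, in multi-step mode, feeds the results back for
# the next step of reasoning.
```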
To explore these capabilities further, you can refer to the provided Jupyter notebook and Cohere’s AWS GitHub repository, which offer additional examples showcasing various use cases and applications.
Clean up
After you’ve finished running the notebook and exploring the Cohere Command R and R+ models, it’s essential to clean up the resources you’ve created to avoid incurring unnecessary charges. Follow these steps to delete the resources and stop the billing:
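A hedged cleanup sketch with boto3 follows; the endpoint name is a placeholder, and you can also delete the endpoint, endpoint configuration, and model from the SageMaker console instead.

```python
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "cohere-command-r-plus"   # placeholder: your endpoint name

# Look up the endpoint configuration before deleting the endpoint itself.
config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]

sm.delete_endpoint(EndpointName=endpoint_name)           # stops the billed instance(s)
sm.delete_endpoint_config(EndpointConfigName=config_name)
# The associated model object can be removed from the console or with delete_model.
```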
Conclusion
In this post, we explored how to leverage the powerful capabilities of Cohere’s Command R and R+ models on Amazon SageMaker JumpStart. These state-of-the-art large language models are specifically designed to excel at real-world enterprise use cases, offering unparalleled performance and scalability. With their availability on SageMaker JumpStart and AWS Marketplace, you now have seamless access to these cutting-edge models, enabling you to unlock new levels of productivity and innovation in your natural language processing projects.
About the authors
Pradeep Prabhakaran is a Customer Solutions Architect at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform. Prior to joining Cohere, Pradeep was a Principal Customer Solutions Manager at Amazon Web Services, where he led Enterprise Cloud transformation programs for large enterprises. Prior to AWS, Pradeep has held various leadership positions at consulting companies such as Slalom, Deloitte, and Wipro. Pradeep holds a Bachelor’s degree in Engineering and is based in Dallas, TX.
James Yi is a Senior AI/ML Partner Solutions Architect at Amazon Web Services. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in GenAI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.
Revolutionizing large language model training with Arcee and AWS Trainium
This is a guest post by Mark McQuade, Malikeh Ehghaghi, and Shamane Siri from Arcee.
In recent years, large language models (LLMs) have gained attention for their effectiveness, leading various industries to adapt general LLMs to their data for improved results, making efficient training and hardware availability crucial. At Arcee, we focus primarily on enhancing the domain adaptation of LLMs in a client-centric manner. Arcee’s innovative continual pre-training (CPT) and model merging techniques have brought a significant leap forward in the efficient training of LLMs, with particularly strong evaluations in the medical, legal, and financial verticals. Close collaboration with AWS Trainium has also played a major role in making the Arcee platform extremely performant, not only accelerating model training but also reducing overall costs and enforcing compliance and data integrity in the secure AWS environment. In this post, we show you how efficient we make our continual pre-training by using Trainium chips.
Understanding continual pre-training
Arcee recognizes the critical importance of CPT [1] in tailoring models to specific domains, as evidenced by previous studies such as PMC-LLaMA [2] and ChipNeMo [3]. These projects showcase the power of domain adaptation pre-training in enhancing model performance across diverse fields, from medical applications to industrial chip design. Inspired by these endeavors, our approach to CPT involves extending the training of base models like Llama 2 using domain-specific datasets, allowing us to fine-tune models to the nuances of specialized fields. To further amplify the efficiency of our CPT process, we collaborated with the Trainium team, using their cutting-edge technology to enhance a Llama 2 [4] model with a PubMed dataset [2] comprising 88 billion tokens. This collaboration represents a significant milestone in our quest for innovation, and through this post, we’re excited to share the transformative insights we’ve gained. Join us as we unveil the future of domain-specific model adaptation and the potential of CPT with Trainium in optimizing model performance for real-world applications.
Dataset collection
We followed the methodology outlined in the PMC-Llama paper [6] to assemble our dataset, which includes PubMed papers sourced from the Semantic Scholar API and various medical texts cited within the paper, culminating in a comprehensive collection of 88 billion tokens. For further details on the dataset, the original paper offers in-depth information.
To prepare this dataset for training, we used the Llama 2 tokenizer within an AWS Glue pipeline for efficient processing. We then organized the data so that each row contained 4,096 tokens, adhering to recommendations from the Neuron Distributed tutorials.
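The following is a hedged sketch of that packing step shown as plain Python for clarity (the Arcee pipeline ran it inside AWS Glue): tokenize the documents with the Llama 2 tokenizer and pack the resulting token stream into fixed rows of 4,096 tokens.

```python
from transformers import AutoTokenizer

SEQ_LEN = 4096
# The Llama 2 tokenizer is gated on Hugging Face; access must be granted to your account.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def pack_documents(documents):
    """Yield lists of exactly SEQ_LEN token IDs from an iterable of raw text documents."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer(doc, add_special_tokens=True)["input_ids"])
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]
            buffer = buffer[SEQ_LEN:]

rows = list(pack_documents(["Example PubMed abstract text ...", "Another paper ..."]))
print(len(rows), "rows of", SEQ_LEN, "tokens each")
```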
Why Trainium?
Continual pre-training techniques like the ones described in this post require access to high-performance compute instances, which has become more difficult to get as more developers are using generative artificial intelligence (AI) and LLMs for their applications. Traditionally, these workloads have been deployed to GPUs; however, in recent years, the cost and availability of GPUs has stifled model building innovations. With the introduction of Trainium, we are able to unlock new techniques that enable us to continue model innovations that will allow us to build models more efficiently and most importantly, at lower costs. Trainium is the second-generation machine learning (ML) accelerator that AWS purpose built to help developers access high-performance model training accelerators to help lower training costs by up to 50% over comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. With Trainium available in AWS Regions worldwide, developers don’t have to take expensive, long-term compute reservations just to get access to clusters of GPUs to build their models. Trainium instances offer developers the performance they need with the elasticity they want to optimize both for training efficiency and lowering model building costs.
Setting up the Trainium cluster
We used AWS ParallelCluster to build a High Performance Computing (HPC) compute environment that uses Trn1 compute nodes to run our distributed ML training job (see the GitHub tutorial). You can also use developer flows like Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Ray, or others (to learn more, see Developer Flows). After the nodes were launched, we ran a training task to confirm that the nodes were working, and used Slurm commands to check the job status. In this part, we used the AWS pcluster command to run a .yaml file to generate the cluster. Our cluster consisted of 16 nodes, each a trn1n.32xlarge instance with 16 Trainium accelerators, each accelerator featuring 32 GB of high-bandwidth memory.
We set up our ParallelCluster infrastructure as shown in the following diagram (source).
As shown in the preceding figure, inside a VPC, there are two subnets, a public one and a private one. The head node resides in the public subnet, and the compute fleet (in this case, Trn1 instances) is in the private subnet. A NAT gateway is also needed in order for nodes in the private subnet to connect to clients outside the VPC. In the following section, we describe how to set up the necessary infrastructure for Trn1 ParallelCluster.
Set up the environment
To set up your environment, complete the following steps:
- Install the VPC and necessary components for ParallelCluster. For instructions, see VPC setup for ParallelCluster with Trn1.
- Create and launch ParallelCluster in the VPC. For instructions, see Create ParallelCluster.
Now you can launch a training job to submit a model training script as a Slurm job.
Deploy to Trainium
Trainium-based EC2 Trn1 instances use the AWS Neuron SDK and support common ML frameworks like PyTorch and TensorFlow. Neuron allows for effortless distributed training and has integrations with Megatron Nemo and Neuron Distributed.
When engaging with Trainium, it’s crucial to understand several key parameters:
- Tensor parallel size – This determines the level of tensor parallelization, particularly in self-attention computations within transformers, and is crucial for optimizing memory usage (not computational time efficiency) during model loading
- NeuronCores – Each Trainium device has two NeuronCores, and an eight-node setup equates to a substantial 256 cores
- Mini batch – This reflects the number of examples processed in each batch as determined by the data loader
- World size – This is the total count of nodes involved in the training operation
A deep understanding of these parameters is vital for anyone looking to harness the power of Trainium devices effectively.
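A short arithmetic sketch shows how these parameters combine. The node and core counts mirror the setup described in this post (16 trn1n.32xlarge nodes, 2 NeuronCores per Trainium device); the tensor parallel size, mini-batch size, and gradient accumulation values are illustrative placeholders.

```python
world_size = 16                    # nodes in the training job
devices_per_node = 16              # Trainium devices on a trn1n.32xlarge
neuron_cores = world_size * devices_per_node * 2   # 2 NeuronCores per device

tensor_parallel_size = 8           # cores cooperating on one model replica (illustrative)
mini_batch = 1                     # examples per data-loader batch (illustrative)
gradient_accumulation = 32         # illustrative

data_parallel_degree = neuron_cores // tensor_parallel_size
global_batch = mini_batch * gradient_accumulation * data_parallel_degree

dataset_rows = 88_000_000_000 // 4096     # ~88B tokens packed into 4,096-token rows
steps_per_epoch = dataset_rows // global_batch
print(data_parallel_degree, global_batch, steps_per_epoch)
```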
Train the model
For this post, we train a Llama 2 7B model with tensor parallelism. For a streamlined and effective training process, we adhered to the following steps:
- Download the Llama 2 full checkpoints (model weights and tokenizer) from Hugging Face.
- Convert these checkpoints to a format compatible with the Neuron Distributed setup, so they can be efficiently utilized in our training infrastructure.
- Determine the number of steps required per epoch, incorporating the effective batch size and dataset size to tailor the training process to our specific needs.
- Launch the training job, carefully monitoring its progress and performance.
- Periodically save training checkpoints. Initially, this process may be slow due to its synchronous nature, but improvements are anticipated as the NeuronX team works on enhancements.
- Finally, convert the saved checkpoints back to a standard format for subsequent use, employing scripts for seamless conversion.
For more details, you can find the full implementation of the training steps in the following GitHub repository.
Clean up
Don’t forget to tear down any resources you set up in this post.
Results
Our study focused on evaluating the quality of the CPT-enhanced checkpoints. We monitored the perplexity of a held-out PubMed dataset [6] across various checkpoints obtained during training, which provided valuable insights into the model’s performance improvements over time.
Through this journey, we’ve advanced our model’s capabilities, and hope to contribute to the broader community’s understanding of effective model adaptation strategies.
The following figure shows the perplexity of the baseline Llama 2 7B checkpoint vs. its CPT-enhanced checkpoint on the PMC test dataset. Based on these findings, continual pre-training on domain-specific raw data, specifically PubMed papers in our study, resulted in an enhancement of the Llama 2 7B checkpoint, leading to improved perplexity of the model on the PMC test set.
The following figure shows the perplexity of the CPT-enhanced checkpoints of the Llama 2 7B model across varying numbers of trained tokens. The increasing number of trained tokens correlated with enhanced model performance, as measured by the perplexity metric.
The following figure shows the perplexity comparison between the baseline Llama 2 7B model and its CPT-enhanced checkpoints, with and without data mixing. This underscores the significance of data mixing, where we have added 1% of general tokens to the domain-specific dataset, wherein utilizing a CPT-enhanced checkpoint with data mixing exhibited better performance compared to both the baseline Llama 2 7B model and the CPT-enhanced checkpoint solely trained on PubMed data.
Conclusion
Arcee’s innovative approach to CPT and model merging, as demonstrated through our collaboration with Trainium, signifies a transformative advancement in the training of LLMs, particularly in specialized domains such as medical research. By using the extensive capabilities of Trainium, we have not only accelerated the model training process, but also significantly reduced costs, with an emphasis on security and compliance that provides data integrity within a secure AWS environment.
The results from our training experiments, as seen in the improved perplexity scores of domain-specific models, underscore the effectiveness of our method in enhancing the performance and applicability of LLMs across various fields. This is particularly evident from the direct comparisons of time-to-train metrics between Trainium and traditional GPU setups, where Trainium’s efficiency and cost-effectiveness shine.
Furthermore, our case study using PubMed data for domain-specific training highlights the potential of Arcee’s CPT strategies to fine-tune models to the nuances of highly specialized datasets, thereby creating more accurate and reliable tools for professionals in those fields.
As we continue to push the boundaries of what’s possible in LLM training, we encourage researchers, developers, and enterprises to take advantage of the scalability, efficiency, and enhanced security features of Trainium and Arcee’s methodologies. These technologies not only facilitate more effective model training, but also open up new avenues for innovation and practical application in AI-driven industries.
The integration of Trainium’s advanced ML capabilities with Arcee’s pioneering strategies in model training and adaptation is poised to revolutionize the landscape of LLM development, making it more accessible, economical, and tailored to meet the evolving demands of diverse industries.
To learn more about Arcee.ai, visit Arcee.ai or reach out to our team.
Additional resources
- Arcee’s whitepaper: Case Study on How Arcee is Innovating Domain Adaptation, through Continual Pre-Training and Model Merging
- Arcee’s paper on Arxiv on model merging: Arcee’s MergeKit: A Toolkit for Merging Large Language Models
- Arcee’s Mergekit repository on GitHub
References
- Gupta, Kshitij, et al. “Continual Pre-Training of Large Language Models: How to (Re)warm Your Model?” arXiv preprint arXiv:2308.04014 (2023).
- Wu, Chaoyi, et al. “PMC-LLaMA: Towards building open-source language models for medicine.” arXiv preprint arXiv:2305.10415 (2023).
- Liu, Mingjie, et al. “ChipNeMo: Domain-adapted LLMs for chip design.” arXiv preprint arXiv:2311.00176 (2023).
- Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).
- AWS Trainium Trn1 instances: https://aws.amazon.com/ec2/instance-types/trn1/
- Wu, Chaoyi, et al. “PMC-LLaMA: Further fine-tuning LLaMA on medical papers.” arXiv preprint arXiv:2304.14454 (2023).
About the Authors
Mark McQuade is the CEO/Co-Founder at Arcee. Mark co-founded Arcee with a vision to empower enterprises with industry-specific AI solutions. This idea emerged from his time at Hugging Face, where he helped spearhead the Monetization team, collaborating with high-profile enterprises. This frontline experience exposed him to critical industry pain points: the reluctance to rely on closed source APIs and the challenges of training open source models without compromising data security.
Shamane Siri Ph.D. is the Head of Applied NLP Research at Arcee. Before joining Arcee, Shamane worked in both industry and academia, developing recommendation systems using language models to address the cold start problem, and focusing on information retrieval, multi-modal emotion recognition, and summarization. Shamane has also collaborated with the Hugging Face Transformers crew and Meta Reality Labs on cutting-edge projects. He holds a PhD from the University of Auckland, where he specialized in domain adaptation of foundational language models.
Malikeh Ehghaghi is an Applied NLP Research Engineer at Arcee. Malikeh’s research interests are NLP, domain-adaptation of LLMs, ML for healthcare, and responsible AI. She earned an MScAC degree in Computer Science from the University of Toronto. She previously collaborated with Lavita AI as a Machine Learning Consultant, developing healthcare chatbots in partnership with Dartmouth Center for Precision Health and Artificial Intelligence. She also worked as a Machine Learning Research Scientist at Cambridge Cognition Inc. and Winterlight Labs, with a focus on monitoring and detection of mental health disorders through speech and language. Malikeh has authored several publications presented at top-tier conferences such as ACL, COLING, AAAI, NAACL, IEEE-BHI, and MICCAI.
Databricks DBRX is now available in Amazon SageMaker JumpStart
Today, we are excited to announce that the DBRX model, an open, general-purpose large language model (LLM) developed by Databricks, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. The DBRX LLM employs a fine-grained mixture-of-experts (MoE) architecture, pre-trained on 12 trillion tokens of carefully curated data, and supports a maximum context length of 32,000 tokens.
You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you can quickly get started with ML. In this post, we walk through how to discover and deploy the DBRX model.
What is the DBRX model
DBRX is a sophisticated decoder-only LLM built on a transformer architecture. It employs a fine-grained MoE architecture, incorporating 132 billion total parameters, with 36 billion of these parameters being active for any given input.
The model underwent pre-training using a dataset consisting of 12 trillion tokens of text and code. In contrast to other open MoE models like Mixtral and Grok-1, DBRX takes a fine-grained approach, using a larger number of smaller experts for optimized performance: DBRX has 16 experts and chooses 4 per input, whereas those models have 8 experts and choose 2.
The model is made available under the Databricks Open Model license, for use without restrictions.
What is SageMaker JumpStart
SageMaker JumpStart is a fully managed platform that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly and with ease, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as DBRX, for a variety of tasks.
You can now discover and deploy DBRX models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to take advantage of model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.
Discover models in SageMaker JumpStart
You can access the DBRX model through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.
In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.
From the SageMaker JumpStart landing page, you can search for “DBRX” in the search box. The search results will list DBRX Instruct and DBRX Base.
You can choose the model card to view details about the model such as license, data used to train, and how to use the model. You will also find the Deploy button to deploy the model and create an endpoint.
Deploy the model in SageMaker JumpStart
Deployment starts when you choose the Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.
DBRX Base
To deploy using the SDK, we start by selecting the DBRX Base model, specified by the model_id with value huggingface-llm-dbrx-base. You can deploy any of the selected models on SageMaker with the following code. Similarly, you can deploy DBRX Instruct using its own model ID.
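The following is a minimal sketch of that deployment using the SageMaker Python SDK, assuming default configurations; accept_eula acknowledges the model license.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Select the DBRX Base model from the SageMaker JumpStart model hub
model = JumpStartModel(model_id="huggingface-llm-dbrx-base")

# accept_eula=True acknowledges the model's end-user license agreement
predictor = model.deploy(accept_eula=True)
```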
This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The accept_eula value must be explicitly set to True in order to accept the end-user license agreement (EULA). Also make sure you have the account-level service quota to use one or more ml.p4d.24xlarge or ml.p4de.24xlarge instances for endpoint usage. You can follow these instructions to request a service quota increase.
After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
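A request might look like the following sketch; the payload schema (an inputs string plus a parameters dictionary) follows the common Hugging Face text generation convention and is an assumption here, as are the parameter values.

```python
payload = {
    "inputs": "Hello, my name is",
    "parameters": {"max_new_tokens": 64, "temperature": 0.6, "top_p": 0.9},
}

# Run inference against the deployed endpoint
response = predictor.predict(payload)
print(response)
```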
Example prompts
You can interact with the DBRX Base model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide some example prompts and sample output.
Code generation
Using the preceding example, we can use code generation prompts as follows:
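The prompt below is an illustrative stand-in rather than the exact prompt from the original example; for the base (non-instruct) model, a partially written function is enough to trigger a code completion.

```python
payload = {
    # A partially written function that the model is expected to complete
    "inputs": 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n',
    "parameters": {"max_new_tokens": 128, "temperature": 0.2},
}
response = predictor.predict(payload)
```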
The following is the output:
Sentiment analysis
You can perform sentiment analysis using a prompt like the following with DBRX:
The following is the output:
Question answering
You can use a question answering prompt like the following with DBRX:
The following is the output:
DBRX Instruct
The instruction-tuned version of DBRX accepts formatted instructions where the conversation must start with a prompt from the user and alternate between user instructions and assistant (DBRX Instruct) responses. The instruction format must be strictly respected; otherwise, the model will generate suboptimal outputs. The template to build a prompt for the Instruct model uses the following special tokens:
<|im_start|> and <|im_end|> are special tokens for beginning of string (BOS) and end of string (EOS). The model can contain multiple conversation turns between system, user, and assistant, allowing for the incorporation of few-shot examples to enhance the model’s responses.
The following code shows how you can format the prompt in instruction format:
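A sketch of that formatting logic follows; the helper name and the example conversation are hypothetical, but the <|im_start|>/<|im_end|> structure matches the template described above.

```python
def format_instruct_prompt(messages):
    """Build a DBRX Instruct prompt from a list of {role, content} turns."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # The trailing assistant header cues the model to generate its reply
    prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a fine-grained mixture-of-experts architecture is."},
]
payload = {
    "inputs": format_instruct_prompt(messages),
    "parameters": {"max_new_tokens": 256},
}
```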
Knowledge retrieval
You can use the following prompt for knowledge retrieval:
The following is the output:
Code generation
DBRX models demonstrate benchmarked strengths for coding tasks. For example, see the following code:
The following is the output:
Mathematics and reasoning
The DBRX models also report strengths in mathematical accuracy. For example, see the following code:
DBRX can work through the math logic, as shown in the following output:
Clean up
After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:
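A minimal cleanup sketch, assuming the predictor object from the deployment step is still in scope:

```python
# Delete the model and endpoint so you stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()
```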
Conclusion
In this post, we showed you how to get started with DBRX in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
Resources
- SageMaker JumpStart documentation
- SageMaker JumpStart foundation models documentation
- SageMaker JumpStart product detail page
- SageMaker JumpStart model catalog
About the Authors
Shikhar Kwatra is an AI/ML Specialist Solutions Architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 400 patents in the AI/ML and IoT domains. He has over 8 years of industry experience from startups to large-scale enterprises, in roles ranging from IoT Research Engineer and Data Scientist to Data & AI Architect. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for organizations and supports GSI partners in building strategic industry solutions on AWS.
Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.
Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.
Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.
Knowledge Bases in Amazon Bedrock now simplifies asking questions on a single document
At AWS re:Invent 2023, we announced the general availability of Knowledge Bases for Amazon Bedrock. With Knowledge Bases for Amazon Bedrock, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for fully managed Retrieval Augmented Generation (RAG).
In previous posts, we covered new capabilities like hybrid search support, metadata filtering to improve retrieval accuracy, and how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.
Today, we’re introducing the new capability to chat with your document with zero setup in Knowledge Bases for Amazon Bedrock. With this new capability, you can securely ask questions on single documents, without the overhead of setting up a vector database or ingesting data, making it effortless for businesses to use their enterprise data. You only need to provide a relevant data file as input and choose your FM to get started.
But before we jump into the details of this feature, let’s start with the basics and understand what RAG is, its benefits, and how this new capability enables content retrieval and generation for temporal needs.
What is Retrieval Augmented Generation?
FM-powered artificial intelligence (AI) assistants have limitations, such as providing outdated information or struggling with context outside their training data. RAG addresses these issues by allowing FMs to cross-reference authoritative knowledge sources before generating responses.
With RAG, when a user asks a question, the system retrieves relevant context from a curated knowledge base, such as company documentation. It provides this context to the FM, which uses it to generate a more informed and precise response. RAG helps overcome FM limitations by augmenting its capabilities with an organization’s proprietary knowledge, enabling chatbots and AI assistants to provide up-to-date, context-specific information tailored to business needs without retraining the entire FM. At AWS, we recognize RAG’s potential and have worked to simplify its adoption through Knowledge Bases for Amazon Bedrock, providing a fully managed RAG experience.
Short-term and instant information needs
Although a knowledge base does all the heavy lifting and serves as a persistent large store of enterprise knowledge, you might require temporary access to data for specific tasks or analysis within isolated user sessions. Traditional RAG approaches are not optimized for these short-term, session-based data access scenarios.
Businesses incur charges for data storage and management. This may make RAG less cost-effective for organizations with highly dynamic or ephemeral information requirements, especially when data is only needed for specific, isolated tasks or analyses.
Ask questions on a single document with zero setup
This new capability to chat with your document within Knowledge Bases for Amazon Bedrock addresses the aforementioned challenges. It provides a zero-setup method to use your single document for content retrieval and generation-related tasks, along with the FMs provided by Amazon Bedrock. With this new capability, you can ask questions of your data without the overhead of setting up a vector database or ingesting data, making it effortless to use your enterprise data.
You can now interact with your documents in real time without prior data ingestion or database configuration. You don’t need to take any further data readiness steps before querying the data.
This zero-setup approach makes it straightforward to use your enterprise information assets with generative AI using Amazon Bedrock.
Use cases and benefits
Consider a recruiting firm that needs to analyze resumes and match candidates with suitable job opportunities based on their experience and skills. Previously, you would have to set up a knowledge base, invoking a data ingestion workflow to make sure only authorized recruiters can access the data. Additionally, you would need to manage cleanup when the data was no longer required for a session or candidate. In the end, you would pay more for the vector database storage and management than for the actual FM usage. This new feature in Knowledge Bases for Amazon Bedrock enables recruiters to quickly and ephemerally analyze resumes and match candidates with suitable job opportunities based on the candidate’s experience and skill set.
For another example, consider a product manager at a technology company who needs to quickly analyze customer feedback and support tickets to identify common issues and areas for improvement. With this new capability, you can simply upload a document to extract insights in no time. For example, you could ask “What are the requirements for the mobile app?” or “What are the common pain points mentioned by customers regarding our onboarding process?” This feature empowers you to rapidly synthesize this information without the hassle of data preparation or any management overhead. You can also request summaries or key takeaways, such as “What are the highlights from this requirements document?”
The benefits of this feature extend beyond cost savings and operational efficiency. By eliminating the need for vector databases and data ingestion, this new capability within Knowledge Bases for Amazon Bedrock helps secure your proprietary data, making it accessible only within the context of isolated user sessions.
Now that we’ve covered the feature benefits and the use cases it enables, let’s dive into how you can start using this new feature from Knowledge Bases for Amazon Bedrock.
Chat with your document in Knowledge Bases for Amazon Bedrock
You have multiple options to begin using this feature:
- The Amazon Bedrock console
- The Amazon Bedrock RetrieveAndGenerate API (SDK)
Let’s see how we can get started using the Amazon Bedrock console:
- On the Amazon Bedrock console, under Orchestration in the navigation pane, choose Knowledge bases.
- Choose Chat with your document.
- Under Model, choose Select model.
- Choose your model. For this example, we use the Claude 3 Sonnet model (only Claude 3 Sonnet is supported at the time of launch).
- Choose Apply.
- Under Data, you can upload the document you want to chat with or point to the Amazon Simple Storage Service (Amazon S3) bucket location that contains your file. For this post, we upload a document from our computer.
The supported file formats are PDF, MD (Markdown), TXT, DOCX, HTML, CSV, XLS, and XLSX. Make sure that the file size does not exceed 10 MB and that the file contains no more than 20,000 tokens. A token is a unit of text, such as a word, sub-word, number, or symbol, that is processed as a single entity. Because of the preset ingestion token limit, we recommend using a file under 10 MB. However, a text-heavy file that is much smaller than 10 MB can still exceed the token limit.
You’re now ready to chat with your document.
As shown in the following screenshot, you can chat with your document in real time.
To customize your prompt, enter your prompt under System prompt.
Similarly, you can use the AWS SDK through the retrieve_and_generate API in major coding languages. In the following example, we use the AWS SDK for Python (Boto3):
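The following sketch shows one way the call could look; the Region, model ARN, S3 document location, and question are placeholder assumptions.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholder values: substitute your own model ARN and document location
model_arn = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
document_uri = "s3://amzn-s3-demo-bucket/resume.pdf"

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What are the candidate's key skills?"},
    retrieveAndGenerateConfiguration={
        "type": "EXTERNAL_SOURCES",
        "externalSourcesConfiguration": {
            "modelArn": model_arn,
            "sources": [
                {"sourceType": "S3", "s3Location": {"uri": document_uri}},
            ],
        },
    },
)

print(response["output"]["text"])
```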
Conclusion
In this post, we covered how Knowledge Bases for Amazon Bedrock now simplifies asking questions on a single document. We explored the core concepts behind RAG, the challenges this new feature addresses, and the various use cases it enables across different roles and industries. We also demonstrated how to configure and use this capability through the Amazon Bedrock console and the AWS SDK, showcasing the simplicity and flexibility of this feature, which provides a zero-setup solution to gather information from a single document, without setting up a vector database.
To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:
- Knowledge bases for Amazon Bedrock
- Getting started with Amazon Bedrock, RAG, and Vector database in Python
- Vector Embeddings and RAG Demystified: Leveraging Amazon Bedrock, Aurora, and LangChain (Part 1 and Part 2)
Share and learn with our generative AI community at community.aws.
About the authors
Suman Debnath is a Principal Developer Advocate for Machine Learning at Amazon Web Services. He frequently speaks at AI/ML conferences, events, and meetups around the world. He is passionate about large-scale distributed systems and is an avid fan of Python.
Sebastian Munera is a Software Engineer in the Amazon Bedrock Knowledge Bases team at AWS where he focuses on building customer solutions that leverage Generative AI and RAG applications. He has previously worked on building Generative AI-based solutions for customers to streamline their processes and Low code/No code applications. In his spare time he enjoys running, lifting and tinkering with technology.
Amazon Research Awards recipients announced
Awardees, who represent 51 universities in 15 countries, have access to Amazon public datasets, along with AWS AI/ML services and tools.
Deploy a Hugging Face (PyAnnote) speaker diarization model on Amazon SageMaker as an asynchronous endpoint
Speaker diarization, an essential process in audio analysis, segments an audio file based on speaker identity. This post delves into integrating Hugging Face’s PyAnnote for speaker diarization with Amazon SageMaker asynchronous endpoints.
We provide a comprehensive guide on how to deploy speaker segmentation and clustering solutions using SageMaker on the AWS Cloud. You can use this solution for applications dealing with multi-speaker (over 100) audio recordings.
Solution overview
Amazon Transcribe is the go-to service for speaker diarization in AWS. However, for non-supported languages, you can use other models (in our case, PyAnnote) and deploy them in SageMaker for inference. For short audio files where inference takes up to 60 seconds, you can use real-time inference. For inference that takes longer than 60 seconds, asynchronous inference should be used. An added benefit of asynchronous inference is the cost savings from auto scaling the instance count to zero when there are no requests to process.
Hugging Face is a popular open source hub for machine learning (ML) models. AWS and Hugging Face have a partnership that allows a seamless integration through SageMaker with a set of AWS Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors for the SageMaker Python SDK. SageMaker features and capabilities help developers and data scientists get started with natural language processing (NLP) on AWS with ease.
The integration for this solution involves using Hugging Face’s pre-trained speaker diarization model using the PyAnnote library. PyAnnote is an open source toolkit written in Python for speaker diarization. This model, trained on the sample audio dataset, enables effective speaker partitioning in audio files. The model is deployed on SageMaker as an asynchronous endpoint setup, providing efficient and scalable processing of diarization tasks.
The following diagram illustrates the solution architecture.
For this post, we use the following audio file.
Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels. Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
Prerequisites
Complete the following prerequisites:
- Create a SageMaker domain.
- Make sure your AWS Identity and Access Management (IAM) user has the necessary access permissions for creating a SageMaker role.
- Make sure the AWS account has a service quota for hosting a SageMaker endpoint for an ml.g5.2xlarge instance.
Create a model function for accessing PyAnnote speaker diarization from Hugging Face
You can use the Hugging Face Hub to access the desired pre-trained PyAnnote speaker diarization model. You use the same script for downloading the model file when creating the SageMaker endpoint.
See the following code:
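A sketch of that download step is shown below; the model ID and the Hugging Face access token are assumptions (gated PyAnnote models require a token).

```python
from pyannote.audio import Pipeline

HF_TOKEN = "<your-hugging-face-access-token>"  # hypothetical placeholder

# Download the pre-trained speaker diarization pipeline from the Hugging Face Hub
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,
)
```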
Prepare essential files like inference.py, which contains the inference code:
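The following is a minimal sketch of what inference.py could contain, using the standard SageMaker handler functions; the request schema (an audio_path field) and the artifact layout are assumptions.

```python
# inference.py
import json

import torch
from pyannote.audio import Pipeline


def model_fn(model_dir):
    """Load the diarization pipeline when the endpoint container starts."""
    pipeline = Pipeline.from_pretrained(f"{model_dir}/config.yaml")
    if torch.cuda.is_available():
        pipeline.to(torch.device("cuda"))
    return pipeline


def input_fn(request_body, request_content_type):
    """Parse the incoming JSON request."""
    return json.loads(request_body)


def predict_fn(data, pipeline):
    """Run diarization and return speaker turns as start/end/speaker records."""
    diarization = pipeline(data["audio_path"])
    return [
        {"start": round(turn.start, 2), "end": round(turn.end, 2), "speaker": speaker}
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


def output_fn(prediction, response_content_type):
    """Serialize the prediction as JSON."""
    return json.dumps(prediction)
```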
Prepare a requirements.txt file, which contains the Python libraries required to run the inference:
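An illustrative requirements.txt might look like the following; the packages and pinned version are assumptions and should be aligned with your container’s PyTorch version.

```
pyannote.audio==3.1.1
torchaudio
soundfile
```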
Lastly, compress the inference.py and requirements.txt files and save them as model.tar.gz:
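One way to package the files is with Python’s tarfile module, as in this sketch; placing the scripts under a code/ prefix follows the Hugging Face inference container convention and is an assumption about your container choice.

```python
import tarfile

# Bundle the inference script and dependencies into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("inference.py", arcname="code/inference.py")
    tar.add("requirements.txt", arcname="code/requirements.txt")
```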
Configure a SageMaker model
Define a SageMaker model resource by specifying the image URI, model data location in Amazon Simple Storage Service (S3), and SageMaker role:
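A sketch of that configuration follows; the container image URI is a placeholder to be resolved for your Region and framework versions, and the model_data URI assumes the upload step shown in the next section.

```python
import sagemaker
from sagemaker.model import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder: resolve a Hugging Face/PyTorch inference DLC URI for your Region
image_uri = "<inference-container-image-uri>"

# Assumed S3 location of the packaged artifact (see the upload step that follows)
model_data = f"s3://{sess.default_bucket()}/pyannote/model.tar.gz"

pyannote_model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
    name="pyannote-speaker-diarization",
)
```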
Upload the model to Amazon S3
Upload the zipped PyAnnote Hugging Face model file to an S3 bucket:
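For example, using the SageMaker session helper (the bucket and prefix are assumptions that should match the model_data URI used above):

```python
# Upload the packaged model artifact to Amazon S3
s3_model_uri = sess.upload_data(
    path="model.tar.gz",
    bucket=sess.default_bucket(),
    key_prefix="pyannote",
)
print(s3_model_uri)
```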
Create a SageMaker asynchronous endpoint
Configure an asynchronous endpoint for deploying the model on SageMaker using the provided asynchronous inference configuration:
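A minimal sketch, assuming the model object defined earlier; the output path, concurrency value, and endpoint name are illustrative.

```python
from sagemaker.async_inference import AsyncInferenceConfig

# Results of asynchronous invocations are written to this S3 output path
async_config = AsyncInferenceConfig(
    output_path=f"s3://{sess.default_bucket()}/async_inference/output",
    max_concurrent_invocations_per_instance=4,
)

pyannote_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    async_inference_config=async_config,
    endpoint_name="pyannote-diarization-async",
)
```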
Test the endpoint
Evaluate the endpoint functionality by sending an audio file for diarization and retrieving the JSON output stored in the specified S3 output path:
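The following sketch invokes the endpoint with the low-level runtime client; the input request (a JSON file referencing the audio location) is assumed to already be staged in S3, and the bucket name is a placeholder.

```python
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

response = sm_runtime.invoke_endpoint_async(
    EndpointName="pyannote-diarization-async",
    InputLocation="s3://<your-bucket>/async_inference/input/request.json",
    ContentType="application/json",
)

# The diarization result is written to this S3 location when processing completes
print(response["OutputLocation"])
```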
To deploy this solution at scale, we suggest using AWS Lambda, Amazon Simple Notification Service (Amazon SNS), or Amazon Simple Queue Service (Amazon SQS). These services are designed for scalability, event-driven architectures, and efficient resource utilization. They can help decouple the asynchronous inference process from the result processing, allowing you to scale each component independently and handle bursts of inference requests more effectively.
Results
Model output is stored at s3://sagemaker-xxxx/async_inference/output/.
The output shows that the audio recording has been segmented into three columns:
- Start (start time in seconds)
- End (end time in seconds)
- Speaker (speaker label)
The following code shows an example of our results:
Clean up
You can set a scaling policy to zero by setting MinCapacity to 0; asynchronous inference lets you auto scale to zero when there are no requests. You don’t need to delete the endpoint; it scales back up from zero when needed again, reducing costs when it’s not in use. See the following code:
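A sketch of registering that scaling target with Application Auto Scaling follows; the endpoint name, variant name, and MaxCapacity are assumptions, and in practice a backlog-based scaling policy is also needed so the endpoint scales back up from zero.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/pyannote-diarization-async/variant/AllTraffic"

# Allow the endpoint to scale in to zero instances when there is no work queued
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)
```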