Amazon AWS – Page 63

Empowering everyone with GenAI to rapidly build, customize, and deploy apps securely: Highlights from the AWS New York Summit

July 10, 2024

by Swami Sivasubramanian Amazon AWS

Imagine this—all employees relying on generative artificial intelligence (AI) to get their work done faster, every task becoming less mundane and more innovative, and every application providing a more useful, personal, and engaging experience. To realize this future, organizations need more than a single, powerful large language model (LLM) or chat assistant. They need a full range of capabilities to build and scale generative AI applications that are tailored to their business and use case —including apps with built-in generative AI, tools to rapidly experiment and build their own generative AI apps, a cost-effective and performant infrastructure, and security controls and guardrails. That’s why we are investing in a comprehensive generative AI stack. At the top layer, which includes generative AI-powered applications, we have Amazon Q, the most capable generative AI-powered assistant. The middle layer has Amazon Bedrock, which provides tools to easily and rapidly build, deploy, and scale generative AI applications leveraging LLMs and other foundation models (FMs). And at the bottom, there’s our resilient, cost-effective infrastructure layer, which includes chips purpose-built for AI, as well as Amazon SageMaker to build and run FMs. All of these services are secure by design, and we keep adding features that are critical to deploying generative AI applications tailored to your business. During the last 18 months, we’ve launched more than twice as many machine learning (ML) and generative AI features into general availability than the other major cloud providers combined. That’s another reason why hundreds of thousands of customers are now using our AI services.

Today at the AWS New York Summit, we announced a wide range of capabilities for customers to tailor generative AI to their needs and realize the benefits of generative AI faster. We’re enabling anyone to build generative AI applications with Amazon Q Apps by writing a simple natural language prompt—in seconds. We’re making it easier to leverage your data, supercharge agents, and quickly, securely, and responsibly deploy generative AI into production with new features in Amazon Bedrock. And we announced new partnerships with innovators like Scale AI to help you customize your applications quickly and easily.

Generative AI-powered apps transform business as usual

Generative AI democratizes information, gives more people the ability to create and innovate, and provides access to productivity-enhancing assistance that was never available before. That’s why we’re building generative AI-powered applications for everyone.

Amazon Q, which includes Amazon Q Developer and Amazon Q Business, is the most capable generative AI-powered assistant for software development and helping employees make better decisions—faster—leveraging their company’s data. Not only does Amazon Q generate the industry’s most accurate coding suggestions, it can also autonomously perform multistep tasks like upgrading Java applications and generating and implementing new features. Amazon Q is where developers need it on the AWS Management Console and in popular integrated development environments, including IntelliJ IDEA, Visual Studio, VS Code, and Amazon SageMaker Studio. You can securely customize Amazon Q Developer with your internal code base to get more relevant and useful recommendations for in-line coding and save even more time. For instance, National Australia Bank has seen increased acceptance rates of 60%, up from 50% and Amazon Prime developers have already seen a 30% increase in acceptance rates. Amazon Q can also help employees do more with the vast troves of data and information contained in their company’s documents, systems, and applications by answering questions, providing summaries, generating business intelligence (BI) dashboards and reports, and even generating applications that automate key tasks. We’re super excited about the productivity gains customers and partners have seen, with early signals that Amazon Q could help their employees become over 80% more productive at their jobs.

To enable all employees to create their own generative AI applications to automate tasks, today we announced the general availability of Amazon Q Apps, a feature of Amazon Q Business. With Amazon Q Apps employees can go from conversation to generative AI-powered app based on their company data in seconds. Users simply describe the application they want in a prompt and Amazon Q instantly generates it. Amazon Q also gives employees the option to generate an app from an existing conversation with a single click. During preview, we saw users generate applications for diverse tasks, including summarizing feedback, creating onboarding plans, writing copy, drafting memos, and many more. For instance, Druva, a data security provider, created an Amazon Q App to support their request for proposal (RFP) process by summarizing the required information almost instantly, reducing RFP response times by up to 25%.

In addition to Amazon Q Apps, which makes it easy for any employee to automate their individual tasks, today we announced AWS App Studio (preview), a generative AI-powered service that enables technical professionals such as IT project managers, data engineers, and enterprise architects to use natural language to create, deploy, and manage enterprise applications across an organization. With App Studio, a user simply describes the application they want, what they want it to do, and the data sources they want to integrate with, and App Studio builds an application in minutes that could have taken a professional developer days to build a similar application from scratch. App Studio’s generative AI-powered assistant eliminates the learning curve of typical low-code tools, accelerating the application creation process and simplifying common tasks like designing the UI, building workflows, and testing the application. Each application can be immediately scaled to thousands of users and is secure and fully managed by AWS, eliminating the need for any operational expertise.

New features and capabilities supercharge Amazon Bedrock—speeding development of generative AI apps

Amazon Bedrock is the fastest and easiest way to build and scale secure generative AI applications with the broadest selection of leading LLMs and FMs as well as easy-to-use capabilities for developers. Tens of thousands of customers are already using Amazon Bedrock, and it’s one of AWS’s fastest growing services over the last decade. For example, Ferrari is rapidly introducing new experiences for customers, dealers, and internal teams to run faster simulations, create new knowledge bases that assist dealers and technical users, enhance the racing fan experience, and create hyper-personalized vehicle recommendations for customers from the millions of options offered by Ferrari in seconds.

Since the start of 2024, we have announced the general availability of more features and capabilities in Amazon Bedrock than comparable services from other leading cloud providers to help customers get generative AI apps from proof of concept to production faster. This includes support for new industry-leading models from Anthropic, Meta, Mistral, and more, as well as the recent addition of Anthropic Claude 3.5 Sonnet, their most advanced model to date, which was made available immediately for Amazon Bedrock customers. Thousands of customers have already used Anthropic’s Claude 3.5 since its release.

Today, we announced some major new Amazon Bedrock innovations that enable you to:

Customize generative AI applications with your data. You can customize generative AI applications with your data to make them specific to your use case, your organization, and your industry:

Fine tune Anthropic’s Claude 3 Haiku in Amazon Bedrock – With Amazon Bedrock, you can privately and securely fine tune Amazon Titan, Cohere Command and Command Lite, and Meta Llama 2 models by providing labeled data in Amazon Simple Storage Service (Amazon S3) to specialize the model for your business and use case. Starting today, Amazon Bedrock is also the only fully managed service that provides you with the ability to fine tune Anthropic’s Claude 3 Haiku (in preview). Read more in the News Blog.
Leverage even more data sources for Retrieval Augmented Generation (RAG) – With RAG, you can provide a model with new knowledge or up-to-date info from multiple sources, including document repositories, databases, and APIs. For example, the model might use RAG to retrieve search results from Amazon OpenSearch Service or documents from Amazon S3. Knowledge Bases for Amazon Bedrock fully manages this experience by connecting to your private data sources, including Amazon Aurora, Amazon OpenSearch Serverless, MongoDB, Pinecone, and Redis Enterprise Cloud. Today, we’ve expanded the list to include connectors for Salesforce, Confluence, and SharePoint (in preview), so organizations can leverage more business data to customize models for their specific needs. More knowledge base updates can be found in the News Blog.
Get the fastest vector search available – To further enhance your RAG workflows, we’ve added vector search to some of our most popular data services, including OpenSearch Service and OpenSearch Serverless, Aurora, Amazon Relational Database Service (Amazon RDS), and more. Customers can co-locate vector data with operational data, reducing the overhead of managing another database. Today, we’re also excited to announce the general availability of vector search for Amazon MemoryDB. Amazon MemoryDB delivers the fastest vector search performance at the highest recall rates among popular vector databases on AWS, making it a great fit for use cases that require single-digit millisecond latency. For example, Amazon Advertising, IBISWorld, Mediaset, and other organizations are using it to deliver real-time semantic search, and Broadridge Financial is running RAG while delivering the same real-time response rates that their customers are accustomed to. You can use MemoryDB vector search standalone today, and soon, you’ll be able to access it through Knowledge Bases for Amazon Bedrock. Read more about MemoryDB in the News Blog.

Create more advanced, personalized customer experiences. With Agents for Amazon Bedrock, applications can take action, executing multistep tasks using company systems and data sources, making generative AI applications substantially more useful. Today, we’re adding key capabilities to Agents for Amazon Bedrock. Previously, agents were limited to taking action based on information from within a single session. Now agents can retain memory across multiple interactions to remember where you last left off and provide better recommendations based on prior interactions. For instance, in a flight booking application, a developer can create an agent that can remember the last time you traveled or that you opt for a vegetarian meal. Agents can also now interpret code to tackle complex data-driven use cases, such as data analysis, data visualization, text processing, solving equations, and optimization problems. For instance, an application user can ask to analyze the historical real estate prices across various zip codes to identify investment opportunities. Check out the News Blogs for more on these capabilities.

De-risk generative AI with Guardrails for Amazon Bedrock. Customers are concerned about hallucinations, where LLMs generate incorrect responses by conflating multiple pieces of information, providing incorrect information, or inventing new information. These results can misinform employees and customers and harm brands, limiting the usefulness of generative AI. Today, we’re adding contextual grounding checks in Guardrails for Amazon Bedrock to detect hallucinations in model responses for applications using RAG and summarization applications. Contextual grounding checks add to the industry-leading safety protection in Guardrails for Amazon Bedrock to make sure the LLM response is based on the right enterprise source data and evaluates the LLM response to confirm that it’s relevant to the user’s query or instruction. Contextual grounding checks can detect and filter over 75% hallucinated responses for RAG and summarization workloads. Read more about our commitments to responsible AI on the AWS Machine Learning Blog.

We’re excited to see how our customers leverage these ever-expanding capabilities of Amazon Bedrock to customize their generative AI applications for vertical industries and business functions. For example, Deloitte is using Amazon Bedrock’s advanced customization capabilities to build their C-Suite AI solution, designed specifically for CFOs. It leverages Deloitte’s proprietary data and industry depth across the finance function. C-Suite AI provides customized AI models tailored to the needs of CFOs, with applications that span critical finance areas, generative analytics for data-driven insights, contract intelligence, and investor relations support.

New partners and trainings help customers along the AI journey

Our extensive partner network helps our customers along the journey to realizing the potential of generative AI. For example, BrainBox AI—which worked with our generative AI competency partner, Caylent—developed its AI assistant ARIA on AWS to help reduce energy costs and emissions in buildings. We have been building out our partner network and training offerings to help customers move quickly from experiment to broad usage. Our AWS Generative AI Competency Partner Program is designed to identify, validate, and promote AWS Partners with demonstrated AWS technical expertise and proven customer success. Today 19 new partners joined the program, giving customers access to 60 Generative AI Competency Partners across the globe. New partners include C3.ai, Cognizant, IBM, and LG CNS, and we have significantly expanded customer offerings into Korea, Greater China, and LATAM, and Saudi Arabia.

We’re also announcing a new partnership with Scale AI, our first model customization and evaluation partner. Through this collaboration, enterprise and public sector organizations can use Scale GenAI Platform and Scale Donovan to evaluate their generative AI applications and further customize, configure, and fine tune models to ensure trust and high performance in production, all built on Amazon Bedrock. Scale AI upholds the highest standards of privacy and regulatory compliance working with some of the most stringent government customers, such as the US Department of Defense. Customers can access Scale AI through an engagement with the AWS Generative AI Innovation Center, a program offered by AWS that pairs you with AWS science and strategy experts, or through the AWS Marketplace.

To help upskill your workforce, we’re making a new interactive online learning experience available, AWS SimuLearn, that pairs generative AI-powered simulations and hands-on training, to help people learn how to translate business problems into technical solutions. This is part of our broader commitment to provide free cloud computing skills training to 29 million people worldwide by 2025. Today, we announced that we surpassed this milestone, more than a year ahead of schedule.

We’re giving customers tools that put the power of generative AI into all employees’ hands, providing more ways to create personalized and relevant generative AI-powered applications, and working on the tough problems like reducing hallucinations so more companies can gain benefits from generative AI. We’re energized by the progress our customers have already made in making generative AI a reality for their organizations and will continue to innovate on their behalf. To watch the New York Summit keynote for an in-depth look at these announcements, visit our AWS New York Summit page or learn more about our generative AI services.

About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, and visualize, and predict.

A progress update on our commitment to safe, responsible generative AI

July 10, 2024

by Vasi Philomin Amazon AWS

Responsible AI is a longstanding commitment at Amazon. From the outset, we have prioritized responsible AI innovation by embedding safety, fairness, robustness, security, and privacy into our development processes and educating our employees. We strive to make our customers’ lives better while also establishing and implementing the necessary safeguards to help protect them. Our practical approach to transform responsible AI from theory into practice, coupled with tools and expertise, enables AWS customers to implement responsible AI practices effectively within their organizations. To date, we have developed over 70 internal and external offerings, tools, and mechanisms that support responsible AI, published or funded over 500 research papers, studies, and scientific blogs on responsible AI, and delivered tens of thousands of hours of responsible AI training to our Amazon employees. Amazon also continues to expand its portfolio of free responsible AI training courses for people of all ages, backgrounds, and levels of experience.

Today, we are sharing a progress update on our responsible AI efforts, including the introduction of new tools, partnerships, and testing that improve the safety, security, and transparency of our AI services and models.

Launched new tools and capabilities to build and scale generative AI safely, supported by adversarial style testing (i.e., red teaming)

In April 2024, we announced the general availability of Guardrails for Amazon Bedrock and Model Evaluation in Amazon Bedrock to make it easier to introduce safeguards, prevent harmful content, and evaluate models against key safety and accuracy criteria. Guardrails is the only solution offered by a major cloud provider that enables customers to build and customize safety and privacy protections for their generative AI applications in a single solution. It helps customers block up to 85% of harmful content on top of the native protection from FMs on Amazon Bedrock.

In May, we published a new AI Service Card for Amazon Titan Text Premier to further support our investments in responsible, transparent generative AI. AI Service Cards are a form of responsible AI documentation that provide customers with a single place to find information on the intended use cases and limitations, responsible AI design choices, and deployment and performance optimization best practices for our AI services and models. We’ve created more than 10 AI Service Cards thus far to deliver transparency for our customers as part of our comprehensive development process that addresses fairness, explainability, veracity and robustness, governance, transparency, privacy and security, safety, and controllability.

AI systems can also have performance flaws and vulnerabilities that can increase risk around security threats or harmful content. At Amazon, we test our AI systems and models, such as Amazon Titan, using a variety of techniques, including manual red-teaming. Red-teaming engages human testers to probe an AI system for flaws in an adversarial style, and complements our other testing techniques, which include automated benchmarking against publicly available and proprietary datasets, human evaluation of completions against proprietary datasets, and more. For example, we have developed proprietary evaluation datasets of challenging prompts that we use to assess development progress on Titan Text. We test against multiple use cases, prompts, and data sets because it is unlikely that a single evaluation dataset can provide an absolute picture of performance. Altogether, Titan Text has gone through multiple iterations of red-teaming on issues including safety, security, privacy, veracity, and fairness.

Introduced watermarking to enable users to determine if visual content is AI-generated

A common use case for generative AI is the creation of digital content, like images, videos, and audio, but to help prevent disinformation, users need to be able to able to identify AI-generated content. Techniques such as watermarking can be used to confirm if it comes from a particular AI model or provider. To help reduce the spread of disinformation, all images generated by Amazon Titan Image Generator have an invisible watermark by default. It is designed to be tamper-resistant, helping increase transparency around AI-generated content and combat disinformation. We also introduced a new API (preview) in Amazon Bedrock that checks for the existence of this watermark and helps you confirm whether an image was generated by Titan Image Generator.

Promoted collaboration among companies and governments regarding trust and safety risks

Collaboration among companies, governments, researchers, and the AI community is critical to foster the development of AI that is safe, responsible, and trustworthy. In February 2024, Amazon joined the U.S. Artificial Intelligence Safety Institute Consortium, established by the National Institute of Standards and Technology (NIST). Amazon is collaborating with NIST to establish a new measurement science to enable the identification of scalable, interoperable measurements and methodologies to promote development of trustworthy AI. We are also contributing $5 million in AWS compute credits to the Institute for the development of tools and methodologies to evaluate the safety of foundation models. Also in February, Amazon joined the “Tech Accord to Combat Deceptive Use of AI in 2024 Elections” at the Munich Security Conference. This is an important part of our collective work to advance safeguards against deceptive activity and protect the integrity of elections.

We continue to find new ways to engage in and encourage information-sharing among companies and governments as the technology continues to evolve. This includes our work with Thorn and All Tech is Human to safely design our generative AI services to reduce the risk that they will be misused for child exploitation. We’re also a member of the Frontier Model Forum to advance the science, standards, and best practices in the development of frontier AI models.

Used AI as a force for good to address society’s greatest challenges and supported initiatives that foster education

At Amazon, we are committed to promoting the safe and responsible development of AI as a force for good. We continue to see examples across industries where generative AI is helping to address climate change and improve healthcare. Brainbox AI, a pioneer in commercial building technology, launched the world’s first generative AI-powered virtual building assistant on AWS to deliver insights to facility managers and building operators that will help optimize energy usage and reduce carbon emissions. Gilead, an American biopharmaceutical company, accelerates life-saving drug development with AWS generative AI by understanding a clinical study’s feasibility and optimizing site selection through AI-driven protocol analysis utilizing both internal and real-world datasets.

As we navigate the transformative potential of these technologies, we believe that education is the foundation for realizing their benefits while mitigating risks. That’s why we offer education on potential risks surrounding generative AI systems. Amazon employees have spent tens of thousands of training hours since July 2023, covering a range of critical topics like risk assessments, as well as deep dives into complex considerations surrounding fairness, privacy, and model explainability. As part of Amazon’s “AI Ready” initiative to provide free AI skills training to 2 million people globally by 2025, we’ve launched new free training courses about safe and responsible AI use on our digital learning centers. The courses include “Introduction to Responsible AI” for new-to-cloud learners on AWS Educate and courses like “Responsible AI Practices,” and “Security, Compliance, and Governance for AI Solutions” on AWS Skill Builder.

Delivering groundbreaking innovation with trust at the forefront

As an AI pioneer, Amazon continues to foster the safe, responsible, and trustworthy development of AI technology. We are dedicated to driving innovation on behalf of our customers while also establishing and implementing the necessary safeguards. We’re also committed to working with companies, governments, academia, and researchers alike to deliver groundbreaking generative AI innovation with trust at the forefront.

About the author

Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.

Fine-tune Anthropic’s Claude 3 Haiku in Amazon Bedrock to boost model accuracy and quality

July 10, 2024

by Yanyan Zhang Amazon AWS

Frontier large language models (LLMs) like Anthropic Claude on Amazon Bedrock are trained on vast amounts of data, allowing Anthropic Claude to understand and generate human-like text. Fine-tuning Anthropic Claude 3 Haiku on proprietary datasets can provide optimal performance on specific domains or tasks. The fine-tuning as a deep level of customization represents a key differentiating factor by using your own unique data.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) along with a broad set of capabilities to build generative artificial intelligence (AI) applications, simplifying development with security, privacy, and responsible AI. With Amazon Bedrock custom models, you can customize FMs securely with your data. According to Anthropic, Claude 3 Haiku is the fastest and most cost-effective model on the market for its intelligence category. You can now fine-tune Anthropic Claude 3 Haiku in Amazon Bedrock in a preview capacity in the US West (Oregon) AWS Region. Amazon Bedrock is the only fully managed service that provides you with the ability to fine-tune Anthropic Claude models.

This post introduces the workflow of fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock. We first introduce the general concept of fine-tuning and then focus on the important steps in fining-tuning the model, including setting up permissions, preparing for data, commencing the fine-tuning jobs, and conducting evaluation and deployment of the fine-tuned models.

Solution overview

Fine-tuning is a technique in natural language processing (NLP) where a pre-trained language model is customized for a specific task. During fine-tuning, the weights of the pre-trained Anthropic Claude 3 Haiku model will get updated to enhance its performance on a specific target task. Fine-tuning allows the model to adapt its knowledge to the task-specific data distribution and vocabulary. Hyperparameters like learning rate and batch size need to be tuned for optimal fine-tuning.

Fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock offers significant advantages for enterprises. This process enhances task-specific model performance, allowing the model to handle custom use cases with task-specific performance metrics that meet or surpass more powerful models like Anthropic Claude 3 Sonnet or Anthropic Claude 3 Opus. As a result, businesses can achieve improved performance with reduced costs and latency. Essentially, fine-tuning Anthropic Claude 3 Haiku provides you with a versatile tool to customize Anthropic Claude, enabling you to meet specific performance and latency goals efficiently.

You can benefit from fine-tuning Anthropic Claude 3 Haiku in different use cases, using your own data. The following use cases are well-suited for fine-tuning the Anthropic Claude 3 Haiku model:

Classification – For example, when you have 10,000 labeled examples and want Anthropic Claude to do really well at this task
Structured outputs – For example, when you need Anthropic Claude’s response to always conform to a given structure
Industry knowledge – For example, when you need to teach Anthropic Claude how to answer questions about your company or industry
Tools and APIs – For example, when you need to teach Anthropic Claude how to use your APIs really well

In the following sections, we go through the steps of fine-tuning and deploying Anthropic Claude 3 Haiku in Amazon Bedrock using the Amazon Bedrock console and the Amazon Bedrock API.

Prerequisites

To use this feature, make sure you have satisfied the following requirements:

An active AWS account.
Anthropic Claude 3 Haiku enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console.
Access to the preview of Anthropic Claude 3 Haiku fine-tuning in Amazon Bedrock. To request access, contact your AWS account team or submit a support ticket using the AWS Management Console. When creating the support ticket, choose Bedrock for Service and Models for Category.
The required training dataset (and optional validation dataset) prepared and stored in Amazon Simple Storage Service (Amazon S3).

To create a model customization job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with the following permissions (for more details, see Create a service role for model customization):

A trust relationship, which allows Amazon Bedrock to assume the role
Permissions to access training and validation data in Amazon S3
Permissions to write output data to Amazon S3
Optionally, permissions to decrypt an AWS Key Management Service (AWS KMS) key if you have encrypted resources with a KMS key

The following code is the trust relationship, which allows Amazon Bedrock to assume the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "account-id"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:us-west-2:account-id:model-customization-job/*"
                }
            }
        }
    ] 
}

Prepare the data

To fine-tune the Anthropic Claude 3 Haiku model, the training data must be in JSON Lines (JSONL) format, where each line represents a single training record. Specifically, the training data format aligns with the MessageAPI:

{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}
{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}]}

The following is an example from a text summarization use case used as one-line input for fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock. In JSONL format, each record is one text line.

{
"system": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.",
"messages": [
{"role": "user", "content": "instruction:nnSummarize the news article provided below.nninput:nSupermarket customers in France can add airline tickets to their shopping lists thanks to a unique promotion by a budget airline. ... Based at the airport, new airline launched in 2007 and is a low-cost subsidiary of the airline."},
{"role": "assistant", "content": "New airline has included voucher codes with the branded products ... to pay a booking fee and checked baggage fees ."}
]
}

You can invoke the fine-tuned model using the same MessageAPI format, providing consistency. In each line, the "system" message is optional information, which is a way of providing context and instructions to the model, such as specifying a particular goal or role, often known as a system prompt. The "user" content corresponds to the user’s instruction, and the "assistant" content is the desired response that the fine-tuned model should provide. Fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock supports both single-turn and multi-turn conversations. If you want to use multi-turn conversations, the data format for each line is as follows:

{"system": string, "messages": [{"role": "user", "content": string}, {"role": "assistant", "content": string}, {"role": "user", "content": string}, {"role": "assistant", "content": string}]}

The last line’s "assistant" role represents the desired output from the fine-tuned model, and the previous chat history serves as the prompt input. For both single-turn and multi-turn conversation data, the total length of each record (including system, user, and assistant content) should not exceed 32,000 tokens.

In addition to your training data, you can prepare validation and test datasets. Although it’s optional, a validation dataset is recommended because it allows you to monitor the model’s performance during training. This dataset enables features like early stopping and helps improve model performance and convergence. Separately, a test dataset is used to evaluate the final model’s performance after training is complete. Both additional datasets follow a similar format to your training data, but serve distinct purposes in the fine-tuning process.

If you’re already using Amazon Bedrock to fine-tune Amazon Titan, Meta Llama, or Cohere models, the training data should follow this format:

{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
{"prompt": "<prompt3>", "completion": "<expected generated text>"}

For data in this format, you can use the following Python code to convert to the required format for fine-tuning:

import json

# Define the system string, leave it empty if not needed
system_string = ""

# Input file path
input_file = "Orig-FT-Data.jsonl"

# Output file path
output_file = "Haiku-FT-Data.jsonl"

with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    for line in f_in:
        data = json.loads(line)
        prompt = data["prompt"]
        completion = data["completion"]

        new_data = {}
        if system_string:
            new_data["system"] = system_string
        new_data["messages"] = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion}
        ]

        f_out.write(json.dumps(new_data) + "n")

print("Conversion completed!")

To optimize the fine-tuning performance, the quality of training data is more important than the size of the dataset. We recommend starting with a small but high-quality training dataset (50–100 rows of data is a reasonable start) to fine-tune the model and evaluate its performance. Based on the evaluation results, you can then iterate and refine the training data. Generally, as the size of the high-quality training data increases, you can expect to achieve better performance from the fine-tuned model. However, it’s essential to maintain a focus on data quality, because a large but low-quality dataset may not yield the desired improvements in the fine-tuned model performance.

Currently, the requirements for the number of records in training and validation data for fine-tuning Anthropic Claude 3 Haiku align with the customization limits set by Amazon Bedrock for fine-tuning other models. Specifically, the training data should not exceed 10,000 records, and the validation data should not exceed 1,000 records. These limits provide efficient resource utilization while allowing for model optimization and evaluation within a reasonable data scale.

Fine-tune the model

Fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock allows you to configure various hyperparameters that can significantly impact the fine-tuning process and the resulting model’s performance. The following table summarizes the supported hyperparameters.

Name	Description	Type	Default	Value Range
`epochCount`	The maximum number of iterations through the entire training dataset. `Epochcount` is equivalent to epoch.	integer	2	1–10
`batchSize`	The number of samples processed before updating model parameters.	integer	32	4–256
`learningRateMultiplier`	The multiplier that influences the learning rate at which model parameters are updated after each batch.	float	1	0.1–2
`earlyStoppingThreshold`	The minimum improvement in validation loss required to prevent premature stopping of the training process.	float	0.001	0–0.1
`earlyStoppingPatience`	The tolerance for stagnation in the validation loss metric before stopping the training process.	int	2	1–10

The learningRateMultiplier parameter is a factor that adjusts the base learning rate set by the model itself, which determines the actual learning rate applied during the training process by scaling the model’s base learning rate with this multiplier factor. Typically, you should increase the batchSize when the training dataset size increases, and you may need to perform hyperparameter optimization (HPO) to find the optimal settings. Early stopping is a technique used to prevent overfitting and stop the training process when the validation loss stops improving. The validation loss is computed at the end of each epoch. If the validation loss has not decreased enough (determined by earlyStoppingThreshold) for earlyStoppingPatience times, the training process will be stopped.

For example, the following table shows example validation losses for each epoch during a training process.

Epoch	Validation Loss
1	0.9
2	0.8
3	0.7
4	0.66
5	0.64
6	0.65
7	0.65

The following table illustrates the behavior of early stopping during the training, based on different configurations of earlyStoppingThreshold and earlyStoppingPatience.

Scenario	earlyStopping Threshold	earlyStopping Patience	Training Stopped	Best Checkpoint
1	0	2	Epoch 7	Epoch 5 (val loss 0.64)
2	0.05	1	Epoch 4	Epoch 4 (val loss 0.66)

Choosing the right hyperparameter values is crucial for achieving optimal fine-tuning performance. You may need to experiment with different settings or use techniques like HPO to find the best configuration for your specific use case and dataset.

Run the fine-tuning job on the Amazon Bedrock console

Make sure you have access to the preview of Anthropic Claude 3 Haiku fine-tuning in Amazon Bedrock, as discussed in the prerequisites. After you’re granted access, complete the following steps:

On the Amazon Bedrock console, choose Foundation models in the navigation pane.
Choose Custom models.
In the Models section, on the Customize model menu, choose Create Fine-tuning job.

For Category, choose Anthropic.
For Models available for fine-tuning, choose Claude 3 Haiku.
Choose Apply.

For Fine-tuned model name, enter a name for the model.
Select Model encryption to add a KMS key.
Optionally, expand the Tags section to add tags for tracking.
For Job name, enter a name for the training job.

Before you start a fine-tuning job, create an S3 bucket in the same Region as your Amazon Bedrock service (for example, us-west-2), as mentioned in the prerequisites. At the time of writing, fine-tuning for Anthropic Claude 3 Haiku in Amazon Bedrock is available in preview in the US West (Oregon) Region. Within this S3 bucket, set up separate folders for your training data, validation data, and fine-tuning artifacts. Upload your training and validation datasets to their respective folders.

Under Input data, specify the S3 locations for both your training and validation datasets.

This setup enforces proper data access and Regional compatibility for your fine-tuning process.

Next, you configure the hyperparameters for your fine-tuning job.

Set the number of epochs, batch size, and learning rate multiplier.
If you’ve included a validation dataset, you can enable early stopping.

This feature allows you to set an early stopping threshold and patience value. Early stopping helps prevent overfitting by halting the training process when the model’s performance on the validation set stops improving.

Under Output data, for S3 location, enter the S3 path for the bucket storing fine-tuning metrics.
Under Service access, select a method to authorize Amazon Bedrock. You can select Use an existing service role if you have an access role with fine-grained IAM policies or select Create and use a new service role.
After you have added all the required configurations for fine-tuning Anthropic Claude 3 Haiku, choose Create Fine- tuning job.

When the fine-tuning job starts, you can see the status of the training job (Training or Complete) under Jobs.

As the fine-tuning job progresses, you can find more information about the training job, including job creation time, job duration, input data, and hyperparameters used for the fine-tuning job. Under Output data, you can navigate to the fine-tuning folder in the S3 bucket, where you can find the training and validation metrics that were computed as part of the fine-tuning job.

Run the fine-tuning job using the Amazon Bedrock API

Make sure to request access to the preview of Anthropic Claude 3 Haiku fine-tuning in Amazon Bedrock, as discussed in the prerequisites.

To start a fine-tuning job for Anthropic Claude 3 Haiku using the Amazon Bedrock API, complete the following steps:

Create an Amazon Bedrock client and set the base model ID for the Anthropic Claude 3 Haiku model:

import boto3
bedrock = boto3.client(service_name="bedrock")
base_model_id = "anthropic.claude-3-haiku-20240307-v1:0:200k"

Generate a unique job name and custom model name, typically using a timestamp:

from datetime import datetime
ts = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
customization_job_name = f"model-finetune-job-{ts}"
custom_model_name = f"finetuned-model-{ts}"

Specify the IAM role ARN that has the necessary permissions to access the required resources for the fine-tuning job, as discussed in the prerequisites:
```
customization_role = "arn:aws:iam::<YOUR_AWS_ACCOUNT_ID>:role/<YOUR_IAM_ROLE_NAME>"
```

Set the customization type to FINE_TUNING and define the hyperparameters for fine-tuning the model, as discussed in the previous session:

customization_type = "FINE_TUNING"
hyper_parameters = {
"epochCount": "5",
"batchSize": "32",
"learningRateMultiplier": "0.05",
"earlyStoppingThreshold": "0.001",
"earlyStoppingPatience": "2"
}

Configure the S3 bucket and prefix where the fine-tuned model and output data will be stored, and provide the S3 data paths for your training and validation datasets (the validation dataset is optional):

s3_bucket_name = "<YOUR_S3_BUCKET_NAME>"
s3_bucket_config = f"s3://{s3_bucket_name}/outputs/output-{custom_model_name}"
s3_train_uri = "s3://<YOUR_S3_BUCKET_NAME>/<YOUR_TRAINING_DATA_PREFIX>"
s3_validation_uri = "s3://<YOUR_S3_BUCKET_NAME>/<YOUR_VALIDATION_DATA_PREFIX>"
training_data_config = {"s3Uri": s3_train_uri}
validation_data_config = {
    "validators": [{
        "s3Uri": s3_validation_uri
    }]
}

With these configurations in place, you can create the fine-tuning job using the create_model_customization_job method from the Amazon Bedrock client, passing in the required parameters:

training_job_response = bedrock.create_model_customization_job(
    customizationType=customization_type,
    jobName=customization_job_name,
    customModelName=custom_model_name,
    roleArn=customization_role,
    baseModelIdentifier=base_model_id,
    hyperParameters=hyper_parameters,
    trainingDataConfig=training_data_config,
    validationDataConfig=validation_data_config,
    outputDataConfig=output_data_config
)

The create_model_customization method will return a response containing information about the created fine-tuning job. You can monitor the job’s progress and retrieve the fine-tuned model when the job is complete, either through the Amazon Bedrock API or Amazon Bedrock console.

Deploy and evaluate the fine-tuned model

After successfully fine-tuning the model, you can evaluate the fine-tuning metrics recorded during the process. These metrics are stored in the specified S3 bucket for evaluation purposes. For the training data, step-wise training metrics are recorded with columns, including step_number, epoch_number, and training_loss.

If you provided a validation dataset, additional validation metrics are stored in a separate file, including step_number, epoch_number, and corresponding validation_loss.

When you’re satisfied with the fine-tuning metrics, you can purchase Provisioned Throughput to deploy your fine-tuned model, which allows you to take advantage of the improved performance and specialized capabilities of the fine-tuned model in your applications. Provisioned Throughput refers to the number and rate of inputs and outputs that a model processes and returns. To use a fine-tuned model, you must purchase Provisioned Throughput, which is billed hourly. The pricing for Provisioned Throughput depends on the following factors:

The base model the fine-tuned model was customized from.
The number of Model Units (MUs) specified for the Provisioned Throughput. MU is a unit that specifies the throughput capacity for a given model; each MU defines the number of input tokens it can process and output tokens it can generate across all requests within 1 minute.
The commitment duration, which can be no commitment, 1 month, or 6 months. Longer commitments offer more discounted hourly rates.

After Provisioned Throughput is set up, you can use the MessageAPI to invoke the fine-tuned model, similar to how the base model is invoked. This provides a seamless transition and maintains compatibility with existing applications or workflows.

It’s crucial to evaluate the performance of the fine-tuned model to make sure it meets the desired criteria and outperforms in specific tasks. You can conduct various evaluations, including comparing the fine-tuned model with the base model, or even evaluating performance against more advanced models, like Anthropic Claude 3 Sonnet.

Deploy the fine-tuned model using the Amazon Bedrock console

To deploy the fine-tuned model using the Amazon Bedrock console, complete the following steps:

On the Amazon Bedrock console, choose Custom models in the navigation pane.
Select the fine-tuned model and choose Purchase Provisioned Throughput.

For Provisioned Throughput name¸ enter a name.
Choose the model you want to deploy.
For Commitment term, choose your level of commitment (for this post, we choose No commitment).
Choose Purchase Provisioned Throughput.

After the fine-tuned model has been deployed using Provisioned Throughput, you can see the model status as In service when you go to the Provisioned Throughput page on the Amazon Bedrock console.

You can use the fine-tuned model deployed using Provisioned Throughput for task-specific use cases. In the Amazon Bedrock playground, you can find the fine-tuned model under Custom models and use it for inference.

Deploy the fine-tuned model using the Amazon Bedrock API

To deploy the fine-tuned model using the Amazon Bedrock API, complete the following steps:

Retrieve the fine-tuned model ID from the job’s output, and create a Provisioned Throughput model instance with the desired model units:

import boto3
bedrock = boto3.client(service_name='bedrock')

custom_model_id = training_job_response["customModelId"]
provisioned_model_id = bedrock.create_provisioned_model_throughput(
modelUnits=1,
provisionedModelName='finetuned-haiku-model',
modelId=custom_model_id
)['provisionedModelArn']

When the Provisioned Throughput model is ready, you can call the invoke_model function from the Amazon Bedrock runtime client to generate text using the fine-tuned model:

import json
bedrock_runtime = boto3.client(service_name='bedrock-runtime')

body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 2048,
"messages": [{"role": "user", "content": <YOUR_INPUT_PROMPT_STRING>}],
"temperature": 0.1,
"top_p": 0.9,
"system": <YOUR_SYSTEM_PROMPT_STRING>
})

fine_tuned_response = bedrock_runtime.invoke_model(body=body, modelId=provisioned_model_id)
fine_tuned_response_body = json.loads(fine_tuned_response.get('body').read())
print("Fine tuned model response:", fine_tuned_response_body['content'][0]['text']+'n')

By following these steps, you can deploy and use your fine-tuned Anthropic Claude 3 Haiku model through the Amazon Bedrock API, allowing you to generate customized Anthropic Claude 3 Haiku models tailored to your specific requirements.

Conclusion

Fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock empowers enterprises to optimize this LLM for your specific needs. By combining Amazon Bedrock with Anthropic Claude 3 Haiku’s speed and cost-effectiveness, you can efficiently customize the model while maintaining robust security. This process enhances the model’s accuracy and tailors its outputs to unique business requirements, driving significant improvements in efficiency and effectiveness.

Fine-tuning Anthropic Claude 3 Haiku in Amazon Bedrock is now available in preview in the US West (Oregon) Region. To request access to the preview, contact your AWS account team or submit a support ticket.

About the Authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Sovik Kumar Nath is an AI/ML and Generative AI Senior Solutions Architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double master’s degrees from the University of South Florida and University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and going on adventures.

Carrie Wu is an Applied Scientist at Amazon Web Services, working on fine-tuning large language models for alignment to custom tasks and responsible AI. She graduated from Stanford University with a PhD in Management Science and Engineering. Outside of work, she loves reading, traveling, aerial yoga, ice skating, and spending time with her dog.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 2

July 9, 2024

by James Wu Amazon AWS

As generative artificial intelligence (AI) inference becomes increasingly critical for businesses, customers are seeking ways to scale their generative AI operations or integrate generative AI models into existing workflows. Model optimization has emerged as a crucial step, allowing organizations to balance cost-effectiveness and responsiveness, improving productivity. However, price-performance requirements vary widely across use cases. For chat applications, minimizing latency is key to offer an interactive experience, whereas real-time applications like recommendations require maximizing throughput. Navigating these trade-offs poses a significant challenge to rapidly adopt generative AI, because you must carefully select and evaluate different optimization techniques.

To overcome these challenges, we are excited to introduce the inference optimization toolkit, a fully managed model optimization feature in Amazon SageMaker. This new feature delivers up to ~2x higher throughput while reducing costs by up to ~50% for generative AI models such as Llama 3, Mistral, and Mixtral models. For example, with a Llama 3-70B model, you can achieve up to ~2400 tokens/sec on a ml.p5.48xlarge instance v/s ~1200 tokens/sec previously without any optimization.

This inference optimization toolkit uses the latest generative AI model optimization techniques such as compilation, quantization, and speculative decoding to help you reduce the time to optimize generative AI models from months to hours, while achieving the best price-performance for your use case. For compilation, the toolkit uses the Neuron Compiler to optimize the model’s computational graph for specific hardware, such as AWS Inferentia, enabling faster runtimes and reduced resource utilization. For quantization, the toolkit utilizes Activation-aware Weight Quantization (AWQ) to efficiently shrink the model size and memory footprint while preserving quality. For speculative decoding, the toolkit employs a faster draft model to predict candidate outputs in parallel, enhancing inference speed for longer text generation tasks. To learn more about each technique, refer to Optimize model inference with Amazon SageMaker. For more details and benchmark results for popular open source models, see Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1.

In this post, we demonstrate how to get started with the inference optimization toolkit for supported models in Amazon SageMaker JumpStart and the Amazon SageMaker Python SDK. SageMaker JumpStart is a fully managed model hub that allows you to explore, fine-tune, and deploy popular open source models with just a few clicks. You can use pre-optimized models or create your own custom optimizations. Alternatively, you can accomplish this using the SageMaker Python SDK, as shown in the following notebook. For the full list of supported models, refer to Optimize model inference with Amazon SageMaker.

Using pre-optimized models in SageMaker JumpStart

The inference optimization toolkit provides pre-optimized models that have been optimized for best-in-class cost-performance at scale, without any impact to accuracy. You can choose the configuration based on the latency and throughput requirements of your use case and deploy in a single click.

Taking the Meta-Llama-3-8b model in SageMaker JumpStart as an example, you can choose Deploy from the model page. In the deployment configuration, you can expand the model configuration options, select the number of concurrent users, and deploy the optimized model.

Deploying a pre-optimized model with the SageMaker Python SDK

You can also deploy a pre-optimized generative AI model using the SageMaker Python SDK in just a few lines of code. In the following code, we set up a ModelBuilder class for the SageMaker JumpStart model. ModelBuilder is a class in the SageMaker Python SDK that provides fine-grained control over various deployment aspects, such as instance types, network isolation, and resource allocation. You can use it to create a deployable model instance, converting framework models (like XGBoost or PyTorch) or Inference Specs into SageMaker-compatible models for deployment. Refer to Create a model in Amazon SageMaker with ModelBuilder for more details.

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens":128, "do_sample":True}
}

sample_output = [
    {
        "generated_text": "Hello, I'm a language model, and I'm here to help you with your English."
    }
]
schema_builder = SchemaBuilder(sample_input, sample_output)

builder = ModelBuilder(
    model="meta-textgeneration-llama-3-8b", # JumpStart model ID
    schema_builder=schema_builder,
    role_arn=role_arn,
)

List the available pre-benchmarked configurations with the following code:

builder.display_benchmark_metrics()

Choose the appropriate instance_type and config_name from the list based on your concurrent users, latency, and throughput requirements. In the preceding table, you can see the latency and throughput across different concurrency levels for the given instance type and config name. If config name is lmi-optimized, that means the configuration is pre-optimized by SageMaker. Then you can call .build() to run the optimization job. When the job is complete, you can deploy to an endpoint and test the model predictions. See the following code:

# set deployment config with pre-configured optimization
bulder.set_deployment_config(
    instance_type="ml.g5.12xlarge", 
    config_name="lmi-optimized"
)

# build the deployable model
model = builder.build()

# deploy the model to a SageMaker endpoint
predictor = model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Using the inference optimization toolkit to create custom optimizations

In addition to creating a pre-optimized model, you can create custom optimizations based on the instance type you choose. The following table provides a full list of available combinations. In the following sections, we explore compilation on AWS Inferentia first, and then try the other optimization techniques for GPU instances.

Instance Types	Optimization Technique	Configurations
AWS Inferentia	Compilation	Neuron Compiler
GPUs	Quantization	AWQ
GPUs	Speculative Decoding	SageMaker provided or Bring Your Own (BYO) draft model

Compilation from SageMaker JumpStart

For compilation, you can select the same Meta-Llama-3-8b model from SageMaker JumpStart and choose Optimize on the model page. In the optimization configuration page, you can choose ml.inf2.8xlarge for your instance type. Then provide an output Amazon Simple Storage Service (Amazon S3) location for the optimized artifacts. For large models like Llama 2 70B, for example, the compilation job can take more than an hour. Therefore, we recommend using the inference optimization toolkit to perform ahead-of-time compilation. That way, you only need to compile one time.

Compilation using the SageMaker Python SDK

For the SageMaker Python SDK, you can configure the compilation by changing the environment variables in the .optimize() function. For more details on compilation_config, refer to LMI NeuronX ahead-of-time compilation of models tutorial.

compiled_model = builder.optimize(
    instance_type="ml.inf2.8xlarge",
    accept_eula=True,
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "2",
            "OPTION_N_POSITIONS": "2048",
            "OPTION_DTYPE": "fp16",
            "OPTION_ROLLING_BATCH": "auto",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
            "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
        }
   },
   output_path=f"s3://{output_bucket_name}/compiled/"
)

# deploy the compiled model to a SageMaker endpoint
predictor = compiled_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Quantization and speculative decoding from SageMaker JumpStart

For optimizing models on GPU, ml.g5.12xlarge is the default deployment instance type for Llama-3-8b. You can choose quantization, speculative decoding, or both as optimization options. Quantization uses AWQ to reduce the model’s weights to low-bit (INT4) representations. Finally, you can provide an output S3 URL to store the optimized artifacts.

With speculative decoding, you can improve latency and throughput by either using the SageMaker provided draft model or bringing your own draft model from the public Hugging Face model hub or upload from your own S3 bucket.

After the optimization job is complete, you can deploy the model or run further evaluation jobs on the optimized model. On the SageMaker Studio UI, you can choose to use the default sample datasets or provide your own using an S3 URI. At the time of writing, the evaluate performance option is only available through the Amazon SageMaker Studio UI.

Quantization and speculative decoding using the SageMaker Python SDK

The following is the SageMaker Python SDK code snippet for quantization. You just need to provide the quantization_config attribute in the .optimize() function.

optimized_model = builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq"
        }
    },
    output_path=f"s3://{output_bucket_name}/quantized/"
)

# deploy the optimized model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

For speculative decoding, you can change to a speculative_decoding_config attribute and configure SageMaker or a custom model. You may need to adjust the GPU utilization based on the sizes of the draft and target models to fit them both on the instance for inference.

optimized_model = builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    speculative_decoding_config={
        "ModelProvider": "sagemaker"
    }
    # speculative_decoding_config={
        # "ModelProvider": "custom",
        # use S3 URI or HuggingFace model ID for custom draft model
        # note: using HuggingFace model ID as draft model requires HF_TOKEN in environment variables
        # "ModelSource": "s3://custom-bucket/draft-model", 
    # }
)

# deploy the optimized model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Conclusion

Optimizing generative AI models for inference performance is crucial for delivering cost-effective and responsive generative AI solutions. With the launch of the inference optimization toolkit, you can now optimize your generative AI models, using the latest techniques such as speculative decoding, compilation, and quantization to achieve up to ~2x higher throughput and reduce costs by up to ~50%. This helps you achieve the optimal price-performance balance for your specific use cases with just a few clicks in SageMaker JumpStart or a few lines of code using the SageMaker Python SDK. The inference optimization toolkit significantly simplifies the model optimization process, enabling your business to accelerate generative AI adoption and unlock more opportunities to drive better business outcomes.

To learn more, refer to Optimize model inference with Amazon SageMaker and Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1.

About the Authors

James Wu is a Senior AI/ML Specialist Solutions Architect
Saurabh Trikande is a Senior Product Manager
Rishabh Ray Chaudhury is a Senior Product Manager
Kumara Swami Borra is a Front End Engineer
Alwin (Qiyun) Zhao is a Senior Software Development Engineer
Qing Lan is a Senior SDE

Achieve up to ~2x higher throughput while reducing costs by ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1

July 9, 2024

by Raghu Ramesha Amazon AWS

Today, Amazon SageMaker announced a new inference optimization toolkit that helps you reduce the time it takes to optimize generative artificial intelligence (AI) models from months to hours, to achieve best-in-class performance for your use case. With this new capability, you can choose from a menu of optimization techniques, apply them to your generative AI models, validate performance improvements, and deploy the models in just a few clicks.

By employing techniques such as speculative decoding, quantization, and compilation, Amazon SageMaker’s new inference optimization toolkit delivers up to ~2x higher throughput while reducing costs by up to ~50% for generative AI models such as Llama 3, Mistral, and Mixtral models. For example, with a Llama 3-70B model, you can achieve up to ~2400 tokens/sec on a ml.p5.48xlarge instance v/s ~1200 tokens/sec previously without any optimization. Additionally, the inference optimization toolkit significantly reduces the engineering costs of applying the latest optimization techniques, because you don’t need to allocate developer resources and time for research, experimentation, and benchmarking before deployment. You can now focus on your business objectives instead of the heavy lifting involved in optimizing your models.

In this post, we discuss the benefits of this new toolkit and the use cases it unlocks.

Benefits of the inference optimization toolkit

“Large language models (LLMs) require expensive GPU-based instances for hosting, so achieving a substantial cost reduction is immensely valuable. With the new inference optimization toolkit from Amazon SageMaker, based on our experimentation, we expect to reduce deployment costs of our self-hosted LLMs by roughly 30% and to reduce latency by up to 25% for up to 8 concurrent requests” said FNU Imran, Machine Learning Engineer, Qualtrics.

Today, customers try to improve price-performance by optimizing their generative GenAI models with techniques such as speculative decoding, quantization, and compilation. Speculative decoding achieves speedup by predicting and computing multiple potential next tokens in parallel, thereby reducing the overall runtime, without loss in accuracy. Quantization reduces the memory requirements of the model by using a lower-precision data type to represent weights and activations. Compilation optimizes the model to deliver the best-in-class performance on the chosen hardware type, without loss in accuracy.

However, it takes months of developer time to optimally implement generative AI optimization techniques—you need to go through a myriad of open source documentation, iteratively prototype different techniques, and benchmark before finalizing on the deployment configurations. Additionally, there’s a lack of compatibility across techniques and various open source libraries, making it difficult to stack different techniques for best price-performance.

The inference optimization toolkit helps address these challenges by making the process simpler and more efficient. You can select from a menu of latest model optimization techniques and apply them to your models. You can also select a combination of techniques to create an optimization recipe for their models. Then you can run benchmarks using your custom data to evaluate the impact of the techniques on the output quality and the inference performance in just a few clicks. SageMaker will do the heavy lifting of provisioning the required hardware to run the optimization recipe using the most efficient set of deep learning frameworks and libraries, and provide compatibility with target hardware so that the techniques can be efficiently stacked together. You can now deploy popular models like Llama 3 and Mistral available on Amazon SageMaker JumpStart with accelerated inference techniques within minutes, either using the Amazon SageMaker Studio UI or the Amazon SageMaker Python SDK. A number of preset serving configurations are exposed for each model, along with precomputed benchmarks that provide you with different options to pick between lower cost (higher concurrency) or lower per-user latency.

Using the new inference optimization toolkit, we benchmarked the performance and cost impact of different optimization techniques. The toolkit allowed us to evaluate how each technique affected throughput and overall cost-efficiency for our question answering use case.

Speculative decoding

Speculative decoding is an inference technique that aims to speed up the decoding process of large and therefore slow LLMs for latency-critical applications without compromising the quality of the generated text. The key idea is to use a smaller, less powerful, but faster language model called the draft model to generate candidate tokens that are then validated by the larger, more powerful, but slower target model. At each iteration, the draft model generates multiple candidate tokens. The target model verifies the tokens and if it finds a particular token is not acceptable, it rejects it and regenerates that itself. Therefore, the larger model may be doing both verification and some small amount of token generation. The smaller model is significantly faster than the larger model. It can generate all the tokens quickly and then send batches of these tokens to the target models for verification. The target models evaluate them all in parallel, significantly speeding up the final response generation (verification is faster than auto-regressive token generation). For a more detailed understanding, refer to the paper from DeepMind, Accelerating Large Language Model Decoding with Speculative Sampling.

With the new inference optimization toolkit from SageMaker, you get out-of-the-box support for speculative decoding from SageMaker that has been tested for performance at scale for various popular open models. SageMaker offers a pre-built draft model that you can use out of the box, eliminating the need to invest time and resources in building your own draft model from scratch. If you prefer to use your own custom draft model, SageMaker also supports this option, providing flexibility to accommodate your specific custom draft models.

The following graph showcases the throughput (tokens per second) for a Llama3-70B model deployed on a ml.p5.48xlarge using speculative decoding provided through SageMaker compared to a deployment without speculative decoding.

While the results below use a ml.p5.48xlarge, you can also look at exploring deploying Llama3-70 with speculative decoding on a ml.p4d.24xlarge.

The dataset used for these benchmarks is based on a curated version of the OpenOrca question answering use case, where the payloads are between 500–1,000 tokens with mean 622 with respect to the Llama 3 tokenizer. All requests set the maximum new tokens to 250.

Given the increase in throughput that is realized with speculative decoding, we can also see the blended price difference when using speculative decoding vs. when not using speculative decoding.

Here we have calculated the blended price as a 3:1 ratio of input to output tokens.

The blended price is defined as follows:

Total throughput (tokens per second) = (1/(p50 inter token latency)) x concurrency
Blended price ($ per 1 million tokens) = (1−(discount rate)) × (instance per hour price) ÷ ((total token throughput per second)×60×60÷10^6)) ÷ 4

Check out the following notebook to learn how to enable speculative decoding using the optimization toolkit for a pre-trained SageMaker JumpStart model.

The following are not supported when using the SageMaker provided draft model for speculative decoding:

If you have fine-tuned your model outside of SageMaker JumpStart
The custom inference script is not supported when using SageMaker draft models
Local testing
Inference components

Quantization

Quantization is one of the most popular model compression methods to reduce your memory footprint and accelerate inference. By using a lower-precision data type to represent weights and activations, quantizing LLM weights for inference provides four main benefits:

Reduced hardware requirements for model serving – A quantized model can be served using less expensive and more available GPUs or even made accessible on consumer devices or mobile platforms.
Increased space for the KV cache – This enables larger batch sizes and sequence lengths.
Faster decoding latency – Because the decoding process is memory bandwidth bound, less data movement from reduced weight sizes directly improves decoding latency, unless offset by dequantization overhead.
A higher compute-to-memory access ratio (through reduced data movement) – This is also known as arithmetic intensity. This allows for fuller utilization of available compute resources during decoding.

For quantization, the inference optimization toolkit from SageMaker provides compatibility and supports Activation-aware Weight Quantization (AWQ) for GPUs. AWQ is an efficient and accurate low-bit (INT3/4) post-training weight-only quantization technique for LLMs, supporting instruction-tuned models and multi-modal LLMs. By quantizing the model weights to INT4 using AWQ, you can deploy larger models (like Llama 3 70B) on ml.g5.12xlarge, which is 79.88% cheaper than ml.p4d.24xlarge based on the 1 year SageMaker Savings Plan rate. The memory footprint of INT4 weights is four times smaller than that of native half-precision weights (35 GB vs. 140 GB for Llama 3 70B).

The following graph compares the throughput of an AWQ quantized Llama 3 70B instruct model on ml.g5.12xlarge against a Llama 3 70B instruct model on ml.p4d.24xlarge.There could be implications to the accuracy of the AWQ quantized model due to compression. Having said that, the price-performance is better on ml.g5.12xlarge and the throughput per instance is lower. You can add additional instances to your SageMaker endpoint according to your use case. We can see the cost savings realized in the following blended price graph.

In the following graph, we have calculated the blended price as a 3:1 ratio of input to output tokens. In addition, we applied the 1 year SageMaker Savings Plan rate for the instances.

Refer to the following notebook to learn more about how to enable AWQ quantization and speculative decoding using the optimization toolkit for a pre-trained SageMaker JumpStart model. If you want to deploy a fine-tuned model with SageMaker JumpStart using speculative decoding, refer to the following notebook.

Compilation

Compilation optimizes the model to extract the best available performance on the chosen hardware type, without any loss in accuracy. For compilation, the SageMaker inference optimization toolkit provides efficient loading and caching of optimized models to reduce model loading and auto scaling time by up to 40–60 % for Llama 3 8B and 70B.

Model compilation enables running LLMs on accelerated hardware, such as GPUs or custom silicon like AWS Trainium and AWS Inferentia, while simultaneously optimizing the model’s computational graph for optimal performance on the target hardware. On Trainium and AWS Inferentia, the Neuron Compiler ingests deep learning models in various formats such as PyTorch and safetensors, and then optimizes them to run efficiently on Neuron devices. When using the Large Model Inference (LMI) Deep Learning Container (DLC), the Neuron Compiler is invoked from within the framework and creates compiled artifacts. These compiled artifacts are unique for a combination of input shapes, precision of the model, tensor parallel degree, and other framework- or compiler-level configurations. Although the compilation process avoids overhead during inference and enables optimized inference, it can take a lot of time.

To avoid re-compiling every time a model is deployed onto a Neuron device, SageMaker introduces the following features:

A cache of pre-compiled artifacts – This includes popular models like Llama 3, Mistral, and more for Trainium and AWS Inferentia 2. When using an optimized model with the compilation config, SageMaker automatically uses these cached artifacts when the configurations match.
Ahead-of-time compilation – The inference optimization toolkit enables you to compile your models with the desired configurations before deploying them on SageMaker.

The following graph illustrates the improvement in model loading time when using pre-compiled artifacts with the SageMaker LMI DLC. The models were compiled with a sequence length of 8,192 and a batch size of 8, with Llama 3 8B deployed on an inf2.48xlarge (tensor parallel degree = 24) and Llama 3 70B on a trn1.32xlarge (tensor parallel degree = 32).

Refer to the following notebook for more information on how to enable Neuron compilation using the optimization toolkit for a pre-trained SageMaker JumpStart model.

Pricing

For compilation and quantization jobs, SageMaker will optimally choose the right instance type, so you don’t have to spend time and effort. You will be charged based on the optimization instance used. To learn model, see Amazon SageMaker pricing. For speculative decoding, there is no optimization involved; QuickSilver will package the right container and parameters for the deployment. Therefore, there are no additional costs to you.

Conclusion

Refer to Achieve up to 2x higher throughput while reducing cost by up to 50% for GenAI inference on SageMaker with new inference optimization toolkit: user guide – Part 2 blog to learn to get started with the inference optimization toolkit when using SageMaker inference with SageMaker JumpStart and the SageMaker Python SDK. You can use the inference optimization toolkit on any supported models on SageMaker JumpStart. For the full list of supported models, refer to Optimize model inference with Amazon SageMaker.

About the authors

Raghu Ramesha is a Senior GenAI/ML Solutions Architect
Marc Karp is a Senior ML Solutions Architect
Ram Vegiraju is a Solutions Architect
Pierre Lienhart is a Deep Learning Architect
Pinak Panigrahi is a Senior Solutions Architect Annapurna ML
Rishabh Ray Chaudhury is a Senior Product Manager

Anthropic Claude 3.5 Sonnet ranks number 1 for business and finance in S&P AI Benchmarks by Kensho

July 9, 2024

by Qingwei Li Amazon AWS

Anthropic Claude 3.5 Sonnet currently ranks at the top of S&P AI Benchmarks by Kensho, which assesses large language models (LLMs) for finance and business. Kensho is the AI Innovation Hub for S&P Global. Using Amazon Bedrock, Kensho was able to quickly run Anthropic Claude 3.5 Sonnet through a challenging suite of business and financial tasks. We discuss these tasks and the capabilities of Anthropic Claude 3.5 Sonnet in this post.

Limitations of LLM evaluations

It is a common practice to use standardized tests, such as Massive Multitask Language Understanding (MMLU, a test consisting of multiple-choice questions that cover 57 disciplines like math, philosophy, and medicine) and HumanEval (testing code generation), to evaluate LLMs. Although these evaluations are useful in giving LLM users a sense of an LLM’s relative performance, they have limitations. For example, there could be leakage of benchmark datasets’ questions and answers into training data. Additionally, today’s LLMs work well for general tasks, such as question answering tasks and code generation. However, these capabilities don’t always translate to domain-specific tasks. In the financial services industry, we hear customers ask which model to choose for their financial domain generative artificial intelligence (AI) applications. These applications require the LLMs to have requisite domain knowledge and be able to reason about numeric data to calculate metrics and extract insights. We have also heard from customers that highly ranked general benchmark LLMs don’t necessarily provide them with the best performance for their given finance and business applications.

Our customers often ask us if we have a benchmark of LLMs just for the financial industry that could help them pick the right LLMs faster.

S&P AI Benchmarks by Kensho

When Kensho’s R&D lab began to research and develop useful, challenging datasets for finance and business, it quickly became clear that within the finance industry, there was a scarcity of such realistic evaluations. To address this challenge, the lab created S&P AI Benchmarks, which aims to serve as the industry standard for benchmarking models for finance and business.

“By offering a robust and independent benchmarking solution, we want to help the financial services industry make smart decisions about which models to implement for which use cases.”

– Bhavesh Dayalji, Chief AI Officer of S&P Global and CEO of Kensho.

S&P AI Benchmarks focuses on measuring models’ ability to perform tasks that center around three categories of capabilities and knowledge: domain knowledge, quantity extraction, and quantitative reasoning (more details can be found in this paper). This publicly available resource includes a corresponding leaderboard, which allows everyone to see the performance of every state-of-the-art language model that has been evaluated on these rigorous tasks. Anthropic Claude 3.5 Sonnet is currently ranked number one (as of July 2024), demonstrating Anthropic’s strengths in the business and finance domain.

Kensho chose to test their benchmark with Amazon Bedrock because of its ease of use and enterprise-ready security and privacy controls.

The evaluation tasks

S&P AI Benchmarks evaluates LLMs using a wide range of questions concerning finance and business. The evaluation comprises 600 questions spanning three categories: domain knowledge, quantity extraction, and quantitative reasoning. Each question has been verified by domain experts and finance professionals with over 5 years of experience.

Quantitative reasoning

This task determines if, given a question and lengthy documents, the model can perform complex calculations and correctly reason to produce an accurate answer. The questions are written by financial professionals using real-world data and financial knowledge. As such, they are closer to the kinds of questions that business and financial professionals would ask in a generative AI application. The following is an example:

Question: The market price of K-T-Lew Corporation’s common stock is $60 per share, and each share gives its owner one subscription right. Four rights are required to purchase an additional share of common stock at the subscription price of $54 per share. If the common stock is currently selling rights-on, what is the theoretical value of a right? Answer to the nearest cent.

To answer the question, LLMs must resolve complex quantity references and use implicit financial background knowledge. For example, “subscription right,” “selling rights-on,” and “subscription price” in the preceding question require financial background knowledge to understand the terms. To generate the answer, LLMs need to have the financial knowledge of calculating the “theoretical value of a right.”

Quantity extraction

Given financial reports, an LLM can extract the pertinent numerical information. Many business and finance workflows require high-precision quantity extraction. In the following example, for an LLM to answer the question correctly, it needs to understand the table row represents location and the column represents year, and then extract the correct quantity (total amount) from the table based on the asked location and year:

Question: What was the Total Americas amount in 2019? (thousand)

Given Context: The Company’s top ten clients accounted for 42.2%, 44.2% and 46.9% of 
its consolidated revenues during the years ended December 31, 2019, 2018 and 2017, respectively.
The following table represents a disaggregation of revenue from contracts with customers by 
delivery location (in thousands):

Years Ended December 31,
	2019	2018	2017
Americas:	.	.	.
United States	$614,493	$668,580	$644,870
The Philippines	250,888	231,966	241,211
Costa Rica	127,078	127,963	132,542
Canada	99,037	102,353	112,367
El Salvador	81,195	81,156	75,800
Other	123,969	118,620	118,853
Total Americas	1,296,660	1,330,638	1,325,643
EMEA:	.	.	.
Germany	94,166	91,703	81,634
Other	223,847	203,251	178,649
Total EMEA	318,013	294,954	260,283
Total Other	89	95	82
.	$1,614,762	$1,625,687	$1,586,008

Domain knowledge

Models must demonstrate an understanding of business and financial terms, practices, and formulae. The task is to answer multiple-choice questions collected from CFA practice exams and the business ethics, microeconomics, and professional accounting exams from the MMLU dataset. In the following example question, the LLM needs to understand what a fixed-rate system is:

Question: A fixed-rate system is characterized by:
A: Explicit legislative commitment to maintain a specified parity.
B: Monetary independence being subject to the maintenance of an exchange rate peg.
C: Target foreign exchange reserves bearing a direct relationship to domestic monetary aggregates.

Anthropic Claude 3.5 Sonnet on Amazon Bedrock

In addition to ranking at the top on S&P AI Benchmarks, Anthropic Claude 3.5 Sonnet yields state-of-the-art performance on a wide range of other tasks, including undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), code (HumanEval), and more. As pointed out in Anthropic’s Claude 3.5 Sonnet model now available in Amazon Bedrock: Even more intelligence than Claude 3 Opus at one-fifth the cost, Anthropic Claude 3.5 Sonnet made key improvements in visual processing and understanding, writing and content generation, natural language processing, coding, and generating insights.

Get started with Anthropic Claude 3.5 Sonnet on Amazon Bedrock

Anthropic Claude 3.5 Sonnet is generally available in Amazon Bedrock as part of the Anthropic Claude family of AI models. Amazon Bedrock is a fully managed service that offers quick access to a choice of industry-leading LLMs and other foundation models from AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon. It also offers a broad set of capabilities to build generative AI applications, simplifying development while supporting privacy and security. Tens of thousands of customers have already selected Amazon Bedrock as the foundation for their generative AI strategy. Customers from the financial industry such as Nasdaq, NYSE, Broadridge, Jefferies, NatWest, and more use Amazon Bedrock to build their generative AI applications.

“The Kensho team uses Amazon Bedrock to quickly evaluate models from several different providers. In fact, access to Amazon Bedrock allowed the team to benchmark Anthropic Claude 3.5 Sonnet within 24 hours.”

– Diana Mingels, Head of Machine Learning at Kensho.

Conclusion

In this post, we walked through the S&P AI Benchmarks task details for business and finance. The benchmark shows that Anthropic Claude 3.5 Sonnet is the leading performer in these tasks. To start using this new model, see Anthropic Claude models. With Amazon Bedrock, you get a fully managed service offering access to leading AI models from companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications. Learn more and get started today at Amazon Bedrock.

About the authors

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Joe Dunn is an AWS Principal Solutions Architect in Financial Services with over 20 years of experience in infrastructure architecture and migration of business-critical loads to AWS. He helps financial services customers to innovate on the AWS Cloud by providing solutions using AWS products and services.

Raghvender Arni (Arni) is a part of the AWS Generative AI GTM team and leads the Cross-Portfolio team which is a multidisciplinary group of AI specialists dedicated to accelerating and optimizing generative AI adoption across industries.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Scott Mullins is Managing Director and General Manger of AWS’ Worldwide Financial Services organization. In this role, Scott is responsible for AWS’ relationships with systemically important financial institutions, and for leading the development and execution of AWS’ strategic initiatives across Banking, Payments, Capital Markets, and Insurance around the world. Prior to joining AWS in 2014, Scott’s 28-year career in financial services included roles at JPMorgan Chase, Nasdaq, Merrill Lynch, and Penson Worldwide. At Nasdaq, Scott was the Product Manager responsible for building the exchange’s first cloud-based solution, FinQloud. Before joining NASDAQ, Scott ran Surveillance and Trading Compliance for one of the nation’s largest clearing broker-dealers, with responsibility for regulatory response, emerging regulatory initiatives, and compliance matters related to the firm’s trading and execution services divisions. Prior to his roles in regulatory compliance, Scott spent 10 years as an equity trader. A graduate of Texas A&M University, Scott is a subject matter expert quoted in industry media, and a recognized speaker at industry events..

The Weather Company enhances MLOps with Amazon SageMaker, AWS CloudFormation, and Amazon CloudWatch

July 8, 2024

by Qaish Kanchwala Amazon AWS

This blog post is co-written with Qaish Kanchwala from The Weather Company.

As industries begin adopting processes dependent on machine learning (ML) technologies, it is critical to establish machine learning operations (MLOps) that scale to support growth and utilization of this technology. MLOps practitioners have many options to establish an MLOps platform; one among them is cloud-based integrated platforms that scale with data science teams. AWS provides a full-stack of services to establish an MLOps platform in the cloud that is customizable to your needs while reaping all the benefits of doing ML in the cloud.

In this post, we share the story of how The Weather Company (TWCo) enhanced its MLOps platform using services such as Amazon SageMaker, AWS CloudFormation, and Amazon CloudWatch. TWCo data scientists and ML engineers took advantage of automation, detailed experiment tracking, integrated training, and deployment pipelines to help scale MLOps effectively. TWCo reduced infrastructure management time by 90% while also reducing model deployment time by 20%.

The need for MLOps at TWCo

TWCo strives to help consumers and businesses make informed, more confident decisions based on weather. Although the organization has used ML in its weather forecasting process for decades to help translate billions of weather data points into actionable forecasts and insights, it continuously strives to innovate and incorporate leading-edge technology in other ways as well. TWCo’s data science team was looking to create predictive, privacy-friendly ML models that show how weather conditions affect certain health symptoms and create user segments for improved user experience.

TWCo was looking to scale its ML operations with more transparency and less complexity to allow for more manageable ML workflows as their data science team grew. There were noticeable challenges when running ML workflows in the cloud. TWCo’s existing Cloud environment lacked transparency for ML jobs, monitoring, and a feature store, which made it hard for users to collaborate. Managers lacked the visibility needed for ongoing monitoring of ML workflows. To address these pain points, TWCo worked with the AWS Machine Learning Solutions Lab (MLSL) to migrate these ML workflows to Amazon SageMaker and the AWS Cloud. The MLSL team collaborated with TWCo to design an MLOps platform to meet the needs of its data science team, factoring present and future growth.

Examples of business objectives set by TWCo for this collaboration are:

Achieve quicker reaction to the market and faster ML development cycles
Accelerate TWCo migration of their ML workloads to SageMaker
Improve end user experience through adoption of manage services
Reduce time spent by engineers in maintenance and upkeep of the underlying ML infrastructure

Functional objectives were set to measure the impact of MLOps platform users, including:

Improve the data science team’s efficiency in model training tasks
Decrease the number of steps required to deploy new models
Reduce the end-to-end model pipeline runtime

Solution overview

The solution uses the following AWS services:

AWS CloudFormation – Infrastructure as code (IaC) service to provision most templates and assets.
AWS CloudTrail – Monitors and records account activity across AWS infrastructure.
Amazon CloudWatch – Collects and visualizes real-time logs that provide the basis for automation.
AWS CodeBuild – Fully managed continuous integration service to compile source code, runs tests, and produces ready-to-deploy software. Used to deploy training and inference code.
AWS CodeCommit – Managed sourced control repository that stores MLOps infrastructure code and IaC code.
AWS CodePipeline – Fully managed continuous delivery service that helps automate the release of pipelines.
Amazon SageMaker – Fully managed ML platform to perform ML workflows from exploring data, training, and deploying models.
AWS Service Catalog – Centrally manages cloud resources such as IaC templates used for MLOps projects.
Amazon Simple Storage Service (Amazon S3) – Cloud object storage to store data for training and testing.

The following diagram illustrates the solution architecture.

This architecture consists of two primary pipelines:

Training pipeline – The training pipeline is designed to work with features and labels stored as a CSV-formatted file on Amazon S3. It involves several components, including Preprocess, Train, and Evaluate. After training the model, its associated artifacts are registered with the Amazon SageMaker Model Registry through the Register Model component. The Data Quality Check part of the pipeline creates baseline statistics for the monitoring task in the inference pipeline.
Inference pipeline – The inference pipeline handles on-demand batch inference and monitoring tasks. Within this pipeline, SageMaker on-demand Data Quality Monitor steps are incorporated to detect any drift when compared to the input data. The monitoring results are stored in Amazon S3 and published as a CloudWatch metric, and can be used to set up an alarm. The alarm is used later to invoke training, send automatic emails, or any other desired action.

The proposed MLOps architecture includes flexibility to support different use cases, as well as collaboration between various team personas like data scientists and ML engineers. The architecture reduces the friction between cross-functional teams moving models to production.

ML model experimentation is one of the sub-components of the MLOps architecture. It improves data scientists’ productivity and model development processes. Examples of model experimentation on MLOps-related SageMaker services require features like Amazon SageMaker Pipelines, Amazon SageMaker Feature Store, and SageMaker Model Registry using the SageMaker SDK and AWS Boto3 libraries.

When setting up pipelines, resources are created that are required throughout the lifecycle of the pipeline. Additionally, each pipeline may generate its own resources.

The pipeline setup resources are:

Training pipeline:
- SageMaker pipeline
- SageMaker Model Registry model group
- CloudWatch namespace
Inference pipeline:
- SageMaker pipeline

The pipeline run resources are:

Training pipeline:
- SageMaker model

You should delete these resources when the pipelines expire or are no longer needed.

SageMaker project template

In this section, we discuss the manual provisioning of pipelines through an example notebook and automatic provisioning of SageMaker pipelines through the use of a Service Catalog product and SageMaker project.

By using Amazon SageMaker Projects and its powerful template-based approach, organizations establish a standardized and scalable infrastructure for ML development, allowing teams to focus on building and iterating ML models, reducing time wasted on complex setup and management.

The following diagram shows the required components of a SageMaker project template. Use Service Catalog to register a SageMaker project CloudFormation template in your organization’s Service Catalog portfolio.

To start the ML workflow, the project template serves as the foundation by defining a continuous integration and delivery (CI/CD) pipeline. It begins by retrieving the ML seed code from a CodeCommit repository. Then the BuildProject component takes over and orchestrates the provisioning of SageMaker training and inference pipelines. This automation delivers a seamless and efficient run of the ML pipeline, reducing manual intervention and speeding up the deployment process.

Dependencies

The solution has the following dependencies:

Amazon SageMaker SDK – The Amazon SageMaker Python SDK is an open source library for training and deploying ML models on SageMaker. For this proof of concept, pipelines were set up using this SDK.
Boto3 SDK – The AWS SDK for Python (Boto3) provides a Python API for AWS infrastructure services. We use the SDK for Python to create roles and provision SageMaker SDK resources.
SageMaker Projects – SageMaker Projects delivers standardized infrastructure and templates for MLOps for rapid iteration over multiple ML use cases.
Service Catalog – Service Catalog simplifies and speeds up the process of provisioning resources at scale. It offers a self-service portal, standardized service catalog, versioning and lifecycle management, and access control.

Conclusion

In this post, we showed how TWCo uses SageMaker, CloudWatch, CodePipeline, and CodeBuild for their MLOps platform. With these services, TWCo extended the capabilities of its data science team while also improving how data scientists manage ML workflows. These ML models ultimately helped TWCo create predictive, privacy-friendly experiences that improved user experience and explains how weather conditions impact consumers’ daily planning or business operations. We also reviewed the architecture design that helps maintain responsibilities between different users modularized. Typically data scientists are only concerned with the science aspect of ML workflows, whereas DevOps and ML engineers focus on the production environments. TWCo reduced infrastructure management time by 90% while also reducing model deployment time by 20%.

This is just one of many ways AWS enables builders to deliver great solutions. We encourage to you to get started with Amazon SageMaker today.

About the Authors

Qaish Kanchwala is a ML Engineering Manager and ML Architect at The Weather Company. He has worked on every step of the machine learning lifecycle and designs systems to enable AI use cases. In his spare time, Qaish likes to cook new food and watch movies.

Chezsal Kamaray is a Senior Solutions Architect within the High-Tech Vertical at Amazon Web Services. She works with enterprise customers, helping to accelerate and optimize their workload migration to the AWS Cloud. She is passionate about management and governance in the cloud and helping customers set up a landing zone that is aimed at long-term success. In her spare time, she does woodworking and tries out new recipes while listening to music.

Anila Joshi has more than a decade of experience building AI solutions. As an Applied Science Manager at the AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and guides customers to strategically chart a course into the future of AI.

Kamran Razi is a Machine Learning Engineer at the Amazon Generative AI Innovation Center. With a passion for creating use case-driven solutions, Kamran helps customers harness the full potential of AWS AI/ML services to address real-world business challenges. With a decade of experience as a software developer, he has honed his expertise in diverse areas like embedded systems, cybersecurity solutions, and industrial control systems. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Shuja Sohrawardy is a Senior Manager at AWS’s Generative AI Innovation Center. For over 20 years, Shuja has utilized his technology and financial services acumen to transform financial services enterprises to meet the challenges of a highly competitive and regulated industry. Over the past 4 years at AWS, Shuja has used his deep knowledge in machine learning, resiliency, and cloud adoption strategies, which has resulted in numerous customer success journeys. Shuja holds a BS in Computer Science and Economics from New York University and an MS in Executive Technology Management from Columbia University.

Francisco Calderon is a Data Scientist at the Generative AI Innovation Center (GAIIC). As a member of the GAIIC, he helps discover the art of the possible with AWS customers using generative AI technologies. In his spare time, Francisco likes playing music and guitar, playing soccer with his daughters, and enjoying time with his family.

Eviden scales AWS DeepRacer Global League using AWS DeepRacer Event Manager

July 8, 2024

by Sathya Paduchuri Amazon AWS

Eviden is a next-gen technology leader in data-driven, trusted, and sustainable digital transformation. With a strong portfolio of patented technologies and worldwide leading positions in advanced computing, security, AI, cloud, and digital platforms, Eviden provides deep expertise for a multitude of industries in more than 47 countries. Eviden is an AWS Premier partner, bringing together 47,000 world-class talents and expanding the possibilities of data and technology across the digital continuum, now and for generations to come. Eviden is an Atos Group company with an annual revenue of over €5 billion.

We are passionate about our people improving their skills, and support the development of the next generation of cloud-centered talent. Although fundamental knowledge gained through training and certification is important, there’s no substitute for getting hands-on. We complement individual learning with hands-on opportunities, including Immersion Days, Gamedays, and using AWS DeepRacer.

AWS DeepRacer empowers users to train reinforcement learning models on the AWS Cloud and race them around a virtual track. Unlike traditional programming, where you define the desired output, AWS DeepRacer allows you to define rewards for specific behaviors, such as going faster or staying centered on the track. This hands-on experience gives learners an excellent opportunity to engage with the AWS Management Console and develop Python-based reward functions, fostering valuable skills in cloud-centered technologies and machine learning (ML).

To elevate the event experience and streamline the management of their global AWS DeepRacer series, Eviden adopted the open source AWS DeepRacer Event Manager (DREM) solution. In this post, we discuss the benefits of DREM and the experience for racers, event staff, and spectators.

Introducing AWS DeepRacer Event Manager

AWS DeepRacer Event Manager is an innovative application that Eviden has deployed within its own AWS environment. Comprised of AWS Cloud-centered services, DREM is designed to simplify the process of hosting in-person AWS DeepRacer events, while also delivering a more engaging and immersive experience for both participants and spectators.

With their prior experience hosting AWS DeepRacer events in the UK, Eviden sought to expand the reach of this exciting initiative globally. By adopting the DREM solution, Eviden’s experienced event staff in the UK were able to seamlessly support their counterparts in hosting AWS DeepRacer events for the first time in locations such as Bydgoszcz, Paris, and Pune. The DREM solution empowers Eviden to seamlessly configure and manage their global AWS DeepRacer events. Within the platform, each event location’s AWS DeepRacer cars are registered, enabling remote configuration and model uploads powered by AWS Systems Manager. Additionally, a Raspberry Pi device is registered at each location to serve as an integrated timing solution, which uses DREM’s data-driven racing capabilities to capture and report critical performance metrics for each racer, such as best lap time, average lap time, and total laps completed.

Furthermore, the deep integration between DREM’s timing solution and the event’s streaming overlay has enabled Eviden to deliver a significantly more engaging experience for both in-person attendees and remote viewers, so everyone can stay fully informed and immersed in the action throughout the event.

The following diagram illustrates the DREM architecture and components.

Racer experience

For racers, the DREM experience begins with registration, during which they are encouraged to upload their AWS DeepRacer models in advance of the event. Authentication is seamlessly handled by Amazon Cognito, with out-of-the-box support for a local Amazon Cognito identity store, as well as the flexibility to integrate with a corporate identity provider solution if required. After the registered racers have uploaded their models, DREM automatically scans for any suspicious content, quarantining it as necessary, before making the verified models available for the racing competition.

Event staff experience

For the event staff, the DREM solution greatly simplifies the process of running an AWS DeepRacer competition. User management, model uploading, and naming conventions are all handled seamlessly within the platform, eliminating any potential confusion around model ownership. To further enforce the integrity of the competition, DREM applies MD5 hashing to the uploaded models, preventing any unauthorized model sharing between racers. Additionally, the DREM interface makes it remarkably straightforward and efficient to upload multiple models to the AWS DeepRacer cars, providing a far superior experience compared to the cars’ native graphical UI. DREM also simplifies the management of the AWS DeepRacer car fleets and Raspberry Pi timing devices, empowering the event staff to remotely remove models, restart the AWS DeepRacer service, and even print user-friendly labels for the cars, making it effortless for racers to connect to them using the provided tablets.

To further streamline the event management process, DREM provides pre-built scripts that enable seamless registration of devices, which can then be fully managed remotely. The timekeeping functionality is automatically handled within DREM, using Raspberry Pi devices, pressure sensors, and pressure sensor trimming—either through custom DIY modifications or the use of the excellent Digital Racing Kings boards. The DREM timekeeping system represents a significant improvement over the alternative solutions Eviden has used in the past. It captures critical race metrics, including remaining time, all lap times, and the fastest lap. Additionally, the system provides an option to invalidate a lap, such as in cases where a car went off-track. After a racer has completed their runs, the data is securely stored in DREM, and the leaderboard is automatically updated to reflect the latest results.

Spectator experience

As a global systems integrator, Eviden was determined to deliver a truly spectacular experience, not only for the onsite participants in their AWS DeepRacer finals, but also for those in the room observing the event, as well as remote viewers racing using a proxy or simply interested in watching the competition unfold. To achieve this, Eviden took advantage of the DREM solution’s seamlessly integrated streaming overlay and leaderboard capabilities, which kept all attendees, both in-person and online, fully engaged and informed of the current standings throughout the event.

In previous AWS DeepRacer events, participants had to rely on someone in the room to verbally communicate lap times and remaining race time. However, with DREM, both the racers and spectators have instant access to all the critical timing information, keeping everyone fully up to date. This was especially beneficial for remote participants, who could now clearly see which cars were on the track and follow the progress of their fellow racers, with the streaming overlay dynamically updating to display the top positions on the leaderboard.

In addition, DREM offers a dedicated webpage displaying the complete leaderboard, which can be conveniently shown on screens in the event space, as well as allow remote attendees to follow the competition progress from other locations throughout the day.

The significant improvements to the event experience were clearly reflected in the feedback received from this year’s participants:

“Joining remotely was superb.”
“Remote interaction over teams was much improved this year.”
“The physical event was excellent, well attended, and ran very smoothly.”
“Loved it, even though I was attending remote.”
“The event itself was fantastic, with each stage being well-planned and organized. The on-site atmosphere exceeded my expectations, making it an incredible event to be part of.”

Well-architected

The DREM solution has been meticulously designed with a well-architected approach. From the event organizer’s perspective, the peace of mind of knowing that DREM is secured using AWS WAF, Amazon CloudFront, and AWS Shield Standard, with user management seamlessly handled by Amazon Cognito, is invaluable. Additionally, the platform’s role-based access control (RBAC) is managed through AWS Identity and Access Management (IAM), applying least-privilege policies for enhanced security.

The DREM solution is built entirely using AWS Cloud-centered technologies, delivering inherent performance efficiency and reliability. When Eviden is not actively hosting events, there is minimal ongoing activity in the DREM environment, consisting primarily of standing items such as AWS WAF rules, CloudFront distributions, Amazon Simple Storage Service (Amazon S3) buckets, Amazon DynamoDB tables, and Systems Manager fleet configurations. However, during event periods, DREM seamlessly scales to use AWS Lambda, Amazon EventBridge, AWS Step Functions, and other serverless services, dynamically meeting the demands of the event hosting requirements.

Owing to the well-architected nature of the DREM solution, the platform is remarkably cost-effective. When not actively hosting events, the DREM environment incurs a minimal cost of around $6–8 per month, the majority of which is attributed to the AWS WAF protection. During event periods, the costs increase based on the number of users and models uploaded, but typically only rise to around $15 per month. To further optimize ongoing costs, DREM incorporates measures such as an Amazon S3 lifecycle policy that automatically removes uploaded models after a period of 2 weeks.

Conclusion

Are you interested in elevating your own AWS DeepRacer events and delivering a more engaging experience for your participants? We encourage you to explore the AWS DeepRacer Event Manager solution and see how it can transform your event management process.

To get started, visit the GitHub repo to learn more about the solution’s features and architecture. You can also reach out to the Eviden team or your local AWS Solutions Architect to discuss how DREM can be tailored to your specific event requirements.

Don’t miss out on the opportunity to take your AWS DeepRacer initiatives to the next level. Explore DREM and join us at an upcoming AWS DeepRacer event today!

About the authors

Sathya Paduchuri is a Senior Partner Solution Architect(PSA) at Amazon Web Services. Sathya helps partners run optimised workloads on AWS, build and develop their cloud practice(s) and develop new offerings.

Mark Ross is a Chief Architect at Eviden and has specialised in AWS for the past 8 years, gaining and maintaining all AWS certifications since 2021. Mark is passionate about helping customers build, migrate to and exploit AWS. Mark has created and grown a large AWS community within Eviden.

Generate unique images by fine-tuning Stable Diffusion XL with Amazon SageMaker

July 8, 2024

by Alen Zograbyan Amazon AWS

Stable Diffusion XL by Stability AI is a high-quality text-to-image deep learning model that allows you to generate professional-looking images in various styles. Managed versions of Stable Diffusion XL are already available to you on Amazon SageMaker JumpStart (see Use Stable Diffusion XL with Amazon SageMaker JumpStart in Amazon SageMaker Studio) and Amazon Bedrock (see Stable Diffusion XL in Amazon Bedrock), allowing you to produce creative content in minutes. The base version of Stable Diffusion XL 1.0 assists with the creative process using generic subjects in the image, which enables use cases such as game character design, creative concept generation, film storyboarding, and image upscaling. However, for use cases that require generating images with a unique subject, you can fine-tune Stable Diffusion XL with a custom dataset by using a custom training container with Amazon SageMaker. With this personalized image generation model, you can incorporate your custom subject into the powerful image generation process that is provided by the Stable Diffusion XL base model.

In this post, we provide step-by-step instructions to create a custom, fine-tuned Stable Diffusion XL model using SageMaker to generate unique images. This automated solution helps you get started quickly by providing all the code and configuration necessary to generate your unique images—all you need is images of your subject. This is useful for use cases across various domains such as media and entertainment, games, and retail. Examples include using your custom subject for marketing material for film, character creation for games, and brand-specific images for retail. To explore more AI use cases, visit the AI Use Case Explorer.

Solution overview

The solution is composed of three logical parts:

The first part creates a Docker container image with the necessary framework and configuration for the training container.
The second part uses the training container to perform model training on your dataset, and outputs a fine-tuned custom Low-Rank Adaptation (LoRA) model. LoRA is an efficient fine-tuning method that doesn’t require adjusting the base model parameters. Instead, it adds a smaller number of parameters that are applied to the base model temporarily.
The third part takes the fine-tuned custom model and allows you to generate creative and unique images.

The following diagram illustrates the solution architecture.

The workflow to create the training container consists of the following services:

SageMaker uses Docker containers throughout the ML lifecycle. SageMaker is flexible and allows you to bring your own container to use for model development, training, and inference. For this post, we build a custom container with the appropriate dependencies that will perform the fine-tuning.
Kohya SS is a framework that allows you to train Stable Diffusion models. Kohya SS works with different host environments. This solution uses the Docker on Linux environment option. Kohya SS can be used with a GUI. However, this solution uses the equivalent GUI parameters as a pre-configured TOML file to automate the entire Stable Diffusion XL fine-tuning process.
AWS CodeCommit is a fully managed source control service that hosts private Git repositories. We use CodeCommit to store the code that is necessary to build the training container (Dockerfile, buildspec.yml), and the training script (train.py) that is invoked when model training is initiated.
Amazon EventBridge is a serverless event bus, used to receive, filter, and route events. EventBridge captures any changes to the CodeCommit repository files, and invokes a new Docker container image to be built.
Amazon Elastic Container Registry (Amazon ECR) is a fully managed container hosting registry. We use it to store the custom training container image.
AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces deployable software packages. We use it to build the custom training container image. CodeBuild then pushes this image to Amazon ECR.

Various methods exist to fine-tune your model. Compared to methods that require training a new full model, the LoRA fine-tuning method doesn’t modify the original model. Instead, think of it as a layer on top of the base model. Not having to train and produce a full model for each subject has its advantages. This lowers the compute requirements for training, reduces the storage size of the models, and decreases the training time required, making the process more cost-effective at scale. In this post, we demonstrate how to create a LoRA model, based on the Stable Diffusion XL 1.0 base model, using your own subject.

The training workflow uses the following services and features:

Amazon Simple Storage Service (Amazon S3) is a highly durable and scalable object store. Your custom dataset and configuration file will be uploaded to Amazon S3, and then retrieved by the custom Docker container to train on those images.
Amazon SageMaker Model Training is a feature of SageMaker that allows you to standardize and manage your training jobs at scale, without the need to manage infrastructure. When the container starts up as part of a training job, the train.py file is invoked. When the training process is complete, the output model that resides in the /opt/ml/model directory is automatically uploaded to the S3 bucket specified in the training job configuration.
Amazon SageMaker Pipelines is a workflow orchestration service that allows you to automate ML processes, from data preprocessing to model monitoring. This allows you to initiate a training pipeline, taking in as input the Amazon S3 location of your dataset and configuration file, ECR container image, and infrastructure specifications for the training job.

Now you’re ready to prompt your fine-tuned model to generate unique images. SageMaker gives you the flexibility to bring your own container for inference. You can use SageMaker hosting services with your own custom inference container to configure an inference endpoint. However, to demonstrate the Automatic1111 Stable Diffusion UI, we show you how to run inference on an Amazon Elastic Compute Cloud (Amazon EC2) instance (or locally on your own machine).

This solution fully automates the creation of a fine-tuned LoRA model with Stable Diffusion XL 1.0 as the base model. In the following sections, we discuss how to satisfy the prerequisites, download the code, and use the Jupyter notebook in the GitHub repository to deploy the automated solution using an Amazon SageMaker Studio environment.

The code for this end-to-end solution is available in the GitHub repository.

Prerequisites

This solution has been tested in the AWS Region us-west-2, but applies to any Region where these services are available. Make sure you have the following prerequisites:

Download the necessary code in SageMaker Studio

In this section, we walk through the steps to download the necessary code in SageMaker Studio and set up your notebook.

Navigate to the terminal in SageMaker Studio JupyterLab

Complete the following steps to open the terminal:

Log in to your AWS account and open the SageMaker Studio console.
Select your user profile and choose Open Studio to open SageMaker Studio.
Choose JupyterLab to open the JupyterLab application. This environment is where you will run the commands.
If you already have a space created, choose Run to open the space.
If you don’t have a space, choose Create JupyterLab space. Enter a name for the space and choose Create space. Leave the default values and choose Run space.
When the environment shows a status of Running, choose Open JupyterLab to open the new space.
In the JupyterLab Launcher window, choose Terminal.

Download the code to your SageMaker Studio environment

Run the following commands from the terminal. For this post, you check out just the required directories of the GitHub repo (so you don’t have to download the entire repository).

git clone --no-checkout https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/
git sparse-checkout set use-cases/text-to-image-fine-tuning
git checkout

If successful, you should see the output Your branch is up to date with 'origin/main'.

Open the notebook in SageMaker Studio JupyterLab

Complete the following steps to open the notebook:

In JupyterLab, choose File Browser in the navigation pane.
Navigate to the project directory named amazon-sagemaker-examples/use-cases/text-to-image-fine-tuning.
Open the Jupyter notebook named kohya-ss-fine-tuning.ipynb.
Choose your runtime kernel (it’s set to use Python 3 by default).
Choose Select.

You now have a kernel that is ready to run commands. In the following steps, we use this notebook to create the necessary resources.

Train a custom Stable Diffusion XL model

In this section, we walk through the steps to train a custom Stable Diffusion XL model.

Set up AWS infrastructure with AWS CloudFormation

For your convenience, an AWS CloudFormation template has been provided to create the necessary AWS resources. Before you create the resources, configure AWS Identity and Access Management (IAM) permissions for your SageMaker IAM role. This role is used by the SageMaker environment, and grants permissions to run certain actions. As with all permissions, make sure you follow the best practice of only granting the permissions necessary to perform your tasks.

On the IAM console, choose Roles in the navigation pane.
Choose the role named AmazonSageMaker-ExecutionRole-<id>. This should be the role that is assigned to your domain.
In the Permissions policies section, choose the policy named AmazonSageMaker-ExecutionPolicy-<id>.
Choose Edit to edit the customer managed policy.
Add the following permissions to the policy, then choose Next.
Choose Save changes to confirm your added permissions.

You now have the proper permissions to run commands in your SageMaker environment.

Navigate back to your notebook named kohya-ss-fine-tuning.ipynb in your JupyterLab environment.
In the notebook step labeled Step One – Create the necessary resources through AWS CloudFormation, run the code cell to create the CloudFormation stack.

Wait for the CloudFormation stack to finish creating before moving on. You can monitor the status of the stack creation on the AWS CloudFormation console. This step should take about 2 minutes.

Set up your custom images and fine-tuning configuration file

In this section, you first upload your fine-tuning configuration file to Amazon S3. The configuration file is specific to the Kohya program. Its purpose is to specify the configuration settings programmatically rather than manually using the Kohya GUI.

This file is provided with opinionated values. You can modify the configuration file with different values if desired. For information about what the parameters mean, refer to LoRA training parameters. You will need to experiment to achieve the desired result. Some parameters rely on underlying hardware and GPU (for example, mixed_precision=bf16 or xformers). Make sure your training instance has the proper hardware configuration to support the parameters you select.

You also need to upload a set of images to Amazon S3. If you don’t have your own dataset and decide to use images from public sources, make sure to adhere to copyright and license restrictions.

The structure of the S3 bucket is as follows:

bucket/0001-dataset/kohya-sdxl-config.toml

bucket/0001-dataset/<asset-folder-name>/ (images and caption files go here)

bucket/0002-dataset/kohya-sdxl-config.toml

bucket/0002-dataset/<asset-folder-name>/ (images and captions files go here)

...

The asset-folder-name uses a special naming convention, which is defined later in this post. Each xxxx-dataset prefix can contain separate datasets with different config file contents. Each pipeline takes a single dataset as input. The config file and asset folder will be downloaded by the SageMaker training job during the training step.

Complete the following steps:

Navigate back to your notebook named kohya-ss-fine-tuning.ipynb in your JupyterLab environment.
In the Notebook step labeled Step Two – Upload the fine-tuning configuration file, run the code cell to upload the config file to Amazon S3.
Verify that you have an S3 bucket named sagemaker-kohya-ss-fine-tuning-<account id>, with a 0001-dataset prefix containing the kohya-sdxl-config.tomlfile.

Next, you create an asset folder and upload your custom images and caption files to Amazon S3. The asset-folder-name must be named according to the required naming convention. This naming convention is what defines the number of repetitions and the trigger word for the prompt. The trigger word is what identifies your custom subject. For example, a folder name of 60_dwjz signifies 60 repetitions with the trigger prompt word dwjz. Consider using initials or abbreviations of your subject for the trigger word so it doesn’t collide with existing words. For example, if your subject is a tiger, you could use the trigger word tgr. More repetitions don’t always translate to better results. Experiment to achieve your desired result.

On the S3 console, navigate to the bucket named sagemaker-kohya-ss-fine-tuning-<account id>.
Choose the prefix named 0001-dataset.
Choose Create folder.
Enter a folder name for your assets using the naming convention (for example, 60_dwjz) and choose Create folder.
Choose the prefix. This is where your images and caption files go.
Choose Upload.
Choose Add files, choose your image files, then choose Upload.

When selecting images to use, favor quality over quantity. Some preprocessing of your image assets might be beneficial, such as cropping a person if you are fine-tuning a human subject. For this example, we used approximately 30 images for a human subject with great results. Most of them were high resolution, and cropped to include the human subject only—head and shoulders, half body, and full body images were included but not required.

Optionally, you can use caption files to assist your model in understanding your prompts better. Caption files have the .caption extension, and its contents describe the image (for example, dwjz wearing a vest and sunglasses, serious facial expression, headshot, 50mm). The image file names should match the corresponding (optional) caption file names. Caption files are highly encouraged. Upload your caption files to the same prefix as your images.

At the end of your upload, your S3 prefix structure should look similar to the following:

bucket/0001-dataset/kohya-sdxl-config.toml

bucket/0001-dataset/60_dwjz/

bucket/0001-dataset/60_dwjz/1.jpg

bucket/0001-dataset/60_dwjz/1.caption

bucket/0001-dataset/60_dwjz/2.jpg

bucket/0001-dataset/60_dwjz/2.caption

...

There are many variables to fine-tuning, and as of this writing there are no definitive recommendations for generating great results. To achieve good results, include enough steps in the training, good resolution assets, and enough images.

Set up the required code

The code required for this solution is provided and will be uploaded to the CodeCommit repository that was created by the CloudFormation template. This code is used to build the custom training container. Any updates to the code in this repository will invoke the container image to be built and pushed to Amazon ECR through an EventBridge rule.

The code consists of the following components:

buildspec.yml – Creates the container image by using the GitHub repository for Kohya SS, and pushes the training image to Amazon ECR
Dockerfile – Used to override the Dockerfile in the Kohya SS project, which is slightly modified to be used with SageMaker training
train.py – Initiates the Kohya SS program to do the fine-tuning, and is invoked when the SageMaker training job runs

Complete the following steps to create the training container image:

Navigate back to your notebook named kohya-ss-fine-tuning.ipynb in your JupyterLab environment.
In the step labeled Step Three – Upload the necessary code to the AWS CodeCommit repository, run the code cell to upload the required code to the CodeCommit repository.

This event will initiate the process that creates the training container image and uploads the image to Amazon ECR.

On the CodeBuild console, locate the project named kohya-ss-fine-tuning-build-container.

Latest build status should display as In progress. Wait for the build to finish and the status to change to Succeeded. The build takes about 15 minutes.

A new training container image is now available in Amazon ECR. Every time you make a change to the code in the CodeCommit repository, a new container image will be created.

Initiate the model training

Now that you have a training container image, you can use SageMaker Pipelines with a training step to train your model. SageMaker Pipelines enables you to build powerful multi-step pipelines. There are many step types provided for you to extend and orchestrate your workflows, allowing you to evaluate models, register models, consider conditional logic, run custom code, and more. The following steps are used in this pipeline:

Condition step – Evaluate input parameters. If successful, proceed with the training step. If not successful, proceed with the fail step. This step validates that the training volume size is at least 50 GB. You could extend this logic to only allow specific instance types, to only allow specific training containers, and add other guardrails if applicable.
Training Step – Run a SageMaker training job, given the input parameters.
Fail step – Stop the pipeline and return an error message.

Complete the following steps to initiate model training:

On the SageMaker Studio console, in the navigation pane, choose Pipelines.
Choose the pipeline named kohya-ss-fine-tuning-pipeline.
Choose Create to create a pipeline run.
Enter a name, description (optional), and any desired parameter values.
You can keep the default settings of using the 0001-dataset for the input data and an ml.g5.8xlarge instance type for training.
Choose Create to invoke the pipeline.

Choose the current pipeline run to view its details.
In the graph, choose the pipeline step named TrainNewFineTunedModel to access the pipeline run information.

The Details tab displays metadata, logs, and the associated training job. The Overview tab displays the output model location in Amazon S3 when training is complete (note this Amazon S3 location for use in later steps). SageMaker processes the training output by uploading the model in the /opt/ml/model directory of the training container to Amazon S3, in the location specified by the training job.

Wait for the pipeline status to show as Succeeded before proceeding to the next step.

Run inference on a custom Stable Diffusion XL model

There are many options for model hosting. For this post, we demonstrate how to run inference with Automatic1111 Stable Diffusion web UI running on an EC2 instance. This tool enables you to use various image generation features through a user interface. It’s a straightforward way to learn the parameters available in a visual format and experiment with supplementary features. For this reason, we demonstrate using this tool as part of this post. However, you can also use SageMaker to host an inference endpoint, and you have the option to use your own custom inference container.

Install the Automatic1111 Stable Diffusion web UI on Amazon EC2

Complete the following steps to install the web UI:

Create an EC2 Windows instance and connect to it. For instructions, see Get started with Amazon EC2.
Choose Windows Server 2022 Base Amazon Machine Image, a g5.8xlarge instance type, a key pair, and 100 GiB of storage. Alternatively, you can use your local machine.
Install NVIDIA drivers to enable the GPU. This solution has been tested with the Data Center Driver for Windows version 551.78.
Install the Automatic1111 Stable Diffusion web UI using the instructions in the Automatic Installation on Windows section in the GitHub repo. This solution has been tested with version 1.9.3. The last step of installation will ask you to run webui-user.bat, which will install and launch the Stable Diffusion UI in a web browser.

Download the Stable Diffusion XL 1.0 Base model from Hugging Face.
Move the downloaded file sd_xl_base_1.0.safetensors to the directory ../stable-diffusion-webui/models/Stable-diffusion/.
Scroll to the bottom of the page and choose Reload UI.
Choose sd_xl_base_1.0.safetensors on the Stable Diffusion checkpoint dropdown menu.
Adjust the default Width and Height values to 1024 x 1024 for better results.
Experiment with the remaining parameters to achieve your desired result. Specifically, try adjusting the settings for Sampling method, Sampling steps, CFG Scale, and Seed.

The input prompt is extremely important to achieve great results. You can add extensions to assist with your creative workflow. This style selector extension is great at supplementing prompts.

To install this extension, navigate to the Extensions tab, choose Install from URL, enter the style selector extension URL, and choose Install.
Reload the UI for changes to take effect.

You will notice a new section called SDXL Styles, which you can select from to add to your prompts.

Download the fine-tuned model that was created by the SageMaker pipeline training step.

The model is stored in Amazon S3 with the file name model.tar.gz.

You can use the Share with a presigned URL option to share as well.

Unzip the contents of the model.tar.gz file (twice) and copy the custom_lora_model.safetensors LoRA model file to the directory ../stable-diffusion-webui/models/Lora.
Choose the Refresh icon on the Lora tab to verify that your custom_lora_model is available.

Choose custom_lora_model, and it will populate the prompt input box with the text <lora:custom_lora_model:1>.
Append a prompt to the text (see examples in the next section).
You can decrease or increase the multiplier of your LoRA model by changing the 1 value. This adjusts the influence of your LoRA model accordingly.
Choose Generate to run inference against your fine-tuned LoRA model.

Example results

These results are from a fine-tuned model trained on 39 high-resolution images of the author, using the provided code and configuration files in this solution. Caption files were written for each of these images, using the trigger word aallzz.

	Prompt: concept art <lora:custom_lora_model:1.0> aallzz professional headshot, cinematic, bokeh, dramatic lighting, shallow depth of field, vignette, highly detailed, high budget, 8k, cinemascope, moody, epic, gorgeous, digital artwork, illustrative, painterly, matte painting Negative Prompt: photo, photorealistic, realism, anime, abstract, glitch Sampler: DPM2 Sampling Steps: 90 CFG Scale: 8.5 Width/Height: 1024×1024
	Prompt: cinematic film still <lora:custom_lora_model:1> aallzz eating a burger, cinematic, bokeh, dramatic lighting, shallow depth of field, vignette, highly detailed, high budget, cinemascope, moody, epic, gorgeous, film grain, grainy Negative Prompt: anime, cartoon, graphic, painting, graphite, abstract, glitch, mutated, disfigured Sampler: DPM2 Sampling Steps: 70 CFG Scale: 8 Width/Height: 1024×1024
	Prompt: concept art <lora:custom_lora_model:1> aallzz 3D profile picture avatar, vector icon, character, mountain background, sun backlight, digital artwork, illustrative, painterly, matte painting, highly detailed Negative Prompt: photo, photorealistic, realism, glitch, mutated, disfigured, glasses Sampler: DPM2 Sampling Steps: 100 CFG Scale: 9 Width/Height: 1024×1024
	Prompt: concept art <lora:custom_lora_model:1> aallzz 3D profile picture avatar, vector icon, vector illustration, vector art, realistic cartoon character, professional attire, digital artwork, illustrative, painterly, matte painting, highly detailed Negative Prompt: photo, photorealistic, realism, glitch, mutated, disfigured, glasses, hat Sampler: DPM2 Sampling Steps: 100 CFG Scale: 10 Width/Height: 1024×1024
	Prompt: cinematic photo <lora:custom_lora_model:1> aallzz portrait, sitting, magical elephant with large tusks, wearing safari clothing, majestic scenery in the background, river, natural lighting, 50mm, highly detailed, photograph, film, bokeh, professional, 4k, highly detailed Negative Prompt: drawing, painting, crayon, sketch, graphite, impressionist, noisy, blurry, soft, deformed, glitch, mutated, disfigured, glasses, hat Sampler: DPM2 Sampling Steps: 100 CFG Scale: 9.5 Width/Height: 1024×1024

Clean up

To avoid incurring charges, delete the resources you created as part of this solution:

Delete the objects in your S3 bucket. You must delete the objects before deleting the stack.
Delete your container image in Amazon ECR. You must delete the image before deleting the stack.
On the AWS CloudFormation console, delete the stack named kohya-ss-fine-tuning-stack.
If you created an EC2 instance for running inference, stop or delete the instance.
Stop or delete your SageMaker Studio instances, applications, and spaces.

Conclusion

Congratulations! You have successfully fine-tuned a custom LoRA model to be used with Stable Diffusion XL 1.0. We created a custom training Docker container, fine-tuned a custom LoRA model to be used with Stable Diffusion XL, and used the resulting model to generate creative and unique images. The end-to-end training solution was fully automated with a CloudFormation template to help you get started quickly. Now, try creating a custom model with your own subject. To explore more AI use cases, visit the AI Use Case Explorer.

About the Author

Alen Zograbyan is a Sr. Solutions Architect at Amazon Web Services. He currently serves media and entertainment customers, and has expertise in software engineering, DevOps, security, and AI/ML. He has a deep passion for learning, teaching, and photography.

Introducing the Amazon Trusted AI Challenge

July 8, 2024

by Amazon AWS

University students will compete for cash prizes in a competition to securely advance LLMs that code.Read More

Generative AI-powered apps transform business as usual

­New features and capabilities supercharge Amazon Bedrock—speeding development of generative AI apps

New partners and trainings help customers along the AI journey

About the author

Launched new tools and capabilities to build and scale generative AI safely, supported by adversarial style testing (i.e., red teaming)

Introduced watermarking to enable users to determine if visual content is AI-generated

Promoted collaboration among companies and governments regarding trust and safety risks

Used AI as a force for good to address society’s greatest challenges and supported initiatives that foster education

Delivering groundbreaking innovation with trust at the forefront

About the author

Solution overview

Prerequisites

Prepare the data

Fine-tune the model

Run the fine-tuning job on the Amazon Bedrock console

Run the fine-tuning job using the Amazon Bedrock API

Deploy and evaluate the fine-tuned model

Deploy the fine-tuned model using the Amazon Bedrock console

Deploy the fine-tuned model using the Amazon Bedrock API

Conclusion

About the Authors

Using pre-optimized models in SageMaker JumpStart

Deploying a pre-optimized model with the SageMaker Python SDK

Using the inference optimization toolkit to create custom optimizations

Compilation from SageMaker JumpStart

Compilation using the SageMaker Python SDK

Quantization and speculative decoding from SageMaker JumpStart

Quantization and speculative decoding using the SageMaker Python SDK

Conclusion

About the Authors

Benefits of the inference optimization toolkit

Speculative decoding

Quantization

Compilation

Pricing

Conclusion

About the authors

Limitations of LLM evaluations

S&P AI Benchmarks by Kensho

The evaluation tasks

Quantitative reasoning

Quantity extraction

Domain knowledge

Anthropic Claude 3.5 Sonnet on Amazon Bedrock

Get started with Anthropic Claude 3.5 Sonnet on Amazon Bedrock

Conclusion

About the authors

The need for MLOps at TWCo

Solution overview

SageMaker project template

Dependencies

Conclusion

About the Authors

Introducing AWS DeepRacer Event Manager

Racer experience

Event staff experience

Spectator experience

Well-architected

Conclusion

About the authors

Solution overview

Prerequisites

Download the necessary code in SageMaker Studio

Navigate to the terminal in SageMaker Studio JupyterLab

Download the code to your SageMaker Studio environment

Open the notebook in SageMaker Studio JupyterLab

Train a custom Stable Diffusion XL model

Set up AWS infrastructure with AWS CloudFormation

Set up your custom images and fine-tuning configuration file

Set up the required code

Initiate the model training

Run inference on a custom Stable Diffusion XL model

Install the Automatic1111 Stable Diffusion web UI on Amazon EC2

Example results

Clean up

Conclusion

About the Author

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

New features and capabilities supercharge Amazon Bedrock—speeding development of generative AI apps