Building Generative AI and ML solutions faster with AI apps from AWS partners using Amazon SageMaker

Organizations of every size and across every industry are looking to use generative AI to fundamentally transform the business landscape with reimagined customer experiences, increased employee productivity, new levels of creativity, and optimized business processes. A recent study by Telecom Advisory Services, a globally recognized research and consulting firm that specializes in economic impact studies, shows that cloud-enabled AI will add more than $1 trillion to global GDP from 2024 to 2030.

Organizations are looking to accelerate the process of building new AI solutions. They use fully managed services such as Amazon SageMaker AI to build, train, and deploy generative AI models. Oftentimes, they also want to integrate their choice of purpose-built AI development tools to build their models on SageMaker AI.

However, the process of identifying appropriate applications is complex and demanding, requiring significant effort to make sure that the selected application meets an organization’s specific business needs. Deploying, upgrading, managing, and scaling the selected application also demands considerable time and effort. To adhere to rigorous security and compliance protocols, organizations also need their data to stay within the confines of their security boundaries without the need to store it in a software as a service (SaaS) provider-owned infrastructure.

This increases the time it takes for customers to go from data to insights. Our customers want a simple and secure way to find the best applications, integrate the selected applications into their machine learning (ML) and generative AI development environment, and manage and scale their AI projects.

Introducing Amazon SageMaker partner AI apps

Today, we’re excited to announce that AI apps from AWS Partners are now available in SageMaker. You can now find, deploy, and use these AI apps privately and securely, all without leaving SageMaker AI, so you can develop performant AI models faster.

Industry-leading app providers

The first group of partners and applications—shown in the following figure—that we’re including are Comet and its model experiment tracking application, Deepchecks and its large language model (LLM) quality and evaluation application, Fiddler and its model observability application, and Lakera and its AI security application.

Managed and secure

These applications are fully managed by SageMaker AI, so customers don’t have to worry about provisioning, scaling, and maintaining the underlying infrastructure. SageMaker AI makes sure that sensitive data stays completely within each customer’s SageMaker environment and will never be shared with a third party.

Available in SageMaker AI and SageMaker Unified Studio (preview)

Data scientists and ML engineers can access these applications from Amazon SageMaker AI (formerly known as Amazon SageMaker) and from SageMaker Unified Studio. This capability enables data scientists and ML engineers to seamlessly access the tools they require, enhancing their productivity and accelerating the development and deployment of AI products. It also empowers data scientists and ML engineers to do more with their models by collaborating seamlessly with their colleagues in data and analytics teams.

Seamless workflow integration

Direct integration with SageMaker AI provides a smooth user experience, from model building and deployment to ongoing production monitoring, all within your SageMaker development environment. For example, a data scientist can run experiments in their SageMaker Studio or SageMaker Unified Studio Jupyter notebook and then use the Comet ML app for visualizing and comparing those experiments.
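
As an illustration of that handoff, the following minimal sketch logs a notebook training run to Comet using the comet_ml Python SDK. The project name, parameters, and metric values are placeholders, and credential setup for your Comet deployment is assumed to already be in place.

# Minimal sketch: log a training run from a SageMaker notebook to Comet.
# Project name, parameters, and metrics are placeholders; Comet credentials
# are assumed to be configured for your deployment.
import comet_ml

experiment = comet_ml.Experiment(project_name="demand-forecasting")

# Values produced by your SageMaker training code
experiment.log_parameters({"learning_rate": 0.01, "max_depth": 6})
for epoch, auc in enumerate([0.81, 0.84, 0.86]):
    experiment.log_metric("validation_auc", auc, step=epoch)

experiment.end()

You can then open the Comet app from SageMaker Studio to visualize and compare this run against earlier experiments.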

Streamlined access

Use AWS credits to use partner apps without navigating lengthy procurement or approval processes, accelerating adoption and scaling of AI observability.

Application deep dive

The integration of these AI apps within SageMaker Studio enables you to build AI models and solutions without leaving your SageMaker development environment. Let’s take a look at the initial group of apps launched at re:Invent 2024.

Comet

Comet provides an end-to-end model evaluation solution for AI developers with best-in-class tooling for experiment tracking and model production monitoring. Comet has been trusted by enterprise customers and academic teams since 2017. Within SageMaker Studio, Notebooks and Pipelines, data scientists, ML engineers, and AI researchers can use Comet’s robust tracking and monitoring capabilities to oversee model lifecycles from training through production, bringing transparency and reproducibility to ML workflows.

You can access the Comet UI directly from SageMaker Studio and SageMaker Unified Studio without the need to provide additional credentials. The app infrastructure is deployed, managed, and supported by AWS, providing a holistic experience and seamless integration. This means each Comet deployment through SageMaker AI is securely isolated and provisioned automatically. You can seamlessly integrate Comet’s advanced tools without altering your existing SageMaker AI workflows. To learn more, visit Comet.

Deepchecks

Deepchecks specializes in LLM evaluation. Their validation capabilities include automatic scoring, version comparison, and auto-calculated metrics for properties such as relevance, coverage, and grounded-in-context. These capabilities enable organizations to rigorously test, monitor, and improve their LLM applications while maintaining complete data sovereignty.

Deepchecks’s state-of-the-art automatic scoring capabilities for LLM applications, paired with the infrastructure and purpose-built tools provided by SageMaker AI for each step of the ML and FM lifecycle, make it possible for AI teams to improve their models’ quality and compliance.

Starting today, organizations using AWS can immediately work with Deepchecks’s LLM evaluation tools in their environment, minimizing security and privacy concerns because data remains fully contained within their AWS environments. This integration also removes the overhead of onboarding a third-party vendor, because legal and procurement aspects are streamlined by AWS. To learn more, visit Deepchecks.

Fiddler AI

The Fiddler AI Observability solution allows data science, engineering, and line-of-business teams to validate, monitor, analyze, and improve ML models deployed on SageMaker AI.

With Fiddler’s advanced capabilities, users can track model performance, monitor for data drift and integrity, and receive alerts for immediate diagnostics and root cause analysis. This proactive approach allows teams to quickly resolve issues, continuously improving model reliability and performance. To learn more, visit Fiddler.

Lakera

Lakera partners with enterprises and high-growth technology companies to unlock their generative AI transformation. Lakera’s application Lakera Guard provides real-time visibility, protection, and control for generative AI applications. By protecting sensitive data, mitigating prompt attacks, and creating guardrails, Lakera Guard makes sure that your generative AI always interacts as expected.

Starting today, you can set up a dedicated instance of Lakera Guard within SageMaker AI that ensures data privacy and delivers low-latency performance, with the flexibility to scale alongside your generative AI application’s evolving needs. To learn more, visit Lakera.

See how customers are using partner apps

“The AI/ML team at Natwest Group leverages SageMaker and Comet to rapidly develop customer solutions, from swift fraud detection to in-depth analysis of customer interactions. With Comet now a SageMaker partner app, we streamline our tech and enhance our developers’ workflow, improving experiment tracking and model monitoring. This leads to better results and experiences for our customers.”
– Greig Cowan, Head of AI and Data Science, NatWest Group.

“Amazon SageMaker plays a pivotal role in the development and operation of Ping Identity’s homegrown AI and ML infrastructure. The SageMaker partner AI apps capability will enable us to deliver faster, more effective ML-powered functionality to our customers as a private, fully managed service, supporting our strict security and privacy requirements while reducing operational overhead.”
– Ran Wasserman, Principal Architect, Ping Identity.

Start building with AI apps from AWS partners

Amazon SageMaker AI provides access to a highly curated selection of apps from industry-leading providers that are designed and certified to run natively and privately on SageMaker AI. Data scientists and developers can quickly find, deploy, and use these applications within SageMaker AI and the new unified studio to accelerate their ML and generative AI model building journey.

You can access all available SageMaker partner AI apps directly from SageMaker AI and SageMaker Unified Studio. Click through to view a specific app’s functionality, licensing terms, and estimated costs for deployment. After subscribing, you can configure the infrastructure that your app will run on by selecting a deployment tier and additional configuration parameters. After the app finishes the provisioning process, you will be able to assign access to your users, who will find the app ready to use in their SageMaker Studio and SageMaker Unified Studio environments.
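
If you prefer to script the provisioning step, the following is a hedged sketch using the AWS SDK for Python (Boto3). The app type identifier, tier value, and role ARN are placeholder assumptions, so confirm the supported values in the SageMaker CreatePartnerApp API reference before use.

# Sketch: provision a partner AI app programmatically. The type string,
# tier, and role ARN are placeholder assumptions; verify supported values
# in the SageMaker API reference.
import boto3

sagemaker = boto3.client("sagemaker")

response = sagemaker.create_partner_app(
    Name="comet-experiment-tracking",
    Type="comet",                      # assumed app type identifier
    Tier="small",                      # assumed deployment tier
    ExecutionRoleArn="arn:aws:iam::111122223333:role/PartnerAppRole",
    AuthType="IAM",
)
print(response["Arn"])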


About the authors

Gwen Chen is a Senior Generative AI Product Marketing Manager at AWS. She started working on AI products in 2018. Gwen has launched an NLP-powered app building product, MLOps, generative AI-powered assistants for data integration and model building, and inference capabilities. Gwen graduated from a dual master degree program of science and business with Duke and UNC Kenan-Flagler. Gwen likes listening to podcasts, skiing, and dancing.

Naufal Mir is a Senior Generative AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy, and migrate ML workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. He enjoys ultra-endurance running and cycling.

Kunal Jha is a Senior Product Manager at AWS. He is focused on building Amazon SageMaker Studio as the IDE of choice for all ML development steps. In his spare time, Kunal enjoys skiing, scuba diving and exploring the Pacific Northwest. You can find him on LinkedIn.

Eric Peña is a Senior Technical Product Manager in the AWS Artificial Intelligence Platforms team, working on Amazon SageMaker Interactive Machine Learning. He currently focuses on IDE integrations on SageMaker Studio. He holds an MBA degree from MIT Sloan and outside of work enjoys playing basketball and football.

Arkaprava De is a manager leading the SageMaker Studio Apps team at AWS. He has been at Amazon for over 9 years and is currently working on improving the Amazon SageMaker Studio IDE experience. You can find him on LinkedIn.

Zuoyuan Huang is a Software Development Manager at AWS. He has been at Amazon for over 5 years, and has been focusing on building SageMaker Studio apps and IDE experience. You can find him on LinkedIn.

Read More

Query structured data from Amazon Q Business using Amazon QuickSight integration

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Although generative AI is fueling transformative innovations, enterprises may still experience sharply divided data silos when it comes to enterprise knowledge, in particular between unstructured content (such as PDFs, Word documents, and HTML pages), and structured data (real-time data and reports stored in databases or data lakes). Both categories of data are typically queried and accessed using separate tools, from in-product browse and search functionality for unstructured data, to business intelligence (BI) tools like Amazon QuickSight for structured content.

Amazon Q Business offers an effective solution for quickly building conversational applications over unstructured content, with over 40 data connectors to popular content and storage management systems such as Confluence, SharePoint, and Amazon Simple Storage Service (Amazon S3), to aggregate enterprise knowledge. Customers are also looking for a unified conversational experience across all their knowledge repositories, regardless of the format the content is stored and organized as.

On December 3, 2024, Amazon Q Business announced the launch of its integration with QuickSight, allowing you to quickly connect your structured sources to your Amazon Q Business applications, creating a unified conversational experience for your end-users. The QuickSight integration offers an extensive set of over 20 structured data source connectors, including Amazon Redshift, PostgreSQL, MySQL, and Oracle, enabling you to quickly expand the conversational scope of your Amazon Q Business assistants to cover a wider range of knowledge sources. For the end-users, answers are returned in real time from your structured sources, combined with other relevant information found in unstructured repositories. Amazon Q Business uses the analytics and advanced visualization engine in QuickSight to generate accurate and simple-to-understand answers from structured sources.

In this post, we show you how to configure the QuickSight connection from Amazon Q Business and then ask questions to get real-time data and visualizations from QuickSight for structured data in addition to unstructured content.

Solution overview

The QuickSight feature in Amazon Q Business is available on the Amazon Q Business console as well as through Amazon Q Business APIs. This feature is implemented as a plugin within Amazon Q Business. After it’s enabled, this plugin will behave differently than other Amazon Q Business plugins—it will query QuickSight automatically for every user prompt, looking for relevant answers.
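
For API-driven workflows, a prompt can also be sent programmatically. The following is a minimal sketch with the AWS SDK for Python (Boto3); the application ID and question are placeholders, and identity-aware credentials from IAM Identity Center are assumed.

# Sketch: ask a question through the Amazon Q Business API. The application
# ID and question are placeholders; identity-aware credentials are assumed.
import boto3

qbusiness = boto3.client("qbusiness")

response = qbusiness.chat_sync(
    applicationId="a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    userMessage="What were my top three AWS services by cost last month?",
)
print(response["systemMessage"])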

For AWS accounts that aren’t subscribed to QuickSight already, the Amazon Q Business admin completes the following steps:

  1. Create a QuickSight account.
  2. Connect your database in QuickSight to create a dataset.
  3. Create a topic in QuickSight, which is then used to make it searchable from your Amazon Q Business application.

When the feature is activated, Amazon Q Business will use your unstructured data sources configured in Amazon Q Business, as well as your structured content available using QuickSight, to generate a rich answer that includes narrative and visualizations. Depending on the question and data in QuickSight, Amazon Q Business may generate one or more visualizations as a response.

Prerequisites

You should have the following prerequisites:

  • An AWS account where you can follow the instructions in this post.
  • AWS IAM Identity Center set up to be used with Amazon Q Business. For more information, see Configure Amazon Q Business with AWS IAM Identity Center trusted identity propagation.
  • At least one Amazon Q Business Pro user that has admin permissions to set up and configure Amazon Q Business. For pricing information, see Amazon Q Business pricing.
  • An IAM Identity Center group that will be assigned the QuickSight Admin Pro role, for users who will manage and configure QuickSight.
  • If a QuickSight account exists, then it needs to be in the same AWS account and AWS Region as Amazon Q Business, and configured with IAM Identity Center.
  • A database that is installed and can be reached from QuickSight to load structured data (or you could create a dataset by uploading a CSV or XLS file). The database also needs credentials to create tables and insert data.
  • Sample structured data to load into the database (along with insert statements).

Create an Amazon Q Business application

To use this feature, you need to have an Amazon Q Business application. If you don’t have an existing application, follow the steps in Discover insights from Amazon S3 with Amazon Q S3 connector to create an application along with an Amazon S3 data source. Upload the unstructured documents to Amazon S3 and sync the data source.

Create and configure a new QuickSight account

You can skip this section if you already have an existing QuickSight account. To create a QuickSight account, complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. Choose Amazon QuickSight in the navigation pane.

  3. Choose Create QuickSight account.

  4. Under QuickSight account information, enter your account name and an email for account notifications.
  5. Under Assign QuickSight Admin Pro roles, choose the IAM Identity Center group you created as a prerequisite.
  6. Choose Next.

  7. Under Service access, select Create and use a new service role.
  8. Choose Authorize.

This will create a QuickSight account, assign the IAM Identity Center group as QuickSight Admin Pro, and authorize Amazon Q Business to access QuickSight.

You will see a dashboard with details for QuickSight. Currently, it will show zero datasets and topics.

  9. Choose Go to QuickSight.

You can now proceed to the next section to prepare your data.

Configure an existing QuickSight account

You can skip this section if you followed the previous steps and created a new QuickSight account.

If your current QuickSight account is not on IAM Identity Center, consider using a different AWS account without a QuickSight subscription for the purpose of testing this feature. From that account, you create an Amazon Q Business application on IAM Identity Center and go through the QuickSight integration setup steps on the Amazon Q Business console that will create the QuickSight account for you in IAM Identity Center. Remember to delete that new QuickSight account and Amazon Q Business application after your testing is done to avoid further billing.

Complete the following steps to set up the QuickSight connector from Amazon Q Business for an existing QuickSight account:

  1. On the Amazon Q Business console, navigate to your application.
  2. Choose Amazon QuickSight in the navigation pane.

  3. Choose Authorize QuickSight answers.

  4. Under Assign QuickSight Admin Pro roles, choose the IAM Identity Center group you created as a prerequisite.
  5. Under Service access, select Create and use a new service role.
  6. Choose Save.

You will see a dashboard with details for QuickSight. If you already have a dataset and topics, they will show up here.

You’re now ready to add a dataset and topics in the next section.

Add data in QuickSight

In this section, we create an Amazon Redshift data source. You can instead create a data source from the database of your choice, use files in Amazon S3, or perform a direct upload of CSV files and connect to it. Refer to Creating a dataset from a database for more details.

To configure your data, complete the following steps:

  1. Create a new dataset with Amazon Redshift as a data source.

Configuring this connection offers multiple choices; choose the one that best fits your needs.

  2. Create a topic from the dataset. For more information, see Creating a topic.

  3. Optionally, create dashboards from the topic. If created, Amazon Q Business can use them.

Ask queries to Amazon Q Business

To start chatting with Amazon Q Business, complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. Choose Amazon QuickSight in the navigation pane.

You should see the datasets and topics populated with values.

  3. Choose the link under Deployed URL.

We uploaded AWS Cost and Usage Reports for a specific AWS account into QuickSight using Amazon Redshift. We also uploaded Amazon service documentation to Amazon S3 and connected it to Amazon Q Business as an unstructured data source. We will ask questions related to our AWS costs and show how Amazon Q Business answers questions from both structured and unstructured data.

The following screenshot shows an example question that returns a response from only unstructured data.

The following screenshot shows an example question that returns a response from only structured data.

The following screenshot shows an example question that returns a response from both structured and unstructured data.

The following screenshot shows an example question that returns multiple visualizations from both structured and unstructured data.

Clean up

If you no longer want to use this Amazon Q Business feature, delete the resources you created to avoid future charges:

  1. Delete the Amazon Q Business application:
    1. On the Amazon Q Business console, choose Applications in the navigation pane.
    2. Select your application and on the Actions menu, choose Delete.
    3. Enter delete to confirm and choose Delete.

The process can take up to 15 minutes to complete.

  2. Delete the S3 bucket:
    1. Empty your S3 bucket.
    2. Delete the bucket.
  3. Delete the QuickSight account:
    1. On the Amazon QuickSight console, choose Manage Amazon QuickSight.
    2. Choose Account settings and Manage.
    3. Delete the account.
  4. Delete your IAM Identity Center instance.

Conclusion

In this post, we showed how to include answers from your structured sources in your Amazon Q Business applications, using the QuickSight integration. This creates a unified conversational experience for your end-users that saves them time, helps them make better decisions through more complete answers, and improves their productivity.

At AWS re:Invent 2024, we also announced a similar unified experience enabling access to insights from unstructured data sources in Amazon Q in QuickSight powered by Amazon Q Business.

To learn about the new capabilities Amazon Q in QuickSight provides, see QuickSight Plugin.

To learn more about Amazon Q Business, refer to the Amazon Q Business User Guide.

To learn more about configuring a QuickSight dataset, see Manage your Amazon QuickSight datasets more efficiently with the new user interface.

QuickSight also offers querying unstructured data. For more details, refer to Integrate unstructured data into Amazon QuickSight using Amazon Q Business.


About the authors

Jiten Dedhia is a Sr. AI/ML Solutions Architect with over 20 years of experience in the software industry. He has helped Fortune 500 companies with their AI/ML and generative AI needs.

Jean-Pierre Dodel is a Principal Product Manager for Amazon Q Business, responsible for delivering key strategic product capabilities including structured data support in Q Business, RAG, and overall product accuracy optimizations. He brings extensive AI/ML and enterprise search experience to the team, with over 7 years of product leadership at AWS.

Read More

Elevate customer experience by using the Amazon Q Business custom plugin for New Relic AI

Digital experience interruptions can harm customer satisfaction and business performance across industries. Application failures, slow load times, and service unavailability can lead to user frustration, decreased engagement, and revenue loss. The risk and impact of outages increase during peak usage periods, which vary by industry—from ecommerce sales events to financial quarter-ends or major product launches. According to New Relic’s 2024 Observability Forecast, businesses face a median annual downtime of 77 hours from high-impact outages. These outages can cost up to $1.9 million per hour.

New Relic is addressing these challenges by creating the New Relic AI custom plugin for Amazon Q Business. This custom plugin creates a unified solution that combines New Relic AI’s observability insights and recommendations with Amazon Q Business’s Retrieval Augmented Generation (RAG) capabilities, in a natural language interface for ease of use.

The custom plugin streamlines incident response, enhances decision-making, and reduces cognitive load from managing multiple tools and complex datasets. It empowers team members to interpret and act quickly on observability data, improving system reliability and customer experience. By using AI and New Relic’s comprehensive observability data, companies can help prevent issues, minimize incidents, reduce downtime, and maintain high-quality digital experiences.

This post explores the use case, how this custom plugin works, how it can be enabled, and how it can help elevate customers’ digital experiences.

The challenge: Resolving application problems before they impact customers

New Relic’s 2024 Observability Forecast highlights three key operational challenges:

  • Tool and context switching – Engineers use multiple monitoring tools, support desks, and documentation systems. 45% of support engineers, application engineers, and SREs use five different monitoring tools on average. This fragmentation can cause missed SLAs and SLOs, confusion during critical incidents, and increased negative fiscal impact. Tool switching slows decision-making during outages or ecommerce disruptions.
  • Knowledge accessibility – Scattered, hard-to-access knowledge, including runbooks and post-incident reports, hinders effective incident response. This can cause slow escalations, uncertain decisions, longer disruptions, and higher operational costs from redundant engineer involvement.
  • Complexity in data interpretation – Team members may struggle to interpret monitoring and observability data due to complex applications with numerous services and cloud infrastructure entities, and unclear symptom-problem relationships. This complexity hinders quick, accurate data analysis and informed decision-making during critical incidents.

The custom plugin for Amazon Q Business addresses these challenges with a unified, natural language interface for critical insights. It uses AI to research and translate findings into clear recommendations, providing quick access to indexed runbooks and post-incident reports. This custom plugin streamlines incident response, enhances decision-making, and reduces effort in managing multiple tools and complex datasets.

Solution overview

The New Relic custom plugin for Amazon Q Business centralizes critical information and actions in one interface, streamlining your workflow. It allows you to inquire about specific services, hosts, or system components directly. For instance, you can investigate a sudden spike in web service response times or a slow database. NR AI responds by analyzing current performance data and comparing it to historical trends and best practices. It then delivers detailed insights and actionable recommendations based on up-to-date production environment information.

The following diagram illustrates the workflow.

Scope of solution

When a user asks a question in the Amazon Q interface, such as “Show me problems with the checkout process,” Amazon Q queries the RAG ingested with the customers’ runbooks. Runbooks are troubleshooting guides maintained by operational teams to minimize application interruptions. Amazon Q gains contextual information, including the specific service names and infrastructure information related to the checkout service, and uses the custom plugin to communicate with New Relic AI. New Relic AI initiates a deep dive analysis of monitoring data since the checkout service problems began.

New Relic AI conducts a comprehensive analysis of the checkout service. It examines service performance metrics, forecasts of key indicators like error rates, error patterns and anomalies, security alerts, and overall system status and health. The analysis results in a summarized alert intelligence report that identifies and explains root causes of checkout service issues. This report provides clear, actionable recommendations and includes real-time application performance insights. It also offers direct links to detailed New Relic interfaces. Users can access this comprehensive summary without leaving the Amazon Q interface.

The custom plugin presents information and insights directly within the Amazon Q Business interface, eliminating the need to switch between the New Relic and Amazon Q interfaces, and enabling faster problem resolution.
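
Under the hood, a custom plugin like this is registered against the Amazon Q Business application together with an OpenAPI schema that describes the partner API. The following is a hedged sketch of that registration using Boto3; the IDs, ARNs, authentication secret, and S3 schema location are placeholder assumptions.

# Sketch: register a custom plugin with an OpenAPI schema. All IDs, ARNs,
# and S3 paths are placeholders for illustration only.
import boto3

qbusiness = boto3.client("qbusiness")

response = qbusiness.create_plugin(
    applicationId="a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    displayName="new-relic-ai",
    type="CUSTOM",
    authConfiguration={
        "oAuth2ClientCredentialConfiguration": {
            "secretArn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:nr-oauth",
            "roleArn": "arn:aws:iam::111122223333:role/QBusinessPluginRole",
        }
    },
    customPluginConfiguration={
        "description": "New Relic AI observability insights and recommendations",
        "apiSchemaType": "OPEN_API_V3",
        "apiSchema": {
            "s3": {"bucket": "my-plugin-schemas", "key": "new-relic-openapi.yaml"}
        },
    },
)
print(response["pluginId"])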

Potential impacts

The New Relic Intelligent Observability platform provides comprehensive incident response and application and infrastructure performance monitoring capabilities for SREs, application engineers, support engineers, and DevOps professionals. Organizations using New Relic report significant improvements in their operations, achieving a 65% reduction in incidents, 10 times more deployments, and 50% faster release times while maintaining 99.99% uptime. When you combine New Relic insights with Amazon Q Business, you can further reduce incidents, deploy higher-quality code more frequently, and create more reliable experiences for your customers:

  • Detect and resolve incidents faster – With this custom plugin, you can reduce undetected incidents and resolve issues more quickly. Incidents often occur when teams miss early warning signs or can’t connect symptoms to underlying problems, leading to extended service disruptions. Although New Relic collects and generates data that can identify these warning signs, teams working in separate tools might not have access to these critical insights. For instance, support specialists might not have direct access to monitoring dashboards, making it challenging to identify emerging issues. The custom plugin consolidates these monitoring insights, helping you more effectively identify and understand related issues.
  • Simplify incident management – The custom plugin enhances support engineers’ and incident responders’ efficiency by streamlining their workflow. The custom plugin allows you to manage incidents without switching between New Relic AI and Amazon Q during critical moments. The integrated interface removes context switching, enabling both technical and non-technical users to access vital monitoring data quickly within the Amazon Q interface. This comprehensive approach speeds up troubleshooting, minimizes downtime, and boosts overall system reliability.
  • Build reliability across teams – The custom plugin makes application and infrastructure performance monitoring insights accessible to team members beyond traditional observability users. It translates complex production telemetry data into clear, actionable insights for product managers, customer service specialists, and executives. By providing a unified interface for querying and resolving issues, it empowers your entire team to maintain and improve digital services, regardless of their technical expertise. For example, when a customer service specialist receives user complaints, they can quickly investigate application performance issues without navigating complex monitoring tools or interpreting alert conditions. This unified view enables everyone supporting your enterprise software to understand and act on insights about application health and performance. The result is a more collaborative approach across multiple enterprise teams, leading to more reliable system maintenance and excellent customer experiences.

Conclusion

The New Relic AI custom plugin represents a step forward in digital experience management. By addressing key challenges such as tool fragmentation, knowledge accessibility, and data complexity, this solution empowers teams to deliver superior digital experiences. This collaboration between AWS and New Relic opens up possibilities for building more robust digital infrastructures, advancing innovation in customer-facing technologies, and setting new benchmarks in proactive IT problem-solving.

To learn more about improving your operational efficiency with AI-powered observability, refer to the Amazon Q Business User Guide and explore New Relic AI capabilities. To get started on training, enroll for free Amazon Q training from AWS Training and Certification.

About New Relic

New Relic is a leading cloud-based observability platform that helps businesses optimize the performance and reliability of their digital systems. New Relic processes 3 EB of data annually. Over 5 billion data points are ingested and 2.4 trillion queries are executed every minute across 75,000 active customers. The platform serves over 333 billion web requests each day. The median platform response time is 60 milliseconds.


About the authors

 Meena Menon is a Sr. Customer Solutions Manager at AWS.

Sean Falconer is a Sr. Solutions Architect at AWS.

Nava Ajay Kanth Kota is a Senior Partner Solutions Architect at AWS. He is currently part of the Amazon Partner Network (APN) team that closely works with ISV Storage Partners. Prior to AWS, his experience includes running Storage, Backup, and Hybrid Cloud teams and his responsibilities included creating Managed Services offerings in these areas.

David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.

Camden Swita is Head of AI and ML Innovation at New Relic specializing in developing compound AI systems, agentic frameworks, and generative user experiences for complex data retrieval, analysis, and actioning.

Read More

Amazon SageMaker launches the updated inference optimization toolkit for generative AI

Today, Amazon SageMaker is excited to announce updates to the inference optimization toolkit, providing new functionality and enhancements to help you optimize generative AI models even faster. These updates build on the capabilities introduced in the original launch of the inference optimization toolkit (to learn more, see Achieve up to ~2x higher throughput while reducing costs by ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1).

The following are the key additions to the inference optimization toolkit:

  • Speculative decoding support for Meta Llama 3.1 models – The toolkit now supports speculative decoding for the latest Meta Llama 3.1 70B and 405B (FP8) text models, allowing you to accelerate the inference process.
  • Support for FP8 quantization – The toolkit has been updated to enable FP8 (8-bit floating point) quantization, helping you further optimize model size and inference latency for GPUs. FP8 offers several advantages over FP32 (32-bit floating point) for deep learning model inference, including reduced memory usage, faster computation, lower power consumption, and broader applicability because FP8 quantization can be applied to key model components like the KV cache, attention, and MLP linear layers.
  • Compilation support for TensorRT-LLM – You can now use the toolkit’s compilation capabilities to integrate your generative AI models with NVIDIA’s TensorRT-LLM, delivering enhanced performance by optimizing the model with ahead-of-time compilation. You reduce the model’s deployment time and auto scaling latency because the model weights don’t require just-in-time compilation when the model deploys to a new instance.

These updates build on the toolkit’s existing capabilities, allowing you to reduce the time it takes to optimize generative AI models from months to hours, and achieve best-in-class performance for your use case. Simply choose from the available optimization techniques, apply them to your models, validate the improvements, and deploy the models in just a few clicks through SageMaker.

In this post, we discuss these new features of the toolkit in more detail.

Speculative decoding

Speculative decoding is an inference technique that aims to speed up the decoding process of large language models (LLMs) for latency-critical applications, without compromising the quality of the generated text. The key idea is to use a smaller, less powerful, but faster language model called the draft model to generate candidate tokens. These candidate tokens are then validated by the larger, more powerful, but slower target model. At each iteration, the draft model generates multiple candidate tokens. The target model verifies the tokens, and if it finds a particular token unacceptable, it rejects it and regenerates that token itself. This allows the larger target model to focus on verification, which is faster than auto-regressive token generation. The smaller draft model can quickly generate all the tokens and send them in batches to the target model for parallel evaluation, significantly speeding up the final response generation.
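
The following toy sketch illustrates that propose-and-verify loop in plain Python. draft_propose and target_verify stand in for real draft and target model calls and return hard-coded tokens purely for illustration.

# Toy illustration of the speculative decoding loop described above.
# draft_propose and target_verify stand in for real draft/target models.
def draft_propose(prefix, k=4):
    # The small draft model quickly proposes k candidate tokens.
    return [f"draft_tok_{len(prefix) + i}" for i in range(k)]

def target_verify(prefix, candidates):
    # The large target model scores the candidates in parallel, accepts a
    # prefix of them, and supplies its own token at the first rejection.
    accepted = candidates[:3]  # pretend the fourth candidate was rejected
    return accepted + ["target_correction"]

def generate(prompt_tokens, max_tokens=12):
    output = list(prompt_tokens)
    while len(output) < max_tokens:
        candidates = draft_propose(output)
        output.extend(target_verify(output, candidates))
    return output

print(generate(["<s>"]))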

With the updated SageMaker inference toolkit, you get out-of-the-box support for speculative decoding that has been tested for performance at scale on various popular open source LLMs. The toolkit provides a pre-built draft model, eliminating the need to invest time and resources in building your own draft model from scratch. Alternatively, you can also use your own custom draft model, providing flexibility to accommodate your specific requirements. To showcase the benefits of speculative decoding, let’s look at the throughput (tokens per second) for a Meta Llama 3.1 70B Instruct model deployed on an ml.p4d.24xlarge instance using the Meta Llama 3.2 1B Instruct draft model.

Speculative decoding price

Given the increase in throughput that is realized with speculative decoding, we can also see the blended price difference when using speculative decoding vs. when not using speculative decoding. Here we have calculated the blended price as a 3:1 ratio of input to output tokens. The blended price is defined as follows, with a worked example after the list:

  • Total throughput (tokens per second) = NumberOfOutputTokensPerRequest / (ClientLatency / 1,000) x concurrency
  • Blended price ($ per 1 million tokens) = (1 − discount rate) × (instance price per hour) ÷ (total token throughput per second × 60 × 60 ÷ 10^6) ÷ 4
  • Discount rate assuming a 26% Savings Plan
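
As a worked example of these formulas (all inputs below are illustrative placeholders, not the benchmark numbers behind the following figure):

# Worked example of the blended price formulas above; every input is an
# illustrative placeholder, not a published benchmark number.
output_tokens_per_request = 250
client_latency_ms = 2_000
concurrency = 8
instance_price_per_hour = 37.69   # example hourly rate, placeholder
discount_rate = 0.26              # assumed Savings Plan discount

total_throughput = output_tokens_per_request / (client_latency_ms / 1_000) * concurrency
blended_price = ((1 - discount_rate) * instance_price_per_hour
                 / (total_throughput * 60 * 60 / 1e6)) / 4

print(f"Throughput: {total_throughput:.0f} tokens/s")
print(f"Blended price: ${blended_price:.2f} per 1M tokens")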

Speculative Decoding price

Quantization

Quantization is one of the most popular model compression methods to accelerate model inference. From a technical perspective, quantization has several benefits:

  • It reduces model size, which makes it suitable for deploying using fewer GPUs with lower total device memory available.
  • It reduces memory bandwidth pressure by using fewer-bit data types.
  • It offers increased space for the KV cache. This enables larger batch sizes and sequence lengths.
  • It significantly speeds up matrix multiplication (GEMM) operations on the NVIDIA architecture, for example, up to twofold for FP8 compared to the FP16/BF16 data type in microbenchmarks.

With this launch, the SageMaker inference optimization toolkit now supports FP8 and SmoothQuant (TensorRT-LLM only) quantization. SmoothQuant is a post-training quantization (PTQ) technique for LLMs that reduces memory and speeds up inference without sacrificing accuracy. It migrates quantization difficulty from activations to weights, which are easier to quantize. It does this by introducing a hyperparameter to calculate a per-channel scale that balances the quantization difficulty of activations and weights.
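
For intuition, the following NumPy sketch shows how SmoothQuant’s per-channel smoothing scales can be computed from activation and weight statistics. It reflects the published SmoothQuant method rather than the toolkit’s internal implementation, and the calibration data and alpha value are illustrative.

# Minimal sketch of SmoothQuant's per-channel smoothing scales
# (illustrative only; not the toolkit's internal implementation).
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(1024, 4096))   # calibration activations [tokens, in_channels]
weights = rng.normal(size=(4096, 4096))       # linear layer weights [in_channels, out_channels]
alpha = 0.5                                   # migration strength hyperparameter

act_max = np.abs(activations).max(axis=0)     # per-input-channel activation range
w_max = np.abs(weights).max(axis=1)           # per-input-channel weight range
scale = act_max**alpha / w_max**(1 - alpha)

smoothed_activations = activations / scale    # activations become easier to quantize
smoothed_weights = weights * scale[:, None]   # difficulty migrates to the weights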

The current generation of instances like p5 and g6 provide support for FP8 using specialized tensor cores. FP8 represents float point numbers in 8 bits instead of the usual 16. At the time of writing, vLLM and TRT-LLM support quantizing the KV cache, attention, and linear layers for text-only LLMs. This reduces memory footprint, increases throughput, and lowers latency. Whereas both weights and activations can be quantized for p5 and g6 instances (W8A8), only weights can be quantized for p4d and g5 instances (W8A16). Though FP8 quantization has minimal impact on accuracy, you should always evaluate the quantized model on your data and for your use case. You can evaluate the quantized model through Amazon SageMaker Clarify. For more details, see Understand options for evaluating large language models with SageMaker Clarify.

The following graph compares the throughput of a FP8 quantized Meta Llama 3.1 70B Instruct model against a non-quantized Meta Llama 3.1 70B Instruct model on an ml.p4d.24xlarge instance.

Quantized vs base model throughput

The quantized model has a smaller memory footprint and it can be deployed to a smaller (and cheaper) instance type. In this post, we have deployed the quantized model on g5.12xlarge.

The following graph shows the price difference per million tokens between the FP8-quantized model deployed on g5.12xlarge and the non-quantized version deployed on p4d.24xlarge.

Quantized model price

Our analysis shows a clear price-performance edge for the FP8 quantized model over the non-quantized approach. However, quantization has an impact on model accuracy, so we strongly recommend testing the quantized version of the model on your datasets.

The following is the SageMaker Python SDK code snippet for quantization. You just need to provide the quantization_config attribute in the optimize() function:

# Deploy the FP8-quantized model to a smaller GPU instance
quantized_instance_type = "ml.g5.12xlarge"

output_path = f"s3://{artifacts_bucket_name}/llama-3-1-70b-fp8/"

optimized_model = model_builder.optimize(
    instance_type=quantized_instance_type,
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "fp8",
            "OPTION_TENSOR_PARALLEL_DEGREE": "4"
        },
    },
    output_path=output_path,
)

Refer to the following code example to learn more about how to enable FP8 quantization and speculative decoding using the optimization toolkit for a pre-trained Amazon SageMaker JumpStart model. If you want to deploy a fine-tuned model with SageMaker JumpStart using speculative decoding, refer to the following notebook.
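
For orientation, enabling speculative decoding through the SDK follows the same optimize() pattern as the quantization snippet above, reusing the same model_builder and artifacts bucket. The following is a hedged sketch; the configuration keys and values are assumptions, so verify them against the SageMaker Python SDK documentation.

# Sketch: enable speculative decoding with the SageMaker-provided draft model.
# Configuration keys and values are assumptions; verify against the SDK docs.
optimized_model = model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    speculative_decoding_config={"ModelProvider": "sagemaker"},
    output_path=f"s3://{artifacts_bucket_name}/llama-3-1-70b-spec-dec/",
)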

Compilation

Compilation optimizes the model to extract the best available performance on the chosen hardware type, without any loss in accuracy. For compilation, the SageMaker inference optimization toolkit provides efficient loading and caching of optimized models to reduce model loading and auto scaling time by up to 40–60 % for Meta Llama 3 8B and 70B.

Model compilation enables running LLMs on accelerated hardware, such as GPUs, while simultaneously optimizing the model’s computational graph for optimal performance on the target hardware. When using the Large Model Inference (LMI) Deep Learning Container (DLC) with the TensorRT-LLM framework, the compiler is invoked from within the framework and creates compiled artifacts. These compiled artifacts are unique for a combination of input shapes, precision of the model, tensor parallel degree, and other framework- or compiler-level configurations. Although the compilation process avoids overhead during inference and enables optimized inference, it can take a lot of time.

To avoid re-compiling every time a model is deployed onto a GPU with the TensorRT-LLM framework, SageMaker introduces the following features:

  • A cache of pre-compiled artifacts – This includes popular models like Meta Llama 3.1. When using an optimized model with the compilation config, SageMaker automatically uses these cached artifacts when the configurations match.
  • Ahead-of-time compilation – The inference optimization toolkit enables you to compile your models with the desired configurations before deploying them on SageMaker.

The following graph illustrates the improvement in model loading time when using pre-compiled artifacts with the SageMaker LMI DLC. The models were compiled with a sequence length of 4096 and a batch size of 16, with Meta Llama 3.1 8B deployed on a g5.12xlarge (tensor parallel degree = 4) and Meta Llama 3.1 70B Instruct on a p4d.24xlarge (tensor parallel degree = 8). As you can see on the graph, the bigger the model, the bigger the benefit of using a pre-compiled model (16% improvement for Meta Llama 3 8B and 43% improvement for Meta Llama 3 70B).

Load times

Compilation using the SageMaker Python SDK

For the SageMaker Python SDK, you can configure the compilation by changing the environment variables in the .optimize() function. For more details on compilation_config, refer to TensorRT-LLM ahead-of-time compilation of models tutorial.

optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_ROLLING_BATCH": "trtllm",
            "OPTION_MAX_INPUT_LEN": "4096",
            "OPTION_MAX_OUTPUT_LEN": "4096",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
        }
    },
    output_path=f"s3://{artifacts_bucket_name}/trtllm/",
)

Refer to the following notebook for more information on how to enable TensorRT-LLM compilation using the optimization toolkit for a pre-trained SageMaker JumpStart model.

Amazon SageMaker Studio UI experience

In this section, let’s walk through the Amazon SageMaker Studio UI experience to run an inference optimization job. In this case, we use the Meta Llama 3.1 70B Instruct model, and for the optimization option, we quantize the model using INT4-AWQ and then use the SageMaker JumpStart suggested draft model Meta Llama 3.2 1B Instruct for speculative decoding.

First, we search for the Meta Llama 3.1 70B Instruct model in the SageMaker JumpStart model hub and choose Optimize on the model card.

Studio-Optimize

The Create inference optimization job page provides you options to choose the type of optimization. In this case, we choose to take advantage of the benefits of both INT4-AWQ quantization and speculative decoding.

Studio Optimization Options

Choosing Optimization Options in Studio

For the draft model, you have a choice to use the SageMaker recommended draft model, choose one of the SageMaker JumpStart models, or bring your own draft model.

Draft model options in Studio

For this scenario, we choose the SageMaker recommended Meta Llama 3.2 1B Instruct model as the draft model and start the optimization job.

Optimization job details

When the optimization job is complete, you have an option to evaluate performance or deploy the model onto a SageMaker endpoint for inference.

Inference Optimization Job deployment

Optimized Model Deployment

Pricing

For compilation and quantization jobs, SageMaker will optimally choose the right instance type, so you don’t have to spend time and effort. You will be charged based on the optimization instance used. To learn more, see Amazon SageMaker pricing. For speculative decoding, there is no additional optimization cost involved; the SageMaker inference optimization toolkit will package the right container and parameters for the deployment on your behalf.

Conclusion

To get started with the inference optimization toolkit, refer to Achieve up to 2x higher throughput while reducing cost by up to 50% for GenAI inference on SageMaker with new inference optimization toolkit: user guide – Part 2. This post will walk you through how to use the inference optimization toolkit when using SageMaker inference with SageMaker JumpStart and the SageMaker Python SDK. You can use the inference optimization toolkit with supported models on SageMaker JumpStart. For the full list of supported models, refer to Inference optimization for Amazon SageMaker models.


About the Authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes.

Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Rishabh Ray Chaudhury is a Senior Product Manager with Amazon SageMaker, focusing on Machine Learning inference. He is passionate about innovating and building new experiences for Machine Learning customers on AWS to help scale their workloads. In his spare time, he enjoys traveling and cooking. You can find him on LinkedIn.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Read More

Syngenta develops a generative AI assistant to support sales representatives using Amazon Bedrock Agents

This post was written with Zach Marston and Serg Masis from Syngenta.

Syngenta and AWS collaborated to develop Cropwise AI, an innovative solution powered by Amazon Bedrock Agents, to accelerate their sales reps’ ability to place Syngenta seed products with growers across North America. Cropwise AI harnesses the power of generative AI using AWS to enhance Syngenta’s seed selection tools and streamline the decision-making process for farmers and sales representatives. This conversational agent offers a new intuitive way to access the extensive quantity of seed product information to enable seed recommendations, providing farmers and sales representatives with an additional tool to quickly retrieve relevant seed information, complementing their expertise and supporting collaborative, informed decision-making.

Generative AI is reshaping businesses and unlocking new opportunities across various industries. As a global leader in agriculture, Syngenta has led the charge in using data science and machine learning (ML) to elevate customer experiences with an unwavering commitment to innovation. Building on years of experience in deploying ML and computer vision to address complex challenges, Syngenta introduced applications like NemaDigital, Moth Counter, and Productivity Zones. Now, Syngenta is advancing further by using large language models (LLMs) and Amazon Bedrock Agents to implement Cropwise AI on AWS, marking a new era in agricultural technology.

In this post, we discuss Syngenta’s journey in developing Cropwise AI.

The business challenge

Syngenta offers a diverse portfolio of seed products in North America, reflecting the company’s commitment to growers by meeting them in the field. The seed selection process involves careful consideration of multiple factors, including seed product characteristics, specific growing environments, and the unique practices and goals of each farmer. Many farmers struggle to navigate the complex decision-making process and might not always choose the most suitable seeds for their specific conditions, potentially impacting crop yields and profitability.

Recognizing an opportunity to improve this selection process, Syngenta collaborated with the AWS Generative AI Innovation Center, an AWS program that fosters collaboration between customers and AWS experts. This collaboration yielded Cropwise AI, which improves the efficiency of sales reps’ interactions with customers to suggest Syngenta seed products.

Cropwise AI

Cropwise AI transforms the seed selection process in several powerful ways. It simplifies complex charts and tables into a natural, conversational experience, providing users with quick, on-demand access to detailed seed product information on a mobile device. The tool also streamlines data navigation, allowing users to efficiently explore and compare Syngenta’s extensive seed catalogue. Additionally, Cropwise AI enables personalized recommendations at scale, tailoring seed choices to align with local conditions and specific farm needs, creating a more precise and accessible selection process.

Cropwise AI enhances the seed selection process by acting as a valuable companion to the expertise of sales representatives and the decision-making process of farmers. For sales representatives, it empowers them with deeper insights, enabling more informed recommendations. For farmers, it offers an accessible resource for exploring and comparing seed options. The following example shows the Cropwise AI mobile app in GHX 2.0 describing a common seed product (left) and the transcribed conversation (right).

Cropwise AI: Hello, Jack! What can I do for you?
Me: Tell me about the G11V76 hybrid of golden harvest
Cropwise AI: G11V76 is a corn hybrid developed by Golden Harvest. Let’s delve into its characteristics:

  • Versatility Across Soil Types: G11V76 performs well in various soil conditions.
  • Strong Drought Tolerance: It can withstand dry periods effectively.
  • Yield Potential: G11V76 offers excellent yield potential across different environments.
  • Fast Drydown: The hybrid dries down quickly after maturity.
  • Grain Quality: It produces good-quality grain.
  • Emergence: Dependable emergence even in stress environments.
  • Disease Tolerance: It exhibits tolerance to diseases such as Gray Leaf Spot, Northern Corn Leaf Blight, Goss’s Wilt, and more.
  • Agronomic Management: G11V76 performs well in continuous corn, drought-prone soils, high pH soils, and variable soil conditions.
  • Plant Characteristics: Moderately tall plant height, semi-flex ear placement, pink cob color, and upright leaf type.
  • Seeding Rates: The recommended seeding rates vary.

The collaboration between Syngenta and AWS showcases the transformative power of LLMs and AI agents. With the capability to embed human expertise and communicate in natural language, generative AI amplifies human abilities, allowing organizations to utilize knowledge at scale. This project is just one example of how Syngenta is using advanced AWS AI services to drive innovation in agriculture.

In the following sections, we provide a detailed overview of the Cropwise AI solution by diving deep into the underlying workflows. We explore how you can use Amazon Bedrock Agents with generative AI and cutting-edge AWS technologies, which offer a transformative approach to supporting sales reps across this industry (and beyond).

Solution overview

Cropwise AI is built on an AWS architecture designed to address these challenges through scalability, maintainability, and security. The architecture is divided into two main components: the agent architecture and knowledge base architecture. This solution is also deployed by using the AWS Cloud Development Kit (AWS CDK), which is an open-source software development framework that defines cloud infrastructure in modern programming languages and provisions it through AWS CloudFormation.

Agent architecture

The following diagram illustrates the serverless agent architecture with standard authorization and real-time interaction, and an LLM agent layer using Amazon Bedrock Agents for multi-knowledge base and backend orchestration using API or Python executors. Domain-scoped agents enable code reuse across multiple agents.

Amazon Bedrock Agents offers several key benefits for Syngenta compared to other solutions like LangGraph:

  • Flexible model selection – Syngenta can access multiple state-of-the-art foundation models (FMs) like Anthropic’s Claude 3.5 Haiku and Sonnet, Meta Llama 3.1, and others, and can switch between these models without changing code (see the sketch after this list). They can select the model that is accurate enough for a specific workflow and yet cost-effective.
  • Ease of deployment – It is seamlessly integrated with other AWS services and has a unified development and deployment workflow.
  • Enterprise-grade security – With the robust security infrastructure of AWS, Amazon Bedrock is in scope for common compliance standards, including ISO, SOC, and CSA STAR Level 2; is HIPAA eligible; and you can use Amazon Bedrock in compliance with the GDPR.
  • Scalability and integration – It allows for straightforward API integration with existing systems and has built-in support for orchestrating multiple actions. This enables Syngenta to effortlessly build and scale their AI application.

The agent architecture handles user interactions and processes data to deliver accurate recommendations. It uses the following AWS services:

  • Serverless computing with AWS Lambda – The architecture begins with AWS Lambda, which provides serverless computing power, allowing for automatic scaling based on workload demands. When custom processing tasks are required, such as invoking the Amazon Bedrock agent or integrating with various data sources, the Lambda function is triggered to run these tasks efficiently.
  • Lambda-based action groups – The Amazon Bedrock agent directs user queries to functional actions, which may use API connections to gather data from various sources for use in workflows, model integrations to generate recommendations from the gathered data, or Python executions to extract specific pieces of information relevant to a user’s workflow and aid in product comparisons.
  • Secure user and data management – User authentication and authorization are managed centrally and securely through Amazon Cognito. This service makes sure user identities and access rights are handled effectively, maintaining the confidentiality and security of the system. The user identity gets propagated over a secure side channel (session attributes) to the agent and action groups. This allows them to access user-specific or restricted information, whereas each access can be authorized within the workflow. The session attributes aren’t shared with the LLM, making sure that authorization decisions are done on validated and tamper-proof data. For more information about this approach, see Implement effective data authorization mechanisms to secure your data used in generative AI applications.
  • Real-time data synchronization with AWS AppSync – To make sure that users always have access to the most up-to-date information, the solution uses AWS AppSync. It facilitates real-time data synchronization and updates by using GraphQL APIs, providing seamless and responsive user experiences.
  • Efficient metadata storage with Amazon DynamoDB – To support quick and efficient data retrieval, document metadata is stored in Amazon DynamoDB. This NoSQL database is optimized for rapid access, making sure the knowledge base remains responsive and searchable.
  • Centralized logging and monitoring with Amazon CloudWatch – To maintain operational excellence, Amazon CloudWatch is employed for centralized logging and monitoring. It provides deep operational insights, aiding in troubleshooting, performance tuning, and making sure the system runs smoothly.

The architecture is designed for flexibility and resilience. AWS Lambda enables the seamless execution of various tasks, including data processing and API integration, and AWS AppSync provides real-time interaction and data flow between the user and the system. By using Amazon Cognito for authentication, the agent maintains confidentiality, protecting sensitive user data.
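
To make the session-attribute side channel concrete, the following is a minimal sketch of how a backend might invoke the agent with the boto3 bedrock-agent-runtime client; the agent ID, alias ID, and attribute names are placeholders rather than the actual Cropwise AI configuration.

import uuid

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# The user identity is validated upstream (for example, from an Amazon Cognito token)
# and passed as session attributes: action group Lambda functions can read them,
# but they are never exposed to the LLM prompt.
response = bedrock_agent_runtime.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId=str(uuid.uuid4()),
    inputText="Tell me about the G11V76 hybrid of Golden Harvest",
    sessionState={
        "sessionAttributes": {
            "userId": "jack@example.com",  # hypothetical validated identity
        }
    },
)

# The agent streams its completion; collect the chunks into the final answer
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)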

Knowledge base architecture

The following diagram illustrates the knowledge base architecture.

The knowledge base architecture focuses on processing and storing agronomic data, providing quick and reliable access to critical information. Key components include:

  • Orchestrated document processing with AWS Step Functions – The document processing workflow begins with AWS Step Functions, which orchestrates each step in the process. From the initial ingestion of documents to their final storage, Step Functions makes sure that data handling is seamless and efficient.
  • Automated text extraction with Amazon Textract – As documents are uploaded to Amazon Simple Storage Service (Amazon S3), Amazon Textract is triggered to automatically extract text from these documents. This extracted text is then available for further analysis and the creation of metadata, adding layout-based structure and meaning to the raw data.
  • Primary data storage with Amazon S3 – The processed documents, along with their associated metadata, are securely stored in Amazon S3. This service acts as the primary storage solution, providing consistent access and organized data management for all stored content.
  • Efficient metadata storage with DynamoDB – To support quick and efficient data retrieval, document metadata is stored in DynamoDB.
  • Amazon Bedrock Knowledge Bases – The final textual content and metadata information gets ingested into Amazon Bedrock Knowledge Bases for efficient retrieval during the agentic workflow, backed by an Amazon OpenSearch Service vector store. Agents can use one or multiple knowledge bases, depending on the context in which they are used.

This architecture enables comprehensive data management and retrieval, supporting the agent’s ability to deliver precise recommendations. By integrating Step Functions with Amazon Textract, the system automates document processing, reducing manual intervention and improving efficiency.
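
As a concrete illustration of the ingestion step, the following is a minimal sketch of a handler that a Step Functions state (or an S3 event) might invoke to start an asynchronous Amazon Textract job and record the document’s metadata in DynamoDB; the table name and event shape are hypothetical.

import time

import boto3

textract = boto3.client("textract")
dynamodb = boto3.resource("dynamodb")
metadata_table = dynamodb.Table("document-metadata")  # hypothetical table name


def lambda_handler(event, context):
    """Start asynchronous text extraction for an uploaded document and record its metadata."""
    bucket = event["bucket"]  # hypothetical event shape
    key = event["key"]

    # Kick off asynchronous text detection; Textract processes the document in the background
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Record document metadata so downstream workflow steps can track the job
    metadata_table.put_item(
        Item={
            "documentKey": key,
            "bucket": bucket,
            "textractJobId": job["JobId"],
            "ingestedAt": int(time.time()),
            "status": "PROCESSING",
        }
    )
    return {"jobId": job["JobId"]}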

Use cases

Cropwise AI addresses several critical use cases, providing tangible benefits to sales representatives and growers:

  • Product recommendation – A sales representative or grower seeks advice on the best seed choices for specific environmental conditions, such as “My region is very dry and windy. What corn hybrids do you suggest for my field?”. The agent uses natural language processing (NLP) to understand the query and uses underlying agronomy models to recommend optimal seed choices tailored to specific field conditions and agronomic needs. By integrating multiple data sources and explainable research results, including weather patterns and soil data, the agent delivers personalized and context-aware recommendations.
  • Querying agronomic models – A grower has questions about plant density or other agronomic factors affecting yield, such as “What are the yields I can expect with different seeding rates for corn hybrid G11V76?” The agent interprets the query, accesses the appropriate agronomy model, and provides a simple explanation that is straightforward for the grower to understand. This empowers growers to make informed decisions based on scientific insights, enhancing crop management strategies.
  • Integration of multiple data sources – A grower can ask for a recommendation that considers real-time data like current weather conditions or market prices, such as “Is it a good time to apply herbicides to my corn field?” The agent pulls data from various sources, integrates it with existing agronomy models, and provides a recommendation that accounts for current conditions. This holistic approach makes sure that recommendations are timely, relevant, and actionable.

Results

The implementation of Cropwise AI has yielded significant improvements in the efficiency and accuracy of agricultural product recommendations:

  • Sales representatives are now able to generate recommendations with analytical models five times faster, allowing them to focus more time on building relationships with customers and exploring new opportunities
  • The natural language interface simplifies interactions, reducing the learning curve for new users and minimizing the need for extensive training
  • The agent’s ability to track recommendation outcomes provides valuable insights into customer preferences and helps to improve personalization over time

To evaluate the results, Syngenta collected a dataset of 100 Q&A pairs from sales representatives and ran them against the agent. In addition to manual human evaluation, they used an LLM as a judge (with the Ragas framework) to assess the answers generated by Cropwise AI. The following graph shows the results of this evaluation, which indicate that answer relevancy, conciseness, and faithfulness are very high.
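
For readers who want to run a similar LLM-as-a-judge evaluation, the following is a minimal sketch using the open source Ragas library (the API shown reflects the 0.1.x releases and may differ in newer versions, and a judge LLM must be configured separately); the sample question, answer, and contexts are illustrative placeholders, not Syngenta’s dataset.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Illustrative sample only -- plug in your own Q&A pairs and retrieved contexts
samples = {
    "question": ["What corn hybrids do you suggest for a dry, windy field?"],
    "answer": ["G11V76 is a strong fit thanks to its drought tolerance ..."],
    "contexts": [["G11V76: strong drought tolerance, dependable emergence in stress environments ..."]],
}

dataset = Dataset.from_dict(samples)

# Each metric scores the generated answers with an LLM judge under the hood
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results)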

Conclusion

Cropwise AI is revolutionizing the agricultural industry by addressing the unique challenges faced by seed representatives, particularly those managing multiple seed products for growers. This AI-powered tool streamlines the process of placing diverse seed products, making it effortless for sales reps to deliver precise recommendations tailored to each grower’s unique needs. By using advanced generative AI and AWS technologies, such as Amazon Bedrock Agents, Cropwise AI significantly boosts operational efficiency, enhancing the accuracy, speed, and user experience of product recommendations.

The success of this solution highlights AI’s potential to transform traditional agricultural practices, opening doors for further innovations across the sector. As Cropwise AI continues to evolve, efforts will focus on expanding capabilities, enhancing data integration, and maintaining compliance with shifting regulatory standards.

Ultimately, Cropwise AI not only refines the sales process but also empowers sales representatives and growers with actionable insights and robust tools essential for thriving in a dynamic agricultural environment. By fostering an efficient, intuitive recommendation process, Cropwise AI optimizes crop yields and enhances overall customer satisfaction, positioning it as an invaluable resource for the modern agricultural sales force.

For more details, explore the Amazon Bedrock Samples GitHub repo and Syngenta Cropwise AI.


About the Authors

Zach Marston is a Digital Product Manager at Syngenta, focusing on computational agronomy solutions. With a PhD in Entomology and Plant Pathology, he combines scientific knowledge with over a decade of experience in agricultural machine learning. Zach is dedicated to exploring innovative ways to enhance farming efficiency and sustainability through AI and data-driven approaches.

Serg Masis is a Senior Data Scientist at Syngenta, and has been at the confluence of the internet, application development, and analytics for the last two decades. He’s the author of the bestselling book “Interpretable Machine Learning with Python,” and the upcoming book “DIY AI.” He’s passionate about sustainable agriculture, data-driven decision-making, responsible AI, and making AI more accessible.

Arlind Nocaj is a Senior Solutions Architect at AWS in Zurich, Switzerland, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (Graph Drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include machine learning and MLOps, with particular emphasis on document processing, natural language processing, and large language models.

Victor Antonino, M.Eng, is a Senior Machine Learning Engineer at AWS with over a decade of experience in generative AI, computer vision, and MLOps. At AWS, Victor has led transformative projects across industries, enabling customers to use cutting-edge machine learning technologies. He designs modern data architectures and enables seamless machine learning deployments at scale, supporting diverse use cases in finance, manufacturing, healthcare, and media. Victor holds several patents in AI technologies, has published extensively on clustering and neural networks, and actively contributes to the open source community with projects that democratize access to AI tools.

Laksh Puri is a Generative AI Strategist at the AWS Generative AI Innovation Center, based in London. He works with large organizations across EMEA on their AI strategy, including advising executive leadership to define and deploy impactful generative AI solutions.

Hanno Bever is a Senior Machine Learning Engineer in the AWS Generative AI Innovation Center based in Berlin. In his 5 years at Amazon, he has helped customers across all industries run machine learning workloads on AWS. He is specialized in migrating foundation model training and inference tasks to AWS silicon chips AWS Trainium and AWS Inferentia.

Read More

Speed up your AI inference workloads with new NVIDIA-powered capabilities in Amazon SageMaker

Speed up your AI inference workloads with new NVIDIA-powered capabilities in Amazon SageMaker

This post is co-written with Abhishek Sawarkar, Eliuth Triana, Jiahong Liu and Kshitiz Gupta from NVIDIA. 

At re:Invent 2024, we are excited to announce new capabilities to speed up your AI inference workloads with NVIDIA accelerated computing and software offerings on Amazon SageMaker. These advancements build upon our collaboration with NVIDIA, which includes adding support for inference-optimized GPU instances and integration with NVIDIA technologies. They represent our continued commitment to delivering scalable, cost-effective, and flexible GPU-accelerated AI inference capabilities to our customers.

Today, we are introducing three key advancements that further expand our AI inference capabilities:

  1. NVIDIA NIM microservices are now available in AWS Marketplace for SageMaker Inference deployments, providing customers with easy access to state-of-the-art generative AI models.
  2. NVIDIA Nemotron-4 is now available on Amazon SageMaker JumpStart, significantly expanding the range of high-quality, pre-trained models available to our customers. This integration provides a powerful multilingual model that excels in reasoning benchmarks.
  3. Inference-optimized P5e and G6e instances are now generally available on Amazon SageMaker, giving customers access to NVIDIA H200 Tensor Core and L40S GPUs for AI inference workloads.

In this post, we will explore how you can use these new capabilities to enhance your AI inference on Amazon SageMaker. We’ll walk through the process of deploying NVIDIA NIM microservices from AWS Marketplace for SageMaker Inference. We’ll then dive into NVIDIA’s model offerings on SageMaker JumpStart, showcasing how to access and deploy the Nemotron-4 model directly in the JumpStart interface. This will include step-by-step instructions on how to find the Nemotron-4 model in the JumpStart catalog, select it for your use case, and deploy it with a few clicks. We’ll also demonstrate how to fine-tune and optimize this model for your specific requirements. Additionally, we’ll introduce you to the new inference-optimized P5e and G6e instances powered by NVIDIA H200 and L40S GPUs, showcasing how they can significantly boost your AI inference performance. By the end of this post, you’ll have a practical understanding of how to implement these advancements in your own AI projects, enabling you to accelerate your inference workloads and drive innovation in your organization.

Announcing NVIDIA NIM in AWS Marketplace for SageMaker Inference

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, offers a set of high-performance microservices designed to help organizations rapidly deploy and scale generative AI applications on NVIDIA-accelerated infrastructure. SageMaker Inference is a fully managed capability for customers to run generative AI and machine learning models at scale, providing purpose-built features and a broad array of inference-optimized instances. AWS Marketplace serves as a curated digital catalog where customers can find, buy, deploy, and manage third-party software, data, and services needed to build solutions and run businesses. We’re excited to announce that AWS customers can now access NVIDIA NIM microservices for SageMaker Inference deployments through the AWS Marketplace, simplifying the deployment of generative AI models and helping partners and enterprises to scale their AI capabilities. The initial availability includes a portfolio of models packaged as NIM microservices, expanding the options for AI inference on Amazon SageMaker, including:

  • NVIDIA Nemotron-4: a cutting-edge large language model (LLM) designed to generate diverse synthetic data that closely mimics real-world data, enhancing the performance and robustness of custom LLMs across various domains.
  • Llama 3.1 8B-Instruct: an 8-billion-parameter multilingual LLM that is a pre-trained and instruction-tuned generative model optimized for language understanding, reasoning, and text generation use cases.
  • Llama 3.1 70B-Instruct: a 70-billion-parameter pre-trained, instruction-tuned model optimized for multilingual dialogue.
  • Mixtral 8x7B Instruct v0.1: a high-quality sparse mixture of experts model (SMoE) with open weights that can follow instructions, complete requests, and generate creative text formats.

Key benefits of deploying NIM on AWS

  • Ease of deployment: AWS Marketplace integration makes it straightforward to select and deploy models directly, eliminating complex setup processes. Select your preferred model from the marketplace, configure your infrastructure options, and deploy within minutes.
  • Seamless integration with AWS services: AWS offers robust infrastructure options, including GPU-optimized instances for inference, managed AI services such as SageMaker, and Kubernetes support with EKS, helping your deployments scale effectively.
  • Security and control: Maintain full control over your infrastructure settings on AWS, allowing you to optimize your runtime environments to match specific use cases.

How to get started with NVIDIA NIM on AWS

To deploy NVIDIA NIM microservices from the AWS Marketplace, follow these steps:

  1. Visit the NVIDIA NIM page on the AWS Marketplace and select your desired model, such as Llama 3.1 or Mixtral.
  2. Choose the AWS Regions to deploy to, GPU instance types, and resource allocations to fit your needs.
  3. Use the notebook examples to start your deployment: with SageMaker you create the model, configure the endpoint, and deploy the model, while AWS handles the orchestration of resources, networking, and scaling as needed (a minimal sketch follows these steps).
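
As a rough illustration of step 3, the following sketch shows how a notebook might deploy a subscribed NIM model package with the SageMaker boto3 client; the model package ARN, execution role, names, and instance type are placeholders you would replace with values from your Marketplace subscription and account.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholders: use the ARN from your AWS Marketplace subscription page, your own
# execution role, and an instance type supported by the chosen NIM
model_package_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-nim"
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Create a SageMaker model from the subscribed Marketplace model package
sagemaker_client.create_model(
    ModelName="nim-llama-3-1-8b",
    ExecutionRoleArn=role,
    PrimaryContainer={"ModelPackageName": model_package_arn},
    EnableNetworkIsolation=True,
)

# Endpoint configuration on a GPU instance
sagemaker_client.create_endpoint_config(
    EndpointConfigName="nim-llama-3-1-8b-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "nim-llama-3-1-8b",
            "InstanceType": "ml.g5.2xlarge",  # example only; choose per the model's requirements
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the endpoint; SageMaker provisions and manages the underlying resources
sagemaker_client.create_endpoint(
    EndpointName="nim-llama-3-1-8b-endpoint",
    EndpointConfigName="nim-llama-3-1-8b-config",
)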

NVIDIA NIM microservices in the AWS Marketplace facilitate seamless deployment in SageMaker so that organizations across various industries can develop, deploy, and scale their generative AI applications more quickly and effectively than ever.

SageMaker JumpStart now includes NVIDIA models: Introducing NVIDIA NIM microservices for Nemotron models

SageMaker JumpStart is a model hub and no-code solution within SageMaker that makes advanced AI inference capabilities more accessible to AWS customers by providing a streamlined path to access and deploy popular models from different providers. It offers an intuitive interface where organizations can easily deploy popular AI models with a few clicks, eliminating the complexity typically associated with model deployment and infrastructure management. The integration offers enterprise-grade features including model evaluation metrics, fine-tuning and customization capabilities, and collaboration tools, all while giving customers full control of their deployment.

We are excited to announce that NVIDIA models are now available in SageMaker JumpStart, marking a significant milestone in our ongoing collaboration. This integration brings NVIDIA’s cutting-edge AI models directly to SageMaker Inference customers, starting with the powerful Nemotron-4 model. With JumpStart, customers can access these state-of-the-art models within the SageMaker ecosystem, combining NVIDIA’s AI models with the scalable, price-performant inference of SageMaker.

Support for Nemotron-4 – A multilingual and fine-grained reasoning model

We are also excited to announce that NVIDIA Nemotron-4 is now available in the SageMaker JumpStart model hub. Nemotron-4 is a cutting-edge LLM designed to generate diverse synthetic data that closely mimics real-world data, enhancing the performance and robustness of custom LLMs across various domains. Compact yet powerful, it has been fine-tuned on carefully curated datasets that emphasize high-quality sources and underrepresented domains. This refined approach enables strong results in commonsense reasoning, mathematical problem-solving, and programming tasks. Moreover, Nemotron-4 exhibits outstanding multilingual capabilities compared to similarly sized models, and even outperforms those over four times larger and those explicitly specialized for multilingual tasks.

Nemotron-4 – performance and optimization benefits

Nemotron-4 demonstrates strong performance in common sense reasoning tasks like SIQA, ARC, PIQA, and Hellaswag with an average score of 73.4, outperforming similarly sized models and showing comparable performance to larger ones such as Llama-2 34B. Its exceptional multilingual capabilities also surpass specialized models like mGPT 13B and XGLM 7.5B on benchmarks like XCOPA and TyDiQA, highlighting its versatility and efficiency. When deployed through NVIDIA NIM microservices on SageMaker, these models deliver optimized inference performance, allowing businesses to generate and validate synthetic data with unprecedented speed and accuracy.

Through SageMaker JumpStart, customers can access pre-optimized models from NVIDIA that significantly simplify deployment and management. These containers are specifically tuned for NVIDIA GPUs on AWS, providing optimal performance out of the box. NIM microservices deliver efficient deployment and scaling, allowing organizations to focus on their use cases rather than infrastructure management.

Quick start guide

  1. From the SageMaker Studio console, select JumpStart and choose the NVIDIA model family as shown in the following image.
  2. Select the NVIDIA Nemotron-4 NIM microservice.
  3. On the model details page, choose Deploy, and a pop-up window will remind you that you need an AWS Marketplace subscription. If you haven’t subscribed to this model, you can choose Subscribe, which will direct you to the AWS Marketplace to complete the subscription. Otherwise, you can choose Deploy to proceed with model deployment.
  4. On the model deployment page, you can configure the endpoint name, select the endpoint instance type and instance count, in addition to other advanced settings, such as IAM role and VPC setting.
  5. After you finish setting up the endpoint and choose Deploy at the bottom right corner, the NVIDIA Nemotron-4 model will be deployed to a SageMaker endpoint. After the endpoint’s status is In Service, you can start testing the model by invoking the endpoint using the following code (a short sketch for consuming the streamed response follows these steps). Take a look at the example notebook if you want to deploy the model programmatically.
    import json

    import boto3

    # Runtime client for invoking the endpoint; endpoint_name and payload_model
    # come from the deployment above
    client = boto3.client("sagemaker-runtime")

    messages = [
        {"role": "user", "content": "Hello! How are you?"},
        {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
        {"role": "user", "content": "Write a short limerick about the wonders of GPU Computing."},
    ]

    payload = {
        "model": payload_model,
        "messages": messages,
        "max_tokens": 100,
        "stream": True,
    }

    # Invoke the endpoint and stream the generated tokens back
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        Accept="application/jsonlines",
    )

  6. To clean up the endpoint, you can delete the endpoint from the SageMaker Studio console or call the delete endpoint API.
    # Delete the endpoint using the SageMaker control plane client
    boto3.client("sagemaker").delete_endpoint(EndpointName=endpoint_name)
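
As noted in step 5, invoke_endpoint_with_response_stream returns an event stream rather than a single payload. The following minimal sketch shows one way to consume it; the exact framing of the streamed bytes (for example, JSON lines) depends on the serving container, so treat the parsing as illustrative.

# Consume the event stream returned by invoke_endpoint_with_response_stream in step 5.
# Each event carries a PayloadPart with raw bytes; how those bytes are framed
# (for example, JSON lines) depends on the serving container.
for event in response["Body"]:
    part = event.get("PayloadPart")
    if part:
        print(part["Bytes"].decode("utf-8"), end="", flush=True)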

SageMaker JumpStart provides an additional streamlined path to access and deploy NVIDIA NIM microservices, making advanced AI capabilities even more accessible to AWS customers. Through JumpStart’s intuitive interface, organizations can deploy Nemotron models with a few clicks, eliminating the complexity typically associated with model deployment and infrastructure management. The integration offers enterprise-grade features including model evaluation metrics, customization capabilities, and collaboration tools, all while maintaining data privacy within the customer’s VPC. This comprehensive integration enables organizations to accelerate their AI initiatives while using the combined strengths of the scalable infrastructure provided by AWS and NVIDIA’s optimized models.

P5e and G6e instances powered by NVIDIA H200 Tensor Core and L40S GPUs are now available on SageMaker Inference

SageMaker now supports new P5e and G6e instances, powered by NVIDIA GPUs for AI inference.

P5e instances use NVIDIA H200 Tensor Core GPUs for AI and machine learning. These instances offer 1.7 times larger GPU memory and 1.4 times higher memory bandwidth than previous generations. With eight powerful H200 GPUs per instance connected using NVIDIA NVLink for seamless GPU-to-GPU communication and blazing-fast 3,200 Gbps multi-node networking through EFA technology, P5e instances are purpose-built for deploying and training even the most demanding ML models. These instances deliver performance, reliability, and scalability for your cutting-edge inference applications.

G6e instances, powered by NVIDIA L40S GPUs, are one of the most cost-efficient GPU instances for deploying generative AI models and the highest-performance universal GPU instances for spatial computing, AI, and graphics workloads. They offer 2 times higher GPU memory (48 GB) and 2.9 times faster GPU memory bandwidth compared to G6 instances. G6e instances deliver up to 2.5 times better performance compared to G5 instances. Customers can use G6e instances to deploy LLMs and diffusion models for generating images, video, and audio. G6e instances feature up to eight NVIDIA L40S GPUs with 384 GB of total GPU memory (48 GB of memory per GPU) and third-generation AMD EPYC processors. They also support up to 192 vCPUs, up to 400 Gbps of network bandwidth, up to 1.536 TB of system memory, and up to 7.6 TB of local NVMe SSD storage.

Both instance families are now available on SageMaker Inference. Check out AWS Region availability and pricing on the SageMaker pricing page.

Conclusion

These new capabilities let you deploy NVIDIA NIM microservices on SageMaker through the AWS Marketplace, use new NVIDIA Nemotron models, and tap the latest GPU instance types to power your ML workloads. We encourage you to give these offerings a look and use them to accelerate your AI workloads on SageMaker Inference.


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Eliuth Triana is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within Cloud platforms & enhancing user experience on accelerated computing.

Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA-accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.

Read More

Unlock cost savings with the new scale down to zero feature in SageMaker Inference

Unlock cost savings with the new scale down to zero feature in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce a new feature for Amazon SageMaker inference endpoints: the ability to scale SageMaker inference endpoints to zero instances. This long-awaited capability is a game changer for our customers using the power of AI and machine learning (ML) inference in the cloud. Previously, SageMaker inference endpoints maintained a minimum number of instances to provide continuous availability, even during periods of low or no traffic. With this update, available when using SageMaker inference components, you have more options to align your resource usage with your specific needs and traffic patterns.

Refer to the accompanying notebooks to get started with the new scale down to zero feature.

The new feature expands the possibilities for managing SageMaker inference endpoints. It allows you to configure the endpoints so they can scale to zero instances during periods of inactivity, providing an additional tool for resource management. With this feature, you can closely match your compute resource usage to your actual needs, potentially reducing costs during times of low demand. This enhancement builds upon the existing auto scaling capabilities in SageMaker, offering more granular control over resource allocation. You can now configure your scaling policies to include scaling to zero, allowing for more precise management of your AI inference infrastructure.

The scale down to zero feature presents new opportunities for how businesses can approach their cloud-based ML operations. It provides additional options for managing resources across various scenarios, from development and testing environments to production deployments with variable traffic patterns. As with any new feature, you are encouraged to carefully evaluate how it fits into your overall architecture and operational needs, considering factors such as response times and the specific requirements of your applications.

In this post, we explore the new scale to zero feature for SageMaker inference endpoints, demonstrating how to implement and use this capability to optimize costs and manage resources more effectively. We cover the key scenarios where scaling to zero is beneficial, provide best practices for optimizing scale-up time, and walk through the step-by-step process of implementing this functionality. Additionally, we discuss how to set up scheduled scaling actions for predictable traffic patterns and test the behavior of your scaled-to-zero endpoints.

Determining when to scale to zero

Before we dive into the implementation details of the new scale to zero feature, it’s important to understand when and why to use it. Although the ability to scale SageMaker inference endpoints to zero instances offers significant cost-saving potential, not all scenarios benefit equally from scaling to zero, and in some cases it may even impact the performance of your applications. Let’s explore how to identify the scenarios where this feature provides the most value.

The ability to scale SageMaker inference endpoints to zero instances is particularly beneficial in three key scenarios:

  • Predictable traffic patterns – If your inference traffic is predictable and follows a consistent schedule, you can use this scaling functionality to automatically scale down to zero during periods of low or no usage. This eliminates the need to manually delete and recreate inference components and endpoints.
  • Sporadic or variable traffic – For applications that experience sporadic or variable inference traffic patterns, scaling down to zero instances can provide significant cost savings. However, scaling from zero instances back up to serving traffic is not instantaneous. During the scale-out process, any requests sent to the endpoint will fail, and these NoCapacityInvocationFailures will be captured in Amazon CloudWatch.
  • Development and testing environments – The scale to zero functionality is also beneficial when testing and evaluating new ML models. During model development and experimentation, you might create temporary inference endpoints to test different configurations. However, it’s possible to forget to delete these endpoints when you’re done. Scaling down to zero makes sure these test endpoints automatically scale back to zero instances when not in use, preventing unwanted charges. This allows you to freely experiment without closely monitoring infrastructure usage or remembering to manually delete endpoints. The automatic scaling to zero provides a cost-effective way to test out ideas and iterate on your ML solutions.

By carefully evaluating your specific use case against these scenarios, you can make informed decisions about implementing scale to zero functionality. This approach makes sure you maximize cost savings without compromising on the performance and availability requirements of your ML applications. It’s important to note that although scaling to zero can provide significant benefits, it also introduces a trade-off in terms of initial response time when scaling back up. Therefore, it’s crucial to assess whether your application can tolerate this potential delay and to implement appropriate strategies to manage it. In the following sections, we dive deeper into each scenario and provide guidance on how to determine if scaling to zero is the right choice for your specific needs. We also discuss best practices for implementation and strategies to mitigate potential drawbacks.

Scale down to zero is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

Now that we understand when to use the scale to zero feature, let’s dive into how to optimize its performance and implement it effectively. Scaling up from zero instances to serving traffic introduces a brief delay (cold start), which can impact your application’s responsiveness. To mitigate this, we first explore best practices for minimizing scale-up time. Then we walk through the step-by-step process of implementing the scale to zero functionality for your SageMaker inference endpoints.

Optimizing scale-up time best practices

When using the scale to zero feature, it’s crucial to minimize the time it takes for your endpoint to scale up and begin serving requests. The following are several best practices you can implement to decrease the scale-out time for your SageMaker inference endpoints:

  • Decrease model or container download time – Use uncompressed model format to reduce the time it takes to download the model artifacts when scaling up. Compressed model files may save storage space, but they require additional time to uncompress and files can’t be downloaded in parallel, which can slow down the scale-up process. To learn more, see Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference.
  • Reduce model server startup time – Look for ways to optimize the startup and initialization of your model server container. This could include techniques like building packages into the image, using multi-threading, or minimizing unnecessary initialization steps. For more details, see Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1.
  • Use faster auto scaling metrics – Take advantage of more granular auto scaling metrics like ConcurrentRequestsPerCopy to more accurately monitor and react to changes in inference traffic. These sub-minute metrics can help trigger scale-out actions more precisely, reducing the number of NoCapacityInvocationFailures your users might experience. For more information, see Amazon SageMaker inference launches faster auto scaling for generative AI models.
  • Handle failed requests – When scaling from zero instances, there will be a brief period where requests fail with NoCapacityInvocationFailures while SageMaker provisions resources. To handle this, you can use queues or implement client-side retries (a sketch of the retry approach follows this list):
    • Use a serverless queue like Amazon Simple Queue Service (Amazon SQS) to buffer requests during scale-out. When a failure occurs, enqueue the request and dequeue it after the model copies have scaled up from zero.
    • Alternatively, have your client reject failed requests and retry after some time, once the model copies have scaled. You can retrieve the number of copies of an inference component at any time by making the DescribeInferenceComponent API call and checking the CurrentCopyCount. This allows time for the model copies to scale out from zero, transparently handling the transition for end users.
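
The following is a minimal sketch of the client-side retry approach, assuming an inference component endpoint like the one built later in this post; it catches the validation error raised while there is no capacity, checks CurrentCopyCount through DescribeInferenceComponent, and retries until capacity is available or a deadline passes.

import json
import time

import boto3

sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")


def invoke_with_retry(endpoint_name, inference_component_name, payload,
                      max_wait_seconds=600, poll_interval=15):
    """Retry invocation while the inference component scales out from zero copies."""
    deadline = time.time() + max_wait_seconds
    while True:
        try:
            response = sagemaker_runtime_client.invoke_endpoint(
                EndpointName=endpoint_name,
                InferenceComponentName=inference_component_name,
                Body=json.dumps(payload),
                ContentType="application/json",
            )
            return response["Body"].read().decode("utf-8")
        except sagemaker_runtime_client.exceptions.ValidationError:
            # No capacity yet: check how many model copies are currently live
            desc = sagemaker_client.describe_inference_component(
                InferenceComponentName=inference_component_name
            )
            copies = desc.get("RuntimeConfig", {}).get("CurrentCopyCount", 0)
            if time.time() > deadline:
                raise
            print(f"CurrentCopyCount={copies}; retrying in {poll_interval}s ...")
            time.sleep(poll_interval)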

By implementing these best practices, you can help make sure your SageMaker inference endpoints can scale out quickly and efficiently to meet changes in traffic, providing a responsive and reliable experience for your end-users.

Solution overview

With these best practices in mind, let’s now walk through the process of enabling your SageMaker inference endpoints to scale down to zero instances. This process involves a few key steps that are crucial for optimizing your endpoint’s performance and cost-efficiency:

  • Configure your endpoint – The first and most critical step is to enable managed instance scaling for your SageMaker endpoint. This is the foundational action that allows you to implement advanced scaling features, including scaling to zero. By enabling managed instance scaling, you’re creating an inference component endpoint, which is essential for the fine-grained control over scaling behaviors we discuss later in this post. After you configure managed instance scaling, you then configure the SageMaker endpoint to set the MinInstanceCount parameter to 0. This parameter allows the endpoint to scale all the way down to zero instances when not in use, maximizing cost-efficiency. Enabling managed instance scaling and setting MinInstanceCount to 0 work together to provide a highly flexible and cost-effective endpoint configuration. However, scaling up from zero will introduce cold starts, potentially impacting response times for initial requests after periods of inactivity. The inference component endpoint created through managed instance scaling serves as the foundation for implementing the sophisticated scaling policies we explore in the next step.
  • Define scaling policies – Next, you need to create two scaling policies that work in tandem to manage the scaling behavior of your endpoint effectively:
    • Scaling policy for inference component copies – This target tracking scaling policy will manage the scaling of your inference component copies. It’s a dynamic policy that adjusts the number of copies based on a specified metric, such as CPU utilization or request count. The policy is designed to scale the copy count to zero when there is no traffic, making sure you’re not paying for unused resources. Conversely, it will scale back up to your desired capacity when needed, allowing your endpoint to handle incoming requests efficiently. When configuring this policy, you need to carefully choose the target metric and threshold that best reflect your workload patterns and performance requirements.
    • Scale out from zero policy – This policy is crucial for enabling your endpoint to scale out from zero model copies when traffic arrives. It’s implemented as a step scaling policy that adds model copies when triggered by incoming requests. This allows SageMaker to provision the necessary instances to support the model copies and handle the incoming traffic. When configuring this policy, you need to consider factors such as the expected traffic patterns, the desired responsiveness of your endpoint, and the potential cold start latency. You may want to set up multiple steps in your policy to handle different levels of incoming traffic more granularly.

By implementing these scaling policies, you create a flexible and cost-effective infrastructure that can automatically adjust to your workload demands and scale to zero when needed.

Now let’s see how to use this feature step by step.

Set up your endpoint

The first crucial step in enabling your SageMaker endpoint to scale to zero is properly configuring the endpoint and its associated components. This process involves three main steps:

  1. Create the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale down all the way to zero instances when not in use.
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ExecutionRoleArn=role,
        ProductionVariants=[
            {
                "VariantName": variant_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
                "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
                "ManagedInstanceScaling": {
                    "Status": "ENABLED",
                    "MinInstanceCount": 0,
                    "MaxInstanceCount": max_instance_count,
                },
                "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
            }
        ],
    )

  2. Create the SageMaker endpoint:
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )

  3. Create the inference component for your endpoint:
    sagemaker_client.create_inference_component(
        InferenceComponentName=inference_component_name,
        EndpointName=endpoint_name,
        VariantName=variant_name,
        Specification={
            "ModelName": model_name,
            "StartupParameters": {
                "ModelDataDownloadTimeoutInSeconds": 3600,
                "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            },
            "ComputeResourceRequirements": {
                "MinMemoryRequiredInMb": 1024,
                "NumberOfAcceleratorDevicesRequired": 1,
            },
        },
        RuntimeConfig={
            "CopyCount": 1,
        },
    )

Add scaling policies

After the endpoint is deployed and InService, you can add the necessary scaling policies:

  • A target tracking policy that can scale the copy count of our inference component between 1 and n model copies, and all the way down to zero
  • A step scaling policy that will allow the endpoint to scale up from zero

Scaling policy for inference components model copies

After you create your SageMaker endpoint and inference components, you register a new auto scaling target for Application Auto Scaling. In the following code block, you set MinCapacity to 0, which is required for your endpoint to scale down to zero:

import boto3

# Application Auto Scaling client used to register the scalable target
aas_client = boto3.client("application-autoscaling")

# Register scalable target
resource_id = f"inference-component/{inference_component_name}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=0,
    MaxCapacity=max_copy_count,  # Replace with your desired maximum number of model copies
)

After you have registered your new scalable target, the next step is to define your target tracking policy. In the following code example, we set the TargetValue to 5. This setting instructs the auto scaling system to increase capacity when the number of concurrent requests per model reaches or exceeds 5.

# Create Target Tracking Scaling Policy

aas_client.put_scaling_policy(
    PolicyName="inference-component-target-tracking-scaling-policy",
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
        # Low TPS + load TPS
        "TargetValue": 5,  # you need to adjust this value based on your use case
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)

Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 1 minute (using one 1-minute data point), and the second triggers scale-in after 15 minutes (using 90 10-second data points). In practice, the scaling action triggers 1–2 minutes later than these thresholds because it takes time for the endpoint to publish metrics to CloudWatch and for Application Auto Scaling to react.

Scale out from zero model copies policy

To enable your endpoint to scale out from zero instances, complete the following steps:

  1. Create a step scaling policy that defines when and how to scale out from zero. This policy will add one model copy when triggered, enabling SageMaker to provision the instances required to handle incoming requests after being idle. The following code shows you how to define a step scaling policy. Here we have configured the policy to scale out from zero to one model copy ("ScalingAdjustment": 1). Depending on your use case, you can adjust ScalingAdjustment as required.
    aas_client.put_scaling_policy(
        PolicyName="inference-component-step-scaling-policy",
        PolicyType="StepScaling",
        ServiceNamespace=service_namespace,
        ResourceId=resource_id,
        ScalableDimension=scalable_dimension,
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            "StepAdjustments":
              [
                 {
                   "MetricIntervalLowerBound": 0,
                   "ScalingAdjustment": 1 # you need to adjust this value based on your use case
                 }
              ]
        },
    )

  2. Create a CloudWatch alarm with the metric NoCapacityInvocationFailures.

When triggered, the alarm initiates the previously defined scaling policy. For more information about the NoCapacityInvocationFailures metric, see the SageMaker documentation.

We have also set the following:

  • EvaluationPeriods to 1
  • DatapointsToAlarm to 1
  • ComparisonOperator to GreaterThanOrEqualToThreshold

This results in waiting approximately 1 minute for the step scaling policy to trigger after our endpoint receives a single request.

import boto3

# CloudWatch client used to create the alarm that triggers the step scaling policy
cw_client = boto3.client('cloudwatch')

cw_client.put_metric_alarm(
    AlarmName='ic-step-scaling-policy-alarm',
    AlarmActions=[<step_scaling_policy_arn>],  # Replace with your actual scaling policy ARN
    MetricName='NoCapacityInvocationFailures',
    Namespace='AWS/SageMaker',
    Statistic='Maximum',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name  # Replace with actual InferenceComponentName
        }
    ],
    Period=30,
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing'
)

Replace <step_scaling_policy_arn> with the Amazon Resource Name (ARN) of the scaling policy you created in the previous step.

Notice the "MinInstanceCount": 0 setting in the endpoint configuration, which allows the endpoint to scale down to zero instances. With the scaling policy, CloudWatch alarm, and minimum instances set to zero, your SageMaker inference endpoint will now be able to automatically scale down to zero instances when not in use.

Test the solution

When our SageMaker endpoint doesn’t receive requests for 15 minutes, it will automatically scale the number of model copies down to zero:

import sys
import time

# Wait for the idle period to elapse, then poll until the inference component settles
time.sleep(500)
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

desc = sagemaker_client.describe_inference_component(
    InferenceComponentName=inference_component_name
)
print(desc)

After 10 additional minutes of inactivity, SageMaker automatically stops all underlying instances of the endpoint, eliminating all associated instance costs.

If we try to invoke our endpoint while the instances are scaled down to zero, as in the following example:

# InvokeEndpoint is served by the SageMaker runtime client
sagemaker_runtime_client = boto3.client("sagemaker-runtime")

sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to be",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 256,
                "min_new_tokens": 256,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

we get a validation error:

An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.

However, after 1 minute, our step scaling policy should start. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests.

Schedule scaling down to zero

In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns:

  • Weekend scale-in (Friday evening) – Configure a scheduled action to reduce the number of model copies to zero. This instructs SageMaker to scale the number of instances behind the endpoint down to zero, completely eliminating costs during the weekend period of no usage.
  • Workweek scale-out (Monday morning) – Set up a complementary scheduled action to restore the required model capacity for the inference component on Monday morning, so your application is ready for weekday operations.

You can scale your endpoint to zero in two ways. The first method is to set the number of model copies to zero in your inference component using the UpdateInferenceComponentRuntimeConfig API. This approach maintains your endpoint configuration while eliminating compute costs during periods of inactivity.

sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={
        'CopyCount': 0
    }
)

Amazon EventBridge Scheduler can automate SageMaker API calls using cron/rate expressions for recurring schedules or one-time invocations. To function, EventBridge Scheduler requires an execution role with appropriate permissions to invoke the target API operations on your behalf. For more information about how to create this role, see Set up the execution role. The specific permissions needed depend on the target API being called.

The following code creates two scheduled actions for the inference component during 2024–2025. The first schedule scales in the CopyCount to zero every Friday at 18:00 UTC+1, and the second schedule restores model capacity every Monday at 07:00 UTC+1. The schedule will start on November 29, 2024, end on December 31, 2025, and be deleted after completion.

import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 0}, "InferenceComponentName": inference_component_name })
}

# Scale in our endpoint to 0 every friday at 18:00 UTC+1, starting on November 29, 2024
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

# Specify the SageMaker target API for the scale out schedule
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 2}, "InferenceComponentName": inference_component_name })
}

# Scale out our endpoint every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

The second method is to delete the inference components by calling the DeleteInferenceComponent API. This approach achieves the same cost-saving benefit while completely removing the components from your configuration. The following code creates a scheduled action that automatically deletes the inference component every Friday at 18:00 UTC+1 during 2024–2025. It also creates a complementary scheduled action that recreates the inference component every Monday at 07:00 UTC+1.

import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:deleteInferenceComponent",
    "Input": json.dumps({"InferenceComponentName": inference_component_name })
}

# Scale in our endpoint by deleting the IC every friday at 18:00 UTC+1
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

# Specify the SageMaker target API for the scale up schedule
input_config = {
  "EndpointName": endpoint_name,
  "InferenceComponentName": inference_component_name,
  "RuntimeConfig": {
    "CopyCount": 2
  },
  "Specification": {
    "ModelName": model_name,
    "StartupParameters": {
        "ModelDataDownloadTimeoutInSeconds": 3600,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    },
    "ComputeResourceRequirements": {
      "MinMemoryRequiredInMb": 1024,
      "NumberOfAcceleratorDevicesRequired": 1
    }
  },
  "VariantName": variant_name
}

scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createInferenceComponent",
    "Input": json.dumps(input_config)
}

# Scale out our endpoint by recreating the IC every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

To scale to zero on an endpoint with multiple inference components, all components must either be set to zero copies or deleted. You can also automate this process by using EventBridge Scheduler to trigger an AWS Lambda function that sets every inference component on the endpoint to zero copies (or deletes them), as sketched in the following example.
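The following is a minimal sketch of such a Lambda handler, assuming the endpoint name is passed in the event payload by EventBridge Scheduler. It uses the SageMaker ListInferenceComponents and UpdateInferenceComponentRuntimeConfig APIs; the function's execution role needs permission to call them, and you may need to follow NextToken if you host many components on one endpoint.

import boto3

sagemaker_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Endpoint name supplied in the event payload (for example, by EventBridge Scheduler)
    endpoint_name = event["endpoint_name"]

    # List the inference components currently attached to the endpoint
    response = sagemaker_client.list_inference_components(
        EndpointNameEquals=endpoint_name
    )

    updated = []
    for component in response["InferenceComponents"]:
        # Set each component to zero copies so the endpoint can release its instances
        sagemaker_client.update_inference_component_runtime_config(
            InferenceComponentName=component["InferenceComponentName"],
            DesiredRuntimeConfig={"CopyCount": 0},
        )
        updated.append(component["InferenceComponentName"])

    return {"updated_components": updated}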

Performance evaluation

We evaluated the performance implications of the Scale to Zero feature by conducting tests using a Llama3-8B Instruct model. These tests utilized container caching and optimized model loading techniques, and were performed with both target tracking and step scaling policies in place. Our findings for Llama3-8B Instruct show that with the target tracking policy, SageMaker scales the endpoint to zero model copies in approximately 15 minutes and then takes an additional 10 minutes to fully scale down the underlying instances, for a total scale-in time of 25 minutes. Conversely, when scaling the endpoint back up from zero, the step scaling policy triggers in around 1 minute, provisioning the instance(s) takes approximately 1.75 minutes, and instantiating the model copies takes approximately 2.28 minutes, resulting in a total scale-out time of around 5 minutes.

The performance tests on LLaMa3.1 models (8B and 70B variants) demonstrate the effectiveness of SageMaker's Scale to Zero feature, with intentionally conservative scaling times to prevent endpoint thrashing and accommodate spiky traffic patterns. For both model sizes, scaling in takes a total of 25 minutes, allowing a 15-minute buffer before initiating scale-down and an additional 10 minutes to fully decommission instances. This cautious approach helps avoid premature scaling during temporary lulls in traffic. When scaling out, the 8B model takes about 5 minutes, while the 70B model needs approximately 6 minutes. These times include a 1-minute trigger delay, followed by instance provisioning and model copy instantiation. The slightly longer scale-out times, especially for larger models, provide a balance between responsiveness and stability, ensuring the system can handle sudden traffic increases without constantly scaling up and down. This measured approach to scaling helps maintain consistent performance and cost-efficiency in environments with variable workloads.

LLaMa3.1 8B Instruct
Scale in Time to trigger target tracking (min) Time to scale in instance count to zero (min) Total time (min)
15 10 25
Scale out Time to trigger step scaling policy (min) Time to provision instance(s) (min) Time to instantiate a new model copy (min) Total time (min)
1 1.748 2.28 5.028
LLaMa3.1 70B
Scale in Time to trigger target tracking (min) Time to scale in instance count to zero (min) Total time (min)
15 10 25
Scale out Time to trigger step scaling policy (min) Time to provision instance(s) (min) Time to instantiate a new model copy (min) Total time (min)
1 3.018 1.984 6.002

Scale up Trials

LLaMa3.1 8B Instruct
Trial Time to trigger step scaling policy (min) Time to provision instance(s) (min) Time to instantiate a new model copy (min) Total time (min)
1 1 1.96 3.1 6.06
2 1 1.75 2.6 5.35
3 1 1.4 2.1 4.5
4 1 1.96 1.9 4.86
5 1 1.67 1.7 4.37
Average 1 1.748 2.28 5.028
LLaMa3.1 70B
Trial Time to trigger step scaling policy (min) Time to provision instance(s) (min) Time to instantiate a new model copy (min) Total time (min)
1 1 3.1 1.98 6.08
2 1 2.92 1.98 5.9
3 1 2.82 1.98 5.8
4 1 3.27 2 6.27
5 1 2.98 1.98 5.96
Average 1 3.018 1.984 6.002
  • Target Tracking: Scale Model Copies to Zero (min) – This refers to the time it took target tracking to trigger the alarm and SageMaker to decrease model copies to zero on the instance
  • Scale in Instance Count to Zero (min) – This refers to the time it takes SageMaker to scale the instances down to zero after all inference component model copies are zero
  • Step Scaling: Scale up Model Copies from Zero (min) – This refers to the time it took step scaling to trigger the alarm and for SageMaker to provision the instances
  • Scale out Instance Count from Zero (min) – This refers to the time it takes for SageMaker to scale out and add inference component model copies

If you want more customization and faster scaling, consider using step scaling to scale model copies instead of target tracking, as in the following sketch.
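As an illustration, the sketch below registers an inference component as a scalable target with a minimum capacity of zero and attaches a step scaling policy driven by a CloudWatch alarm on the NoCapacityInvocationFailures metric. The policy name, alarm name, thresholds, and cooldown are example values, and inference_component_name and a maximum of 2 copies are assumptions carried over from the earlier snippets; tune these for your workload.

import boto3

aas_client = boto3.client("application-autoscaling")
cloudwatch_client = boto3.client("cloudwatch")

resource_id = f"inference-component/{inference_component_name}"

# Allow the inference component to scale between 0 and 2 copies
aas_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=2,
)

# Step scaling policy that restores one copy when the alarm fires
policy_response = aas_client.put_scaling_policy(
    PolicyName="scale-out-from-zero",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ExactCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}
        ],
    },
)

# Alarm that fires when requests arrive while no capacity is available
cloudwatch_client.put_metric_alarm(
    AlarmName=f"ic-no-capacity-{inference_component_name}",
    Namespace="AWS/SageMaker",
    MetricName="NoCapacityInvocationFailures",
    Dimensions=[{"Name": "InferenceComponentName", "Value": inference_component_name}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[policy_response["PolicyARN"]],
)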

Customer testimonials

The new Scale to Zero feature for SageMaker inference endpoints has sparked considerable interest among customers. We gathered initial reactions from companies that previewed and evaluated this capability, highlighting its potential impact on AI and machine learning operations.

Atlassian, headquartered in Sydney, Australia, is a software company specializing in collaboration tools for software development and project management:

“The new Scale to Zero feature for SageMaker inference strongly aligns with our commitment to efficiency and innovation. We’re enthusiastic about its potential to revolutionize how we manage our machine learning inference resources, and we look forward to integrating it into our operations.”

– Gaurav Awadhwal, Senior Engineering Manager at Atlassian

iFood is a Latin American online food delivery firm based in Brazil. It works with over 300,000 restaurants, connecting them with millions of customers every month.

“The Scale to Zero feature for SageMaker Endpoints will be fundamental for iFood’s Machine Learning Operations. Over the years, we’ve collaborated closely with the SageMaker team to enhance our inference capabilities. This feature represents a significant advancement, as it allows us to improve cost efficiency without compromising the performance and quality of our ML services, given that inference constitutes a substantial part of our infrastructure expenses.”

– Daniel Vieira, MLOps Engineer Manager at iFood

VIDA, headquartered in Jakarta, Indonesia, is a leading digital identity provider that enables individuals and businesses to conduct business in a safe and secure digital environment.

“SageMaker’s new Scale to Zero feature for GPU inference endpoints shows immense promise for deep fake detection operations. The potential to efficiently manage our face liveness and document verification inference models while optimizing infrastructure costs aligns perfectly with our goals. We’re excited to leverage this capability to enhance our identity verification solutions.”

– Keshav Sharma, ML Platform Architect at VIDA

APOIDEA Group is a leading AI-focused FinTech ISV company headquartered in Hong Kong. Leveraging cutting-edge generative AI and deep learning technologies, the company develops innovative AI FinTech solutions for multinational banks. APOIDEA’s products automate repetitive human analysis tasks, extracting valuable financial insights from extensive financial documents to accelerate AI-driven transformation across the industry.

“SageMaker’s Scale to Zero feature is a game changer for our AI financial analysis solution in operations. It delivers significant cost savings by scaling down endpoints during quiet periods, while maintaining the flexibility we need for batch inference and model testing. This capability is transforming how we manage our GenAI workloads and evaluate new models. We’re eager to harness its power to further optimize our deep learning and NLP model deployments.”

– Mickey Yip, VP of Product at APOIDEA Group

Fortiro, based in Melbourne, Australia, is a FinTech company specializing in automated document fraud detection and financial verification for trusted financial institutions.

“The new Scale-to-Zero capability in SageMaker is a game-changer for our MLOps and delivers great cost savings. Being able to easily scale inference endpoints and GPUs means we can take advantage of a fast, highly responsive environment, without incurring unnecessary costs. Our R&D teams constantly experiment with new AI-based document fraud detection methods, which involves a lot of testing and repeating. This capability empowers us to do this both faster and more efficiently.”

– Amir Vahid, Chief Technology Officer at Fortiro

These testimonials underscore the anticipation for SageMaker’s Scale to Zero feature. As organizations begin to implement this capability, we expect to see innovative applications that balance cost efficiency with performance in machine learning deployments.

Conclusion

In this post, we introduced the new Scale to Zero feature in SageMaker, an innovative capability that enables you to optimize costs by automatically scaling in your inference endpoints when they’re not in use. We guided you through the detailed process of implementing this feature, including configuring endpoints, setting up auto scaling policies, and managing inference components for both automatic and scheduled scaling scenarios.

This cost-saving functionality presents new possibilities for how you can approach your ML operations. With this feature, you can closely align your compute resource usage with actual needs, potentially reducing costs during periods of low demand.

To help you get started quickly, we’ve prepared a comprehensive notebook containing an end-to-end example of how to configure an endpoint to scale to zero.

We encourage you to try this capability and start optimizing your SageMaker inference costs today!


About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He helps enterprise customers architect and implement cutting-edge AI solutions using AWS’s comprehensive suite of tools, with a focus on production-ready systems that follow industry best practices. In his spare time, Christian enjoys exploring nature and spending time with family and friends.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in computer science from UT Dallas. In his free time, he enjoys traveling and photography.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Raj Vippagunta is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team in AWS. He uses his vast experience of 18+ years in large-scale distributed systems and his passion for machine learning to build practical service offerings in the AI and ML space. He has helped build various at-scale solutions for AWS and Amazon. In his spare time, he likes reading books, pursuing long-distance running, and exploring new places with his family.

Read More

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce the new Container Caching capability in Amazon SageMaker, which significantly reduces the time required to scale generative AI models for inference. This innovation allows you to scale your models faster, observing up to 56% reduction in latency when scaling a new model copy and up to 30% when adding a model copy on a new instance. These improvements are available across a wide range of SageMaker’s Deep Learning Containers (DLCs), including Large Model Inference (LMI, powered by vLLM and multiple other frameworks), Hugging Face Text Generation Inference (TGI), PyTorch (Powered by TorchServe), and NVIDIA Triton. Fast container startup times are critical to scale generative AI models effectively, making sure end-users aren’t negatively impacted as inference demand increases.

As generative AI models and their hosting containers grow in size and complexity, scaling these models efficiently for inference becomes increasingly challenging. Until now, each time SageMaker scaled up an inference endpoint by adding new instances, it needed to pull the container image (often several tens of gigabytes in size) from Amazon Elastic Container Registry (Amazon ECR), a process that could take minutes. For generative AI models requiring multiple instances to handle high-throughput inference requests, this added significant overhead to the total scaling time, potentially impacting application performance during traffic spikes.

Container Caching addresses this scaling challenge by pre-caching the container image, eliminating the need to download it when scaling up. This new feature brings several key benefits for generative AI inference workloads: dramatically faster scaling to handle traffic spikes, improved resource utilization on GPU instances, and potential cost savings through more efficient scaling and reduced idle time during scale-up events. These benefits are particularly impactful for popular frameworks and tools like vLLM-powered LMI, Hugging Face TGI, PyTorch with TorchServe, and NVIDIA Triton, which are widely used in deploying and serving generative AI models on SageMaker inference.

In our tests, we’ve seen substantial improvements in scaling times for generative AI model endpoints across various frameworks. The implementation of Container Caching for running Llama3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. This enhancement allows customers running high-throughput production workloads to handle sudden traffic spikes more efficiently, providing more predictable scaling behavior and minimal impact on end-user latency across their ML infrastructure, regardless of the chosen inference framework.

In this post, we explore the new Container Caching feature for SageMaker inference, addressing the challenges of deploying and scaling large language models (LLMs). We discuss how this innovation significantly reduces container download and load times during scaling events, a major bottleneck in LLM and generative AI inference. You’ll learn about the key benefits of Container Caching, including faster scaling, improved resource utilization, and potential cost savings. We showcase its real-world impact on various applications, from chatbots to content moderation systems. We then guide you through getting started with Container Caching, explaining its automatic enablement for SageMaker provided DLCs and how to reference cached versions. Finally, we delve into the supported frameworks, with a focus on LMI, PyTorch, Hugging Face TGI, and NVIDIA Triton, and conclude by discussing how this feature fits into our broader efforts to enhance machine learning (ML) workloads on AWS.

This feature is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

The challenge of deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

  • Provisioning new compute resources
  • Downloading container image
  • Loading container image
  • Loading the model weights into memory
  • Initializing the inference runtime
  • Shifting traffic to serve new requests

The cumulative time for these steps can range from several minutes to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during traffic spikes, making it a critical area for optimization in the field of AI inference infrastructure.

The introduction of Container Caching for SageMaker DLCs brings several key benefits for inference workloads:

  • Faster scaling – By having the latest DLCs pre-cached, the time required to scale inference endpoints in response to traffic spikes is substantially reduced. This provides a more consistent and responsive experience for inference hosting, allowing systems to adapt quickly to changing demand patterns. ML engineers can now design more aggressive auto scaling policies, knowing that new instances can be brought online in a fraction of the time previously required.
  • Quick endpoint startup – Using pre-cached containers significantly decreases the startup time for new model deployments. This acceleration in the deployment pipeline enables more frequent model updates and iterations, fostering a more agile development cycle. AI and ML engineers can now move from model training to production deployment with unprecedented speed, reducing time-to-market for new AI features and improvements.
  • Improved resource utilization – Container Caching minimizes idle time on expensive GPU instances during the initialization phase. Instead of waiting for container downloads, these high-performance resources can immediately focus on inference tasks. This optimization provides more efficient use of computational resources, potentially allowing for higher throughput and better cost-effectiveness.
  • Cost savings – The cumulative effect of faster deployments and more efficient scaling can lead to significant reductions in overall inference costs. By minimizing idle time and improving resource utilization, organizations can potentially serve the same workload with fewer instances or handle increased demand without proportional increases in infrastructure costs. Additionally, the improved responsiveness can lead to better user experiences, potentially driving higher engagement and revenue in customer-facing applications.
  • Enhanced compatibility – By focusing on the latest SageMaker DLCs, this caching mechanism makes sure users always have quick access to the most recent and optimized environments for their models. This can be particularly beneficial for teams working with cutting-edge AI technologies that require frequent updates to the underlying frameworks and libraries.

Container Caching represents a significant advancement in AI inference infrastructure. It addresses a critical bottleneck in the deployment process, empowering organizations to build more responsive, cost-effective, and scalable AI systems.

Getting started with Container Caching for inference

Container Caching is automatically enabled for popular SageMaker DLCs like LMI, Hugging Face TGI, NVIDIA Triton, and PyTorch used for inference. To use cached containers, you only need to make sure you’re using a supported SageMaker container. No additional configuration or steps are required.

The following table lists the supported DLCs.

SageMaker DLC Starting Version Starting Container
LMI 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
LMI-TRT 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124
LMI-Neuron 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1
TGI-GPU 2.4.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0
TGI-Neuron 2.1.2 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0
Pytorch-GPU 2.5.1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker
Pytorch-CPU 2.5.1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-cpu-py311-ubuntu22.04-sagemaker
Triton 24.09 763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.09-py3

In the following sections, we discuss how to get started with several popular SageMaker DLCs.

Hugging Face TGI

Developed by Hugging Face, TGI is an inference framework for deploying and serving LLMs, offering a purpose-built solution that combines security, performance, and ease of management. TGI is specifically designed to deliver high-performance text generation through advanced features like tensor parallelism and continuous batching. It supports a wide range of popular open source LLMs, making it a popular choice for diverse AI applications. What sets TGI apart is its optimization for both NVIDIA GPUs and AWS accelerators with AWS Inferentia and AWS Trainium, providing optimal performance across different hardware configurations.

With the introduction of Container Caching, customers using the latest release of TGI containers on SageMaker will experience improved scaling performance. The caching mechanism works automatically, requiring no additional configuration or code changes. This seamless integration means that organizations can immediately benefit from faster scaling without any operational overhead.

Philipp Schmid, Technical Lead at Hugging Face, shares his perspective on this enhancement: “Hugging Face TGI containers are widely used by SageMaker inference customers, offering a powerful solution optimized for running popular models from the Hugging Face. We are excited to see Container Caching speed up auto scaling for users, expanding the reach and adoption of open models from Hugging Face.”

You can use Container Caching with Hugging Face TGI using the following code:

# Using Container Caching for Hugging Face TGI
# Create an inference component (IC) with the Hugging Face TGI image

create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0",
    model_url="s3://path/to/your/model/artifacts"
)

Note: We cache the latest version of the currently maintained images; see https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only for the list.
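Note that create_inference_component in these snippets is a simplified placeholder rather than a boto3 call. A fuller sketch using the boto3 CreateInferenceComponent API might look like the following; the component, endpoint, and variant names and the resource requirements are illustrative values to replace with your own.

import boto3

sagemaker_client = boto3.client("sagemaker")

sagemaker_client.create_inference_component(
    InferenceComponentName="tgi-llm-component",   # illustrative component name
    EndpointName="my-cached-container-endpoint",  # existing SageMaker endpoint
    VariantName="AllTraffic",                     # variant backing the endpoint
    Specification={
        "Container": {
            # Supported (cached) Hugging Face TGI DLC from the preceding table
            "Image": "763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0",
            "ArtifactUrl": "s3://path/to/your/model/artifacts",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)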

NVIDIA Triton

NVIDIA Triton Inference Server is a model server from NVIDIA that supports multiple deep learning frameworks and model formats. On SageMaker, Triton offers a comprehensive serving stack with support for various backends, including TensorRT, PyTorch, Python, and more. Triton is particularly powerful because of its ability to optimize inference across different hardware configurations while providing features like dynamic batching, concurrent model execution, and ensemble models. The Triton architecture enables efficient model serving through features like multi-framework support, optimized GPU utilization, and flexible model management.

With Container Caching, Triton deployments on SageMaker become even more efficient, especially when scaling large-scale inference workloads. This is particularly beneficial when deploying multiple models using Triton’s Python backend or when running model ensembles that require complex preprocessing and postprocessing pipelines. This improves the deployment and scaling experience for Triton workloads by eliminating the need to repeatedly download container images during scaling events.

Eliuth Triana, Global Lead Amazon Developer Relations at NVIDIA, comments on this enhancement:

“The integration of Container Caching with NVIDIA Triton Inference Server on SageMaker represents a significant advancement in serving machine learning models at scale. This feature perfectly complements Triton’s advanced serving capabilities by reducing deployment latency and optimizing resource utilization during scaling events. For customers running production workloads with Triton’s multi-framework support and dynamic batching, Container Caching provides faster response to demand spikes while maintaining Triton’s performance optimizations.”

To use Container Caching with NVIDIA Triton, use the following code:

# Using Container Caching for NVIDIA Triton
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:24.09-py3",
    model_url="s3://path/to/your/model/artifacts"
)

PyTorch and TorchServe (now with vLLM engine integration)

The SageMaker Deep Learning Container for PyTorch is powered by TorchServe. It offers a comprehensive solution for deploying and serving PyTorch models, including large language models (LLMs), in production environments. TorchServe provides robust model serving capabilities through HTTP REST APIs, along with flexible configuration options and performance optimization features such as server-side batching, multi-model serving, and dynamic model loading. The container supports a wide range of models and advanced features, including quantization and parameter-efficient methods like LoRA.

The latest version of the PyTorch DLC also integrates TorchServe with the vLLM engine, which leverages advanced features such as vLLM’s state-of-the-art PagedAttention and continuous batching. It supports single-node, multi-GPU distributed inference, enabling tensor parallel sharding for larger models. The integration of Container Caching significantly reduces scaling times, which is particularly beneficial for large models during auto scaling events. TorchServe’s handler system allows for easy customization of pre- and post-processing logic, making it adaptable to various use cases. With its growing feature set, TorchServe is a popular choice among inference customers for deploying and scaling machine learning models.

You can use Container Caching with PyTorch using the following code:

# Using Container Caching for PyTorch
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker",
    model_url="s3://path/to/your/model/artifacts"
)

LMI container

The Large Model Inference (LMI) container is a high-performance serving solution that can be used through a no-code interface with smart defaults that can be extended to fit your unique needs. LMI delivers performance differentiation through advanced optimizations, outpacing open source backends like vLLM, TensorRT-LLM, and Transformers NeuronX while offering a unified UI.

Popular features such as continuous batching, token streaming, and speculative decoding are available out of the box to provide superior throughput, latency, and scalability. LMI supports a wide array of use cases like multi-node inference and model personalization through LoRA adapters, and performance optimizations like quantization and compilation.

With Container Caching, LMI containers deliver even faster scaling capabilities, particularly beneficial for large-scale LLM deployments where container startup times can significantly impact auto scaling responsiveness. This enhancement works seamlessly across all supported backends while maintaining the container’s advanced features and optimization capabilities.

Contributors of LMI containers comment on this enhancement:

“The addition of Container Caching to LMI containers represents a significant step forward in making LLM deployment more efficient and responsive. This feature complements our efforts to speed up model loading through pre-sharding, weight streaming, and compiler caching, enabling customers to achieve both high-performance inference and rapid scaling capabilities, which is crucial for production LLM workloads.”

To use Container Caching with LMI, use the following code:

# Using Container Caching for LMI
create_inference_component(
    image= "763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124",
    model_url="s3://path/to/your/model/artifacts"
)

Performance evaluation

The implementation of Container Caching for running Llama3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. For the scenario of scaling the model by adding a new instance, the E2E scaling time decreased from 580 seconds (9.67 minutes) to 407 seconds (6.78 minutes), yielding an improvement of 172 seconds (2.87 minutes), which translates to a 30% reduction in scaling time. These results demonstrate that Container Caching substantially and reliably enhances the efficiency of model scaling operations, particularly for large language models like Llama3.1 70B, with more pronounced benefits observed when scaling on existing instances.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models.

The following table summarizes our setup.

Region CMH
Instance Type p4d.24xlarge
Container LMI V13.31
Container Image 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model Llama 3.1 70B

Scaling the model by adding a new instance

For this scenario, we explore scaling the model by adding a new instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling (sec) Time to Spin Up an Instance (sec) Time to Instantiate a New Model Copy (sec) End-to-End Scaling Latency (sec)
1 40 223 339 602
2 40 203 339 582
3 40 175 339 554
4 40 210 339 589
5 40 191 339 570
Average 40 200 339 580

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling (sec) Time to Spin Up an Instance (sec) Time to Instantiate a New Model Copy (sec) End-to-End Scaling Latency (sec)
1 40 185 173 398
2 40 175 188 403
3 40 164 208 412
4 40 185 187 412
5 40 185 187 412
Average 40 178.8 188.6 407.4

Scaling the model on an available instance

In this scenario, we explore scaling the model on an available instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling (sec) Time to Instantiate a New Model Copy (sec) End-to-End Scaling Latency (sec)
1 40 339 379
2 40 339 379
3 40 339 379
4 40 339 379
5 40 339 379
Average 40 339 379

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling (sec) Time to Instantiate a New Model Copy (sec) End-to-End Scaling Latency (sec)
1 40 150 190
2 40 122 162
3 40 121 161
4 40 119 159
5 40 119 159
Average 40 126.2 166.2

Summary of findings

The following table summarizes our results in both scenarios.

Scenario End-to-End Scaling Time Before (sec) End-to-End Scaling Time After (sec) Absolute Improvement (sec) Improvement (%)
Scaling the model on an available instance 379 166 213 56
Scaling the model by adding a new instance 580 407 172 30

Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may see lower instance spin-up times compared to On-Demand capacity, depending on the instance type.

Conclusion

Container Caching for inference is just one of the many ways SageMaker can improve the efficiency and performance of ML workloads on AWS. We encourage you to try out this new feature for your inference workloads and share your experiences with us. Your feedback is invaluable as we continue to innovate and improve our ML platform.

To learn more about Container Caching and other SageMaker features for inference, refer to Amazon SageMaker Documentation or check out our GitHub repositories for examples and tutorials on deploying models for inference.


About the Authors

Wenzhao Sun, PhD, is a Sr. Software Dev Engineer with the SageMaker Inference team. He possesses a strong passion for pushing the boundaries of technical solutions, striving to maximize their theoretical potential. His primary focus is on delivering secure, high-performance, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys traveling and video games.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Aakash Deep is a Software Development Engineering Manager with the Amazon SageMaker Inference team. He enjoys working on machine learning and distributed systems. His mission is to deliver secure, highly performant, highly scalable, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys hiking and traveling.

Anisha Kolla is a Software Development Engineer with the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.
