Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3

This is a guest post co-written with Ratnesh Jamidar and Vinayak Trivedi from Sprinklr.

Sprinklr’s mission is to unify silos, technology, and teams across large, complex companies. To achieve this, we provide four product suites, Sprinklr Service, Sprinklr Insights, Sprinklr Marketing, and Sprinklr Social, as well as several self-serve offerings.

Each of these products is infused with artificial intelligence (AI) capabilities to deliver exceptional customer experience. Sprinklr’s specialized AI models streamline data processing, gather valuable insights, and enable workflows and analytics at scale to drive better decision-making and productivity.

In this post, we describe the scale of our AI offerings, the challenges with diverse AI workloads, and how we optimized mixed AI workload inference performance with AWS Graviton3 based c7g instances and achieved 20% throughput improvement, 30% latency reduction, and reduced our cost by 25–30%.

Sprinklr’s AI scale and challenges with diverse AI workloads

Our purpose-built AI processes unstructured customer experience data from millions of sources, providing actionable insights and improving productivity for customer-facing teams to deliver exceptional experiences at scale. To understand our scaling and cost challenges, let’s look at some representative numbers. Sprinklr’s platform uses thousands of servers that fine-tune and serve over 750 pre-built AI models across over 60 verticals, and run more than 10 billion predictions per day.

To deliver a tailored user experience across these verticals, we deploy patented AI models fine-tuned for specific business applications and use nine layers of machine learning (ML) to extract meaning from data across formats: automatic speech recognition, natural language processing, computer vision, network graph analysis, anomaly detection, trends, predictive analysis, natural language generation, and similarity engine.

The diverse and rich database of models brings unique challenges for choosing the most efficient deployment infrastructure that gives the best latency and performance.

For example, for mixed AI workloads, the AI inference is part of the search engine service with real-time latency requirements. In these cases, the model sizes are smaller, which means the communication overhead with GPUs or ML accelerator instances outweighs their compute performance benefits. Also, inference requests are infrequent, which means accelerators are more often idle and not cost-effective. Therefore, the production instances were not cost-effective for these mixed AI workloads, causing us to look for new instances that offer the right balance of scale and cost-effectiveness.

Cost-effective ML inference using AWS Graviton3

Graviton3 processors are optimized for ML workloads, including support for bfloat16, Scalable Vector Extension (SVE), twice the Single Instruction Multiple Data (SIMD) bandwidth, and 50% more memory bandwidth compared to AWS Graviton2 processors, making them an ideal choice for our mixed workloads. Our goal is to use the latest technologies for efficiency and cost savings, so when AWS released Graviton3-based Amazon Elastic Compute Cloud (Amazon EC2) instances, we were excited to try them in our mixed workloads, especially given our previous Graviton experience. For over 3 years, we have run our search infrastructure on Graviton2-based EC2 instances and our real-time and batched inference workloads on AWS Inferentia ML-accelerated instances, and in both cases we improved latency by 30% and achieved up to 40% price-performance benefits over comparable x86 instances.

To migrate our mixed AI workloads from x86-based instances to Graviton3-based c7g instances, we took a two-step approach. First, we had to experiment and benchmark in order to determine that Graviton3 was indeed the right solution for us. After that was confirmed, we had to perform the actual migration.

First, we benchmarked our workloads using the readily available Graviton Deep Learning Containers (DLCs) in a standalone environment. As early adopters of Graviton for ML workloads, we initially found it challenging to identify the right software versions and runtime tunings. During this journey, we collaborated closely and frequently with our AWS technical account manager and the Graviton software engineering teams, who provided optimized software packages and detailed instructions on how to tune them for optimum performance. In our test environment, we observed a 20% throughput improvement and a 30% latency reduction across multiple natural language processing models.
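To make the benchmarking step concrete, the following is a minimal sketch of a standalone harness that measures throughput and latency for any prediction function. It is not the tooling Sprinklr used. The names `benchmark`, `relative_change`, and `toy_model` are hypothetical, and `toy_model` is only a stand-in for a real model call.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(predict: Callable[[str], object], inputs: List[str], warmup: int = 5) -> Dict[str, float]:
    """Time `predict` on every input and summarize throughput and latency."""
    # Warm-up calls are excluded from timing so one-time setup costs do not skew results.
    for text in inputs[:warmup]:
        predict(text)

    latencies_ms: List[float] = []
    start = time.perf_counter()
    for text in inputs:
        t0 = time.perf_counter()
        predict(text)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start

    latencies_ms.sort()
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "throughput_per_s": len(inputs) / elapsed,
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
    }


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of `candidate` versus `baseline` (0.2 means 20% higher)."""
    return (candidate - baseline) / baseline


def toy_model(text: str) -> int:
    # Hypothetical stand-in for a real inference call.
    return sum(ord(ch) for ch in text)
```

Running `benchmark` with the same inputs on an x86 instance and on a c7g instance, then passing the two throughput (or latency) figures to `relative_change`, gives the kind of percentage comparison reported in this post.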

After we had validated that Graviton3 met our needs, we integrated the optimizations into our production software stack. The AWS account team assisted us promptly, helping us ramp up quickly to meet our deployment timelines. Overall, migration to Graviton3-based instances was smooth, and it took less than 2 months to achieve the performance improvements in our production workloads.


By migrating our mixed inference/search workloads to Graviton3-based c7g instances from the comparable x86-based instances, we achieved the following:

  • Higher performance – We realized 20% throughput improvement and 30% latency reduction.
  • Reduced cost – We achieved 25–30% cost savings.
  • Improved customer experience – By reducing the latency and increasing throughput, we significantly improved the performance of our products and services, providing the best user experience for our customers.
  • Sustainable AI – Because we saw a higher throughput on the same number of instances, we were able to lower our overall carbon footprint, and we made our products appealing to environmentally conscious customers.
  • Better software quality and maintenance – The AWS engineering team upstreamed all the software optimizations into PyTorch and TensorFlow open source repositories. As a result, our current software upgrade process on Graviton3-based instances is seamless. For example, PyTorch (v2.0+), TensorFlow (v2.9+), and Graviton DLCs come with Graviton3 optimizations and the user guides provide best practices for runtime tuning.

So far, we have migrated PyTorch- and TensorFlow-based DistilRoBERTa-base, spaCy clustering, Prophet, and XLM-R models to Graviton3-based c7g instances. These models serve intent detection, text clustering, creative insights, text classification, smart budget allocation, and image download services. These services power our unified customer experience (unified-cxm) platform and conversational AI to allow brands to build more self-serve use cases for their customers. Next, we are migrating ONNX and other larger models to Graviton3-based m7g general purpose and Graviton2-based g5g GPU instances to achieve similar performance improvements and cost savings.


Switching to Graviton3-based instances was quick in terms of engineering time, and resulted in 20% throughput improvement, 30% latency reduction, 25–30% cost savings, improved customer experience, and a lower carbon footprint for our workloads. Based on our experience, we will continue to seek new compute from AWS that will reduce our costs and improve the customer experience.

About the Authors

Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for Machine Learning and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions with Arm SoCs.

Gaurav Garg is a Sr. Technical Account Manager at AWS with 15 years of experience. He has a strong operations background. In his role he works with Independent Software Vendors to build scalable and cost-effective solutions with AWS that meet the business requirements. He is passionate about Security and Databases.

Ratnesh Jamidar is an AVP of Engineering at Sprinklr with 8 years of experience. He is a seasoned Machine Learning professional with expertise in designing and implementing large-scale, distributed, and highly available AI products and infrastructure.

Vinayak Trivedi is an Associate Director of Engineering at Sprinklr with 4 years of experience in Backend & AI. He is proficient in Applied Machine Learning & Data Science, with a history of building large-scale, scalable and resilient systems.

How Wiz is empowering organizations to remediate security risks faster with Amazon Bedrock

Wiz is a cloud security platform that enables organizations to secure everything they build and run in the cloud by rapidly identifying and removing critical risks. Over 40% of the Fortune 100 trust Wiz’s purpose-built cloud security platform to gain full-stack visibility, accurate risk prioritization, and enhanced business agility. Organizations can connect Wiz in minutes to scan the entire cloud environment without agents and identify the issues representing real risk. Security and cloud teams can then proactively remove risks and harden cloud environments with remediation workflows.

Artificial intelligence (AI) has revolutionized the way organizations function, paving the way for automation and improved efficiency in tasks that were traditionally manual. One such use case is applying AI in security organizations to improve security processes and strengthen the overall security posture. A major challenge in cloud security is discerning the most effective way to resolve an identified issue so that you can respond quickly.

Wiz has harnessed the power of generative AI to help organizations remove risks in their cloud environment faster. With Wiz’s new integration with Amazon Bedrock, Wiz customers can now generate guided remediation steps backed by foundation models (FMs) running on Amazon Bedrock to reduce their mean time to remediation (MTTR). Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

“The Wiz and Amazon Bedrock integration enables organizations to further enhance security and improve remediation time by leveraging a choice of powerful foundation models to generate GenAI-powered remediation steps.”

– Vivek Singh, Senior Manager, Product Management-Tech, AWS AI

In this post, we share how Wiz uses Amazon Bedrock to generate remediation guidance that allows customers to quickly address security risks in their cloud environment.

Detecting security risks in the cloud with the Wiz Security Graph

Wiz scans cloud environments without agents and runs deep risk assessment across network exposures, vulnerabilities, misconfigurations, identities, data, secrets, and malware. Wiz stores the entire technology stack as well as any risks detected on the Wiz Security Graph, which is backed by Amazon Neptune. Neptune enables Wiz to quickly traverse the graph and understand interconnected risk factors in seconds and how they create an attack path. The Security Graph allows Wiz to surface these critical attack paths in the form of Wiz Issues. For example, a Wiz Issue can alert of a publicly exposed Amazon Elastic Compute Cloud (Amazon EC2) instance that is vulnerable, has admin permissions, and can access sensitive data. The following graph illustrates this attack path.

Attack path

With its Security Graph, Wiz provides customers with pinpoint-accurate alerts on security risks in their environment, reduces the noise faced with traditional security tools, and enables organizations to focus on the most critical risks in their environment.

Remediating cloud risks with guided remediation provided by Amazon Bedrock

To help customers remediate security risks even faster, Wiz uses Amazon Bedrock to analyze metadata from Wiz Issues to generate effective remediation recommendations for customers. With Amazon Bedrock, Wiz combines its deep risk context with cutting-edge FMs to offer enhanced remediation guidance to customers. Customers can scale their remediation workflow and minimize their MTTR by generating straightforward-to-use copy-paste remediation steps that can be directly implemented into the tool of their choice, such as the AWS Command Line Interface (AWS CLI), Terraform, AWS CloudFormation, Pulumi, Go, and Python, or directly using the cloud environment console. The following screenshot showcases an example of the remediation steps generated by Amazon Bedrock for a Wiz Issue.

An example of the remediation steps generated by Amazon Bedrock for a Wiz Issue

Wiz sends a prompt with all the relevant context around a security risk to Amazon Bedrock, along with instructions on how to present the results based on the target platform. Amazon Bedrock native APIs allow Wiz to select the best model for the use case. When the response is received, it is parsed and presented in a straightforward manner in the Wiz portal.
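As an illustration of combining issue context with platform-specific output instructions, here is a minimal sketch. It does not reflect Wiz's actual prompts or data model. The `FORMAT_INSTRUCTIONS` table, the `build_remediation_prompt` function, and the issue fields are all hypothetical.

```python
import json
from typing import Dict

# Hypothetical per-platform output instructions; real prompts would be tuned per model.
FORMAT_INSTRUCTIONS: Dict[str, str] = {
    "aws-cli": "Return the fix as AWS CLI commands, one per line.",
    "terraform": "Return the fix as a Terraform (HCL) snippet.",
    "cloudformation": "Return the fix as an AWS CloudFormation YAML snippet.",
}


def build_remediation_prompt(issue: Dict[str, object], target_platform: str) -> str:
    """Combine the issue context with instructions for the requested target platform."""
    if target_platform not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unsupported target platform: {target_platform}")
    # Serialize the context deterministically so identical issues yield identical prompts.
    context = json.dumps(issue, indent=2, sort_keys=True)
    return (
        "You are assisting a cloud security engineer.\n"
        "Security issue context (JSON):\n"
        f"{context}\n\n"
        f"{FORMAT_INSTRUCTIONS[target_platform]}\n"
        "Explain each step briefly before the code."
    )
```

Keeping the format instructions in a lookup table means that supporting another target, such as Pulumi, only requires a new entry rather than a new prompt template.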

To fully operationalize this functionality in production, the Wiz backend has a service running on Amazon Elastic Kubernetes Service (Amazon EKS) that receives the customer request to generate remediation steps, collects the context of the alert the customer wants to remediate, and runs personally identifiable information (PII) redaction on the data to remove any sensitive data. Then, another service running on Amazon EKS pulls the resulting data and sends it to Amazon Bedrock. Such a flow can run in each needed AWS Region supported by Amazon Bedrock to address any compliance needs of their customers. In addition, to secure the usage of Amazon Bedrock with least privilege, Wiz uses AWS permission sets and follows AWS best practices. The Wiz service sending the prompt to Amazon Bedrock has a dedicated AWS Identity and Access Management (IAM) role that allows it to communicate only with the specific Amazon Bedrock service and to only generate those requests. Amazon Bedrock also has restrictions to block any data coming from a non-authorized service. Using these AWS services and the Wiz Security Graph, Wiz helps its customers adopt the most advanced LLMs to speed up the process of addressing complex security issues in a straightforward and secure manner. The following diagram illustrates this architecture.

System architecture
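Two pieces of this flow can be sketched briefly: redacting sensitive values before the data leaves the service, and scoping the IAM role to a single model. The following is an illustrative sketch only. The regular expressions are simplistic placeholders rather than a complete PII detector, and the function names, Region, and model ID are assumptions chosen for the example.

```python
import re
from typing import Dict

# Simplistic, illustrative patterns; production redaction needs a far more thorough detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ACCOUNT_ID_RE = re.compile(r"\b\d{12}\b")


def redact(text: str) -> str:
    """Replace email addresses, IPv4 addresses, and 12-digit account IDs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IPV4_RE.sub("[IP]", text)
    text = ACCOUNT_ID_RE.sub("[ACCOUNT_ID]", text)
    return text


def least_privilege_policy(region: str, model_id: str) -> Dict[str, object]:
    """Build an IAM policy document that only allows invoking one foundation model."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
            }
        ],
    }


# Example values are assumptions for illustration.
example_policy = least_privilege_policy("us-east-1", "anthropic.claude-v2")
```

Redacting before the prompt is assembled keeps sensitive values out of every downstream component, and the single-action, single-resource policy mirrors the least-privilege approach described above.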

Wiz customers are already experiencing the advantages of our new AI-driven remediation:

“The faster we can remediate security risks, the more we can focus on driving broader strategic initiatives. With Wiz’s AI-powered remediation, we can quickly generate remediation steps that our security team and developers can simply copy-paste to remediate the issue.”

– Rohit Kohli, Deputy CISO, Genpact

By using Amazon Bedrock for generating AI-powered remediation steps, we learnt that security teams are able to minimize the time spent investigating complex risks by 40%, allowing them to focus on mitigating more risks. Furthermore, they are able to empower developers to remediate risks by removing the need for security expertise and providing them with exact steps to take. Not only does Wiz use AI to enhance security processes for customers, but it also makes it effortless for customers to securely adopt AI in their organization with its AI Security Posture Management capabilities, empowering them to protect their AI models while increasing innovation.


Using generative AI for generating enhanced remediation steps marks a significant advancement in the realm of problem-solving and automation. By harnessing the power of AI models powered by Amazon Bedrock, Wiz users can quickly remediate risks with straightforward remediation guidance, reducing manual efforts and improving MTTR. Learn more about Wiz and check out a live demo.

About the Authors

Shaked Rotlevi is a Technical Product Marketing Manager at Wiz focusing on AI security. Prior to Wiz she was a Solutions Architect at AWS working with public sector customers as well as a Technical Program Manager for a security service team. In her spare time she enjoys playing beach volleyball and hiking.

Itay Arbel is a Lead Product Manager at Wiz. Before joining Wiz, Itay was a product manager at Microsoft and did an MBA at Oxford University, majoring in high tech and emerging technologies. Itay is Wiz’s product lead for the effort of helping organizations secure their AI pipeline and usage of this new emerging technology.

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Adi Avni is a Senior Solutions Architect at AWS based in Israel. Adi works with AWS ISV customers, helping them to build innovative, scalable and cost-effective solutions on AWS. In his spare time, he enjoys sports and traveling with family and friends.

PyTorch Foundation Welcomes New Executive Director

The PyTorch Foundation is excited to welcome Matt White, our new executive director. The PyTorch Foundation formed in 2022 with the goal to drive adoption of AI tooling by fostering and sustaining an ecosystem of open source, vendor-neutral projects with PyTorch. Over the past 2 years, we’ve seen excellent growth across the project – with both contributor and member growth.

“I am honored to be a part of the PyTorch Foundation, working with such a passionate and skilled community,” said Matt White. “I am looking forward to working with our contributors and members to advance the PyTorch ecosystem through research, cutting edge technologies and open source best practices.”

Matt is a career technologist, researcher and innovator and has over 25 years of experience in AI, data, autonomous systems and simulations. He is the Co-founder and Chair of the Open Metaverse Foundation, a part of the Linux Foundation. Previously, Matt was the Director of the Generative AI Commons at the Linux Foundation, leading the advancement of open science and open-source artificial intelligence projects. He is also the GM of AI at the Linux Foundation.

LST-Bench: A new benchmark tool for open table formats in the data lake

This paper was presented at the ACM SIGMOD/Principles of Database Systems Conference (SIGMOD/PODS 2024), the premier forum on large-scale data management and databases.

As organizations grapple with ever-expanding datasets, the adoption of data lakes has become a vital strategy for scalable and cost-effective data management. The success of these systems largely depends on the file formats used to store the data. Traditional formats, while efficient in data compression and organization, falter with frequent updates. Advanced table formats like Delta Lake, Apache Iceberg, and Apache Hudi offer promising solutions with easier data modifications and historical tracking, yet their efficacy lies in their ability to handle continuous updates, a challenge that requires extensive and thorough evaluation.

Our paper, “LST-Bench: Benchmarking Log-Structured Tables in the Cloud,” presented at SIGMOD 2024, introduces an innovative tool designed to evaluate the performance of different table formats in the cloud. LST-Bench builds on the well-established TPC-DS benchmark, which measures how efficiently systems handle large datasets and complex queries, and includes features specifically designed for table formats, simplifying the process of testing them under real-world conditions. Additionally, it automatically conducts tests and collects essential data from both the computational engine and various cloud services, enabling accurate performance evaluation.

Flexible and adaptive testing

Designed for flexibility, LST-Bench adapts to a broad range of scenarios, as illustrated in Figure 1. The framework was developed by incorporating insights from engineers, facilitating the integration of existing workloads like TPC-DS, while promoting reusability. For example, each test session establishes a new connection to the data-processing engine, organizing tasks as a series of statements. This setup permits developers to run multiple tasks either sequentially within a single session or concurrently across various sessions, reflecting real-world application patterns.

Figure 1. Workload components in LST-Bench and their relationships. A task is a sequence of SQL statements, while a session is a sequence of tasks that represents a logical unit of work or a user session. A phase is a group of concurrent sessions that must be completed before the next phase can start. Lastly, a workload is a sequence of phases.
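The hierarchy in Figure 1 can be modeled with a few nested types. The sketch below is only an illustration of the task, session, phase, and workload relationships. It is not LST-Bench's actual configuration format or API, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    statements: List[str]  # SQL statements executed in order


@dataclass
class Session:
    tasks: List[Task]  # tasks run sequentially within the session


@dataclass
class Phase:
    name: str
    sessions: List[Session]  # sessions in a phase run concurrently


@dataclass
class Workload:
    phases: List[Phase]  # a phase must finish before the next one starts


def execution_plan(workload: Workload) -> List[str]:
    """Describe the order in which the workload's components would run."""
    plan: List[str] = []
    for phase in workload.phases:
        plan.append(f"phase {phase.name}: {len(phase.sessions)} concurrent session(s)")
        for index, session in enumerate(phase.sessions):
            names = " -> ".join(task.name for task in session.tasks)
            plan.append(f"  session {index}: {names}")
    return plan
```

For example, a phase with one session running a query task followed by an optimize task, alongside a second session running data maintenance, models concurrent reads and writes against the same tables.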

The TPC-DS workload comprises the following foundational tasks:

  • Load task: Loads data into tables for experimentation.
  • Single User task: Executes complex queries to test the engine’s upper performance limit.
  • Data Maintenance task: Handles data insertions and deletions.

LST-Bench introduces the following tasks specific to table formats:

  • Optimize task: Compacts the data files within a table.
  • Time Travel task: Enables querying data as it appeared at a specified point in the past.
  • Parameterized Custom task: Allows for the integration of user-defined code to create dynamic workflows.

These features enable LST-Bench to evaluate aspects of table formats that are not covered by TPC-DS, providing deeper insights into their performance, as shown in Figure 2.

A diagram illustrating various LST-Bench tasks combined to create workloads that provide insights into table formats. The workloads assess the handling of frequent data modifications over time, optimizing tables for multiple modifications of varying sizes, managing simultaneous reading and writing sessions, querying data across different time points, and evaluating the impact of batch size variations on read query performance.
Figure 2. LST-Bench expands on TPC-DS by introducing a flexible workload representation and incorporating extensions that help users gain insights into table formats previously overlooked by the original benchmark.

A degradation rate metric to measure stability

In addition to these workload extensions, LST-Bench introduces new metrics to evaluate table formats both comprehensively and fairly. It retains the traditional metric categories like performance, storage, and compute efficiency, and it adds a new stability metric called degradation rate. This new metric specifically addresses the impact of accumulating small files in the data lake—a common issue arising from frequent, small updates—providing an assessment of the system’s efficiency over time.

The degradation rate is calculated by dividing a workload into different phases. The degradation rate \(S_{DR}\) is defined as follows:

\(S_{DR} = \dfrac{1}{n} \sum\limits_{i=1}^{n} \dfrac{M_{i} - M_{i-1}}{M_{i-1}}\)

Here, \(M_i\) represents the performance or efficiency metric value of the \(i^{th}\) iteration of a workload phase, and \(n\) reflects the total number of iterations of that phase. Intuitively, \(S_{DR}\) is the rate at which a metric grows or shrinks, reflecting cumulative effects of changes in the underlying system’s state. This rate provides insight into how quickly a system degrades over time. A stable system demonstrates a low \(S_{DR}\), indicating minimal degradation.
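As a worked example of the definition above, the following sketch computes the degradation rate from a list of metric values. It assumes the list holds the values from the baseline through the last iteration (that is, n + 1 values for n iterations). The function name `degradation_rate` is hypothetical and not part of LST-Bench.

```python
from typing import Sequence


def degradation_rate(metrics: Sequence[float]) -> float:
    """Average relative change between consecutive metric values M_0, M_1, ..., M_n."""
    if len(metrics) < 2:
        raise ValueError("need at least two metric values")
    n = len(metrics) - 1
    total = 0.0
    for i in range(1, n + 1):
        # Relative change of iteration i with respect to iteration i - 1.
        total += (metrics[i] - metrics[i - 1]) / metrics[i - 1]
    return total / n
```

If a phase's latency grows from 100 s to 110 s to 121 s, each step is a 10% increase, so the degradation rate is 0.1. A phase whose latency stays constant has a degradation rate of 0.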

LST-Bench implementation

LST-Bench features a Java-based client application that runs SQL workloads on various engines. Users can define task, session, and phase libraries to reuse workload components, reference these libraries in their workload definitions, add new task templates, or create entirely new task libraries to model specific scenarios.

LST-Bench also includes a processing module that consolidates experimental results and calculates metrics to provide insights into table formats and engines. It uses both internal telemetry from LST-Bench and external telemetry from cloud services, such as resource utilization, storage API calls, and network I/O volume. The metrics processor offers multiple visualization options, including notebooks and a web app, to help users analyze performance data effectively.

An illustration depicting the components and execution model of the LST-Bench tool. The Client Application establishes connections with engines via dedicated drivers, while the Metrics Processor gathers telemetry from the Client Application, engines, and other cloud services. This data is aggregated and visualized using either a notebook or web application.
Figure 3. The LST-Bench tool components and execution model.

Implications and looking ahead

LST-Bench integrates seamlessly into the testing workflows of the Microsoft Fabric warehouse, allowing that team to rigorously assess engine performance, evaluate releases, and identify any issues. This leads to a more reliable and optimized user experience on the Microsoft Fabric data analytics platform. Additionally, LST-Bench holds promise as a foundational tool for various Microsoft initiatives. It’s currently instrumental in research projects focused on improving data organization for table formats, with the goal of increasing the performance of customer workloads on Microsoft Fabric. LST-Bench is also being used to evaluate the performance of table formats converted using Apache XTable (Incubating), an open-source tool designed to prevent data silos within data lakes.

LST-Bench is open source, and we welcome contributors to help expand this tool, making it highly effective for organizations to thoroughly evaluate their table formats.


We would like to thank Joyce Cahoon and Yiwen Zhu for their valuable discussions on the stability metric, and Jose Medrano and Emma Rose Wirshing for their feedback on LST-Bench and their work on integrating it with the Microsoft Fabric Warehouse.

The post LST-Bench: A new benchmark tool for open table formats in the data lake appeared first on Microsoft Research.

Code generation using Code Llama 70B and Mixtral 8x7B on Amazon SageMaker

In the ever-evolving landscape of machine learning and artificial intelligence (AI), large language models (LLMs) have emerged as powerful tools for a wide range of natural language processing (NLP) tasks, including code generation. Among these cutting-edge models, Code Llama 70B stands out as a true heavyweight, boasting an impressive 70 billion parameters. Developed by Meta and now available on Amazon SageMaker, this state-of-the-art LLM promises to revolutionize the way developers and data scientists approach coding tasks.

What are Code Llama 70B and Mixtral 8x7B?

Code Llama 70B is a variant of the Code Llama foundation model (FM), a fine-tuned version of Meta’s renowned Llama 2 model. This massive language model is specifically designed for code generation and understanding, capable of generating code from natural language prompts or existing code snippets. With its 70 billion parameters, Code Llama 70B offers unparalleled performance and versatility, making it a game-changer in the world of AI-assisted coding.

Mixtral 8x7B is a state-of-the-art sparse mixture of experts (MoE) foundation model released by Mistral AI. It supports multiple use cases such as text summarization, classification, text generation, and code generation. It is an 8x model, which means it contains eight distinct groups of parameters. The model has about 45 billion total parameters and supports a context length of 32,000 tokens. MoE is a type of neural network architecture that consists of multiple “experts,” where each expert is a neural network. In the context of transformer models, MoE replaces some feed-forward layers with sparse MoE layers. These layers have a certain number of experts, and a router network selects which experts process each token at each layer. MoE models enable more compute-efficient and faster inference compared to dense models.
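To illustrate how a router selects experts for a token, here is a toy sketch of top-k routing in plain Python. It is a simplified illustration, not Mixtral's implementation: the experts are trivial scaling functions, and the router logits are passed in directly, whereas in a real model they come from a learned layer applied to the token. The names `top_k_gates` and `moe_layer` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def top_k_gates(router_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token: Vector, router_logits: Sequence[float], experts: Sequence[Expert], k: int = 2) -> Vector:
    """Run only the selected experts and combine their outputs by gate weight."""
    output = [0.0] * len(token)
    for expert_index, gate in top_k_gates(router_logits, k):
        expert_out = experts[expert_index](token)
        output = [acc + gate * value for acc, value in zip(output, expert_out)]
    return output


# Eight toy experts: expert i multiplies its input by (i + 1).
toy_experts: List[Expert] = [lambda v, s=i + 1: [s * x for x in v] for i in range(8)]
```

Because only k of the eight experts run for each token, the compute per token is a fraction of what running every expert would cost, which is the source of the efficiency described above.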

Key features and capabilities of Code Llama 70B and Mixtral 8x7B include:

  1. Code generation: These LLMs excel at generating high-quality code across a wide range of programming languages, including Python, Java, C++, and more. They can translate natural language instructions into functional code, streamlining the development process and accelerating project timelines.
  2. Code infilling: In addition to generating new code, they can seamlessly infill missing sections of existing code by providing the prefix and suffix. This feature is particularly valuable for enhancing productivity and reducing the time spent on repetitive coding tasks.
  3. Natural language interaction: The instruct variants of Code Llama 70B and Mixtral 8x7B support natural language interaction, allowing developers to engage in conversational exchanges to develop code-based solutions. This intuitive interface fosters collaboration and enhances the overall coding experience.
  4. Long context support: With the ability to handle context lengths of up to 48 thousand tokens, Code Llama 70B can maintain coherence and consistency over extended code segments or conversations, ensuring relevant and accurate responses. Mixtral 8x7B has a context window of 32 thousand tokens.
  5. Multi-language support: While both of these models excel at generating code, their capabilities extend beyond programming languages. They can also assist with natural language tasks, such as text generation, summarization, and question answering, making them versatile tools for various applications.

Harnessing the power of Code Llama 70B and Mistral models on SageMaker

Amazon SageMaker, a fully managed machine learning service, provides a seamless integration with Code Llama 70B, enabling developers and data scientists to use its capabilities with just a few clicks. Here’s how you can get started:

  1. One-click deployment: Code Llama 70B and Mixtral 8x7B are available in Amazon SageMaker JumpStart, a hub that provides access to pre-trained models and solutions. With a few clicks, you can deploy them and create a private inference endpoint for your coding tasks.
  2. Scalable infrastructure: The SageMaker scalable infrastructure ensures that foundation models can handle even the most demanding workloads, allowing you to generate code efficiently and without delays.
  3. Integrated development environment: SageMaker provides a seamless integrated development environment (IDE) that you can use to interact with these models directly from your coding environment. This integration streamlines the workflow and enhances productivity.
  4. Customization and fine-tuning: While Code Llama 70B and Mixtral 8x7B are powerful out-of-the-box models, you can use SageMaker to fine-tune and customize a model to suit your specific needs, further enhancing its performance and accuracy.
  5. Security and compliance: SageMaker JumpStart employs multiple layers of security, including data encryption, network isolation, VPC deployment, and customizable inference, to ensure the privacy and confidentiality of your data when working with LLMs.
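The one-click deployment described above can also be scripted. The following is a minimal sketch using the SageMaker Python SDK, not a definitive implementation: the model IDs, the helper name deploy_jumpstart_model, and the default instance type are assumptions, so verify them in the JumpStart model hub for your Region before running it.

```python
# Sketch: deploy a JumpStart model with the SageMaker Python SDK.
# The model IDs below are assumptions; confirm them in the JumpStart hub.
MODEL_IDS = {
    "code-llama-70b": "meta-textgeneration-llama-codellama-70b",
    "mixtral-8x7b-instruct": "huggingface-llm-mixtral-8x7b-instruct",
}


def deploy_jumpstart_model(name: str, instance_type: str = "ml.g5.48xlarge"):
    """Deploy the named model and return a predictor for its endpoint."""
    model_id = MODEL_IDS[name]  # raises KeyError for an unknown name

    # Imported lazily so this module loads even where the SDK is not installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id, instance_type=instance_type)
    # Meta models require accepting the end-user license agreement (EULA).
    return model.deploy(accept_eula=model_id.startswith("meta-"))
```

Calling deploy_jumpstart_model("code-llama-70b") creates a billable endpoint, so delete it when you are done.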

Solution overview

The following figure showcases how code generation can be done using the Llama and Mistral AI models on SageMaker presented in this blog post.

You first deploy a SageMaker endpoint using an LLM from SageMaker JumpStart. For the examples presented in this article, you deploy either a Code Llama 70B or a Mixtral 8x7B endpoint. After the endpoint has been deployed, you can use it to generate code with the prompts provided in this article and the associated notebook, or with your own prompts. After the code has been generated with the endpoint, you can use a notebook to test the code and its functionality.


Prerequisites

In this section, you sign up for an AWS account and create an AWS Identity and Access Management (IAM) admin user.

If you’re new to SageMaker, we recommend that you read What is Amazon SageMaker?.

Use the following hyperlinks to finish setting up the prerequisites for an AWS account and SageMaker:

  1. Create an AWS Account: This walks you through setting up an AWS account.
  2. When you create an AWS account, you get a single sign-in identity that has complete access to all of the AWS services and resources in the account. This identity is called the AWS account root user.
  3. Signing in to the AWS Management Console using the email address and password that you used to create the account gives you complete access to all of the AWS resources in your account. We strongly recommend that you not use the root user for everyday tasks, even the administrative ones.
  4. Adhere to the security best practices in IAM, and Create an Administrative User and Group. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.
  5. In the console, go to the SageMaker console and open the left navigation pane.
    1. Under Admin configurations, choose Domains.
    2. Choose Create domain.
    3. Choose Set up for single user (Quick setup). Your domain and user profile are created automatically.
  6. Follow the steps in Custom setup to Amazon SageMaker to set up SageMaker for your organization.

With the prerequisites complete, you’re ready to continue.

Code generation scenarios

The Mixtral 8x7B and Code Llama 70B models require an ml.g5.48xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. To deploy an endpoint using SageMaker JumpStart, you might need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint use. You can request service quota increases through the AWS console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.
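As an illustration of the API route, the following sketch looks up the endpoint quota for an instance type and requests an increase through the Service Quotas API in Boto3. The helper names are hypothetical, and the quota name pattern ("<instance type> for endpoint usage") is an assumption that you should confirm in the Service Quotas console.

```python
def find_endpoint_quota(quotas, instance_type="ml.g5.48xlarge"):
    """Return the quota entry for endpoint usage of instance_type, or None."""
    # Assumed naming pattern for SageMaker endpoint quotas.
    wanted = f"{instance_type} for endpoint usage"
    for quota in quotas:
        if quota.get("QuotaName") == wanted:
            return quota
    return None


def request_endpoint_quota(desired_value=1.0, instance_type="ml.g5.48xlarge"):
    """Request a quota increase; requires AWS credentials and permissions."""
    import boto3  # imported lazily so the lookup helper has no dependencies

    client = boto3.client("service-quotas")
    paginator = client.get_paginator("list_service_quotas")
    quotas = [
        quota
        for page in paginator.paginate(ServiceCode="sagemaker")
        for quota in page["Quotas"]
    ]
    quota = find_endpoint_quota(quotas, instance_type)
    if quota is None:
        raise LookupError(f"No endpoint quota found for {instance_type}")
    return client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota["QuotaCode"],
        DesiredValue=desired_value,
    )
```

Quota increase requests are reviewed by AWS and might not be approved immediately, so submit the request before you plan to deploy.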

Code Llama use cases with SageMaker

While Code Llama excels at generating simple functions and scripts, its capabilities extend far beyond that. The models can generate complex code for advanced applications, such as building neural networks for machine learning tasks. Let’s explore an example of using Code Llama to create a neural network on SageMaker. Start by deploying the Code Llama model through SageMaker JumpStart.

  1. Launch SageMaker JumpStart
    Sign in to the console, navigate to SageMaker, and launch the SageMaker domain to open SageMaker Studio. Within SageMaker Studio, select JumpStart in the left-hand navigation menu.
  2. Search for Code Llama 70B
    In the JumpStart model hub, search for Code Llama 70B in the search bar. You should see the Code Llama 70B model listed under the Models category.
  3. Deploy the Model
    Select the Code Llama 70B model, and then choose Deploy. Enter an endpoint name (or keep the default value) and select the target instance type (for example, ml.g5.48xlarge). Choose Deploy to start the deployment process. You can leave the rest of the options as default.

Additional details on deployment can be found in Code Llama 70B is now available in Amazon SageMaker JumpStart.

  1. Create an inference endpoint
    After the deployment is complete, SageMaker will provide you with an inference endpoint URL. Copy this URL to use later.
  2. Set up your development environment
    You can interact with the deployed Code Llama 70B model using Python and the AWS SDK for Python (Boto3). First, make sure you have the required dependencies installed: pip install boto3

Note: This blog post section contains code that was generated with the assistance of Code Llama 70B, powered by Amazon SageMaker.

Generating a transformer model for natural language processing

Let us walk through a code generation example with Code Llama 70B, where you generate a transformer model in Python using the Amazon SageMaker SDK.


<<SYS>>You are an expert code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself. Always use Amazon SageMaker SDK for python code generation. Add test case to test the code<</SYS>>

Generate a Python code that defines and trains a Transformer model for text classification on movie dataset. The python code should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.


Code Llama generates a Python script for training a Transformer model on the sample dataset using TensorFlow and Amazon SageMaker.

Code example:
Create a new Python script (for example, code_llama_inference.py) and add the following code. Replace <YOUR_ENDPOINT_NAME> with the actual inference endpoint name provided by SageMaker JumpStart:

import json

import boto3

# Set up the SageMaker runtime client
session = boto3.Session()
sagemaker_client = session.client("sagemaker-runtime")

# Set the inference endpoint name
endpoint_name = "<YOUR_ENDPOINT_NAME>"

def query_endpoint(payload):
    # Send the JSON payload to the endpoint and parse the JSON response
    response = sagemaker_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload).encode("utf-8"),
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

def print_completion(prompt: str, response: list) -> None:
    # ANSI escape codes switch bold text on and off in the terminal
    bold, unbold = "\033[1m", "\033[0m"
    print(f"{bold}> Input{unbold}\n{prompt}{bold}\n> Output{unbold}\n{response[0]['generated_text']}\n")

# Define the input prompt
prompt = """
<<SYS>>You are an expert code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself. Always use Amazon SageMaker SDK for python code generation. Add test case to test the code<</SYS>>
Generate a Python code that defines and trains a Transformer model for text classification on movie dataset. The python code should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
"""

# Send the request to the endpoint and decode the response
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2000, "temperature": 0.2, "top_p": 0.9},
}

response = query_endpoint(payload)

print_completion(prompt, response)

Save the script and run it:

python code_llama_inference.py

The script will send the provided prompt to the Code Llama 70B model deployed on SageMaker, and the model’s response will be printed to the output.

Example output:


You are an expert AI assistant specializing in generating Python code for natural language processing (NLP) tasks on Amazon SageMaker. Your code should follow best practices, be well-documented, and include examples of usage.

Generate a Python script that defines and trains a Transformer model for text classification on movie dataset. The script should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.

> Output

import os
import json
import sagemaker
import boto3
import numpy as np
import pandas as pd
from sagemaker.tensorflow import TensorFlow
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

# Set up SageMaker session
sagemaker_session = sagemaker.Session()
region = boto3.Session().region_name
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-xSum"

# Download dataset
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Prepare dataset
train_data = pd.DataFrame(
    {"text": train.data, "label": train.target.astype(np.int32)}
)
test_data = pd.DataFrame({"text": test.data, "label": test.target.astype(np.int32)})

# Split dataset
train_data, validation_data = train_test_split(train_data, test_size=0.2)

# Upload dataset to S3
train_input = sagemaker_session.upload_data(
    path=train_data.to_csv(header=False, index=False),
    key_prefix=os.path.join(prefix, "train"),
)
validation_input = sagemaker_session.upload_data(
    path=validation_data.to_csv(header=False, index=False),
    key_prefix=os.path.join(prefix, "validation"),
)
test_input = sagemaker_session.upload_data(
    path=test_data.to_csv(header=False, index=False),
    key_prefix=os.path.join(prefix, "test"),
)

# Define hyperparameters
hyperparameters = {
    "epochs": 1,
    "batch_size": 64,
    "max_seq_length": 256,
    "model_dir": "/opt/ml/model",
    "train_dir": "/opt/ml/input/data/train",
    "validation_dir": "/opt/ml/input/data/validation",
    "test_dir": "/opt/ml/input/data/test",
    "train_filename": "train.csv",
    "validation_filename": "validation.csv",
    "test_filename": "test.csv",
    "freeze_bert_layer": True,
    "learning_rate": 0.0001,
    "epsilon": 1e-08,
    "bert_trainable_layers": None,
    "bert_layer": -2,
    "lambda_regularizer": 0.01,
    "gradient_accumulation_steps": 2,
    "steps_per_epoch": 100,
    "validation_steps": 10,
    "test_steps": 10,
    "output_dir": "/opt/ml/output",
}

# Define estimator
estimator = TensorFlow(

# Train model
estimator.fit(
    {
        "train": train_input,
        "validation": validation_input,
        "test": test_input,
    }
)

# Deploy model
predictor = estimator.deploy(
    initial_instance_count=1, instance_type="ml.p3.2xlarge", wait=True
)

# Predict
predictor.predict(json.dumps({"instances": [test_data.iloc[0]["text"]]}))

You can modify the prompt variable to request different code generation tasks or engage in natural language interactions with the model.

This example demonstrates how to deploy and interact with the Code Llama 70B model on SageMaker JumpStart using Python and the AWS SDK. Because the model can make minor errors when generating a response, make sure you run and test the generated code. You can also instruct the model to review its output and refine the response to fix any remaining errors in the code. With this setup, you can use the code generation capabilities of Code Llama 70B within your development workflows to streamline the coding process. Let’s take a look at some additional examples.
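Before running generated code, a quick static check can catch truncated or malformed output. The following is a minimal sketch that uses only the Python standard library; the helper names check_generated_code and imported_modules are hypothetical, and a passing check only means the text parses, not that the code is correct.

```python
import ast


def check_generated_code(source: str):
    """Parse source without executing it and return (ok, message)."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return False, f"line {err.lineno}: {err.msg}"
    return True, "syntax OK"


def imported_modules(source: str):
    """Return the top-level modules that the source imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules
```

The list of imported modules is a convenient way to see which packages the generated script expects before you install anything or execute it.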

Additional examples and use cases

Let’s walk through some other complex code generation scenarios. In the following sample, we’re running the script to generate a Deep Q reinforcement learning (RL) agent for playing the CartPole-v1 environment.

Generating a reinforcement learning agent

The following prompt was tested on Code Llama 70B to generate a Deep Q RL agent for playing the CartPole-v1 environment.


You are a skilled AI assistant capable of generating Python code for reinforcement learning tasks on Amazon SageMaker. Your code should be efficient, well-documented, and include examples of usage.

Could you please generate a Python script that implements a Deep Q-Network (DQN) agent for playing the CartPole-v1 environment? The script should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.

Response: Code Llama generates a Python script for training a DQN agent on the CartPole-v1 environment using TensorFlow and Amazon SageMaker as showcased in our GitHub repository.

Generating a distributed training script

In this scenario, you generate sample Python code for distributed machine learning training on Amazon SageMaker using Code Llama 70B.


You are an expert AI assistant skilled in generating Python code for distributed machine learning training on Amazon SageMaker. Your code should be optimized for performance, follow best practices, and include examples of usage.

Could you please generate a Python script that performs distributed training of a deep neural network for image classification on the ImageNet dataset? The script should use Amazon SageMaker's PyTorch estimator with distributed data parallelism and be ready for deployment on SageMaker.

Response: Code Llama generates a Python script for distributed training of a deep neural network on the ImageNet dataset using PyTorch and Amazon SageMaker. Additional details are available in our GitHub repository.

Mixtral 8x7B use cases with SageMaker

Compared to traditional LLMs, Mixtral 8x7B offers the advantage of faster decoding at the speed of a smaller, parameter-dense model despite containing more parameters. It also outperforms other open-access models on certain benchmarks and supports a longer context length.

  1. Launch SageMaker JumpStart
    Sign in to the console, navigate to SageMaker, and launch the SageMaker domain to open SageMaker Studio. Within SageMaker Studio, select JumpStart in the left-hand navigation menu.
  2. Search for Mixtral 8x7B Instruct
    In the JumpStart model hub, search for Mixtral 8x7B Instruct in the search bar. You should see the Mixtral 8x7B Instruct model listed under the Models category.
  3. Deploy the Model
    Select the Mixtral 8x7B Instruct model, and then choose Deploy. Enter an endpoint name (or keep the default value) and choose the target instance type (for example, ml.g5.48xlarge). Choose Deploy to start the deployment process. You can leave the rest of the options as default.

Additional details on deployment can be found in Mixtral-8x7B is now available in Amazon SageMaker JumpStart.

  1. Create an inference endpoint
    After the deployment is complete, SageMaker will provide you with an inference endpoint URL. Copy this URL to use later.
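Once the endpoint is available, prompts for the Instruct variant are typically wrapped in the [INST] markers that Mixtral 8x7B Instruct expects. The following sketch builds such a prompt and a request payload. The helper names are hypothetical, the generation parameters mirror the earlier Code Llama example, and folding the system text into the instruction is an assumption rather than an official template.

```python
def build_mixtral_prompt(instruction: str, system: str = "") -> str:
    """Wrap an instruction in the Mixtral Instruct [INST] markers."""
    instruction = instruction.strip()
    system = system.strip()
    # Assumption: system guidance is prepended to the user instruction.
    text = f"{system}\n\n{instruction}" if system else instruction
    return f"<s>[INST] {text} [/INST]"


def build_payload(prompt: str, max_new_tokens: int = 2000,
                  temperature: float = 0.2, top_p: float = 0.9) -> dict:
    """Build the JSON-serializable request body for the endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
```

The resulting payload can be serialized with json.dumps and sent to the endpoint in the same way as the Code Llama request shown earlier.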

Generating a hyperparameter tuning script for SageMaker

Hyperparameters are external configuration variables that data scientists use to manage machine learning model training. Sometimes called model hyperparameters, they are set manually before training a model. They’re different from parameters, which are internal values derived automatically during the learning process and not set by data scientists. Hyperparameters directly control model structure, function, and performance.

When you build complex machine learning systems like deep learning neural networks, exploring all the possible combinations is impractical. Hyperparameter tuning can accelerate your productivity by trying many variations of a model. It looks for the best model automatically by focusing on the most promising combinations of hyperparameter values within the ranges that you specify. To get good results, you must choose the right ranges to explore.

SageMaker automatic model tuning (AMT) finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create a model that performs the best, as measured by a metric that you choose.
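To make the idea concrete, the following toy sketch runs a plain random search over two hyperparameter ranges and keeps the best result. It is only an illustration of searching within ranges, not how AMT is implemented; AMT launches real training jobs and supports strategies such as Bayesian optimization. The objective function and its optimum are made up for the example.

```python
import random


def random_search(objective, ranges, trials=50, seed=0):
    """Sample each hyperparameter uniformly and keep the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for a validation error (lower is better)."""
    return (params["alpha"] - 0.5) ** 2 + (params["eta"] - 0.1) ** 2


# The same ranges that the tuning prompt in this section asks for.
ranges = {"alpha": (0.1, 1.0), "eta": (0.01, 0.3)}
best_params, best_score = random_search(toy_objective, ranges)
print(best_params, best_score)
```

In a real tuning job, each call to the objective corresponds to a full training run, which is why choosing sensible ranges matters so much.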

Note: This blog post section contains code that was generated with the assistance of the Mixtral 8x7B model, powered by Amazon SageMaker.


You are an AI assistant specializing in generating Python code for hyperparameter tuning of machine learning models on Amazon SageMaker. Your code should be efficient, well-documented, and include examples of usage.

Generate a Python script that performs hyperparameter tuning of an XGBoost model for regression on the Boston Housing dataset. The script should use Amazon SageMaker's XGBoost estimator and be ready for deployment on SageMaker. It should optimize for validation:accuracy, and optimize the hyperparameters alpha and eta.



Here's a Python script that performs hyperparameter tuning of an XGBoost model for regression on the Boston Housing dataset using Amazon SageMaker. This script uses the XGBoost estimator and optimizes for validation:accuracy while tuning the hyperparameters `alpha` and `eta`.

import boto3
import numpy as np
import pandas as pd
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# Set up SageMaker session and specify the IAM role
sagemaker_session = boto3.Session(region_name='us-west-2')
role = sagemaker_session.get_execution_role()

# Specify the SageMaker XGBoost image version
xgboost_image = 'xgboost:1.3-1'

# Define the hyperparameter tuning configuration
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0.1, 1.0),
    'eta': ContinuousParameter(0.01, 0.3)
}

objective_metric_name = 'validation:accuracy'
metric_definitions = [
    {
        'Name': 'validation:accuracy',
        'Regex': 'validation-accuracy: ([0-9\.]+)'
    }
]

tuner = HyperparameterTuner(
        hyperparameters={'objective': 'reg:squarederror', 'num_boost_round': 100},

# Load the Boston Housing dataset
boston_data = pd.read_csv('boston_housing.csv')
train_data = boston_data.sample(frac=0.8, random_state=42)
test_data = boston_data.drop(train_data.index)

train_input_path = sagemaker_session.upload_data(path=train_data, key_prefix='xgboost-tuner/input/train')
test_input_path = sagemaker_session.upload_data(path=test_data, key_prefix='xgboost-tuner/input/test')

# Set up the training and validation datasets
train_channel = sagemaker_session.input_channel(name='train', path=train_input_path)
test_channel = sagemaker_session.input_channel(name='test', path=test_input_path)

# Define the SageMaker XGBoost estimator for training
xgboost_estimator = XGBoost(
    hyperparameters={'objective': 'reg:squarederror', 'num_boost_round': 100},

# Perform hyperparameter tuning
tuner.fit({'train': train_channel, 'test': test_channel}, include_cls_metadata=False)

# Get the best hyperparameters and create a new XGBoost estimator
best_training_job = tuner.best_training_job()
best_hyperparameters = tuner.best_hyperparameters
print(f'Best hyperparameters: {best_hyperparameters}')

new_xgboost_estimator = XGBoost(

# Train the model with the best hyperparameters
new_xgboost_estimator.fit({'train': train_channel, 'test': test_channel}, include_cls_metadata=False)

# Save the trained model for deployment
model_data = new_xgboost_estimator.create_model()
model_name = 'xgboost-tuner-boston-model'
sagemaker_session.create_model(model_name=model_name, model_data=model_data)

# Deploy the model
predictor = sagemaker_session.create_predictor(model_name,

Replace `'boston_housing.csv'` with the path to your Boston Housing dataset. This script assumes that the dataset has columns named `'CRIM'`, `'ZN'`, `'INDUS'`, `'CHAS'`, `'NOX'`, `'RM'`, `'AGE'`, `'DIS'`, `'RAD'`, `'TAX'`, `'PTRATIO'`, `'B'`, `'LSTAT'`, and `'MEDV'`.

To run this script, you need to have the AWS CLI and Boto3 installed. Make sure to configure your AWS credentials using `aws configure`.

The script also requires a file named `inference.py` in the `xgboost-tuner` directory for deployment. Here's a simple example of `inference.py`:

import json
import xgboost as xgb
import numpy as np

def model_fn(model_dir):
    model = xgb.Booster(model_file=os.path.join(model_dir, 'model.joblib'))
    return model

def input_fn(request_body, content_type):
    if content_type == 'text/csv':
        features = np.loadtxt(pd.compat.StringIO(request_body), delimiter=',')
        return {'features': features.reshape(1, -1)}
    raise ValueError("Request body must be csv-formatted.")

def predict_fn(input_data, model):
    features = input_data['features']
    prediction = model.predict(features)
    return {'prediction': float(prediction)}

Make sure to install the required libraries (`boto3`, `numpy`, `pandas`, `xgboost`, and `sagemaker`) before running the script.
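When you test a generated handler like the one above, note that it calls os.path.join and pd.compat.StringIO without importing os or pandas. As a hedged illustration of how the CSV parsing step could be written with only the Python standard library, consider the following sketch. The function name mirrors the generated code, and returning a list of rows is an assumption made for this example.

```python
import csv
import io


def input_fn(request_body: str, content_type: str):
    """Parse a CSV request body into a list of rows of floats."""
    if content_type != "text/csv":
        raise ValueError("Request body must be csv-formatted.")
    reader = csv.reader(io.StringIO(request_body))
    # Skip empty lines so a trailing newline does not produce an empty row.
    rows = [[float(value) for value in row] for row in reader if row]
    return {"features": rows}
```

Checks like this are part of validating model-generated code before you rely on it in a deployment.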

Code Transformation: Java to Python

There are instances where users need to convert code written in one programming language to another. This is known as a cross-language transformation task, and foundation models can help automate the process.


You are a programmer specialized in Java and Python languages. Your code should be efficient, well-documented, and include examples of usage.
Convert the following Java code to its equivalent in Python:

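import java.util.ArrayList;

public class Main {
  public static void main(String[] args) {
    ArrayList<String> cars = new ArrayList<String>();
    cars.add("Volvo");
    cars.add("BMW");
    cars.add("Ford");
    cars.add("Mazda");
    for (int i = 0; i < cars.size(); i++) {
      System.out.println(cars.get(i));
    }
  }
}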


import ArrayList from java.util import ArrayList

def main():
    # Create an ArrayList of strings
    cars = ArrayList[str]()

    # Add elements to the ArrayList
    cars.add("Volvo")
    cars.add("BMW")
    cars.add("Ford")
    cars.add("Mazda")

    # Iterate over the ArrayList and print each element
    for i in range(cars.size()):
        print(cars.get(i))

if __name__ == "__main__":
    main()

Note: This Python code uses the Java ArrayList class, which is part of the JPype library that allows Python programs to call Java code. 
The code above will only work in an environment that has JPype installed and configured.

For pure Python solution, here is the equivalent code:

def main():
    # Create a list of strings
    cars = ["Volvo", "BMW", "Ford", "Mazda"]

    # Iterate over the list and print each element
    for car in cars:
        print(car)

if __name__ == "__main__":
    main()

This Python code uses a built-in list data structure instead of the Java ArrayList class. The code above is more idiomatic and efficient in Python.

AWS CDK code for a three-tier web application

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework for defining cloud infrastructure as code with modern programming languages and deploying it through AWS CloudFormation.

The three-tier architecture pattern provides a general framework to ensure decoupled and independently scalable application components can be separately developed, managed, and maintained (often by distinct teams). A three-tier architecture is the most popular implementation of a multi-tier architecture and consists of a single presentation tier, logic tier, and data tier:

  • Presentation tier: Component that the user directly interacts with (for example, webpages and mobile app UIs).
  • Logic tier: Code required to translate user actions to application functionality (for example, CRUD database operations and data processing).
  • Data tier: Storage media (for example, databases, object stores, caches, and file systems) that hold the data relevant to the application.


You are a programmer specialized in AWS, and programming in Python and the AWS Cloud Development Kit. Your code should be efficient, well-documented, and include examples of usage.
Create a CDK script in Python for configuring and deploying a three tier web application in AWS. The architecture should consist of these three tiers:

- Presentation tier: The front-end application is all static content (HTML, CSS, JavaScript and images) which are generated by React utilities like create-react-app. Amazon CloudFront hosts all these objects. The web application, when used, downloads all the resources to the browser and starts to run from there. The web application connects to the backend calling the APIs.

- Logic tier: Logic layer is built using Lambda functions fronted by API Gateway REST APIs. This architecture shows multiple exposed services. There are multiple different Lambda functions each handling a different aspect of the application. The Lambda functions are behind API Gateway and accessible using API URL paths.

- Data tier: DynamoDB is used for the data storage.


Sure, here's an example CDK script in Python that configures and deploys a three-tier web application in AWS:
from aws_cdk import (
aws_apigateway as apigw,
aws_lambda as _lambda,
aws_lambda_event_sources as lambda_event_sources,
aws_logs as logs,
aws_dynamodb as ddb,

class ThreeTierWebAppStack(core.Stack):

def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
super().__init__(scope, id, **kwargs)

# Data tier
self.table = ddb.Table(
self, "DataTable",
partition_key=ddb.Attribute(name="id", type=ddb.AttributeType.STRING),

# Logic tier
self.lambda_function = _lambda.Function(
self, "LambdaFunction",
"TABLE_NAME": self.table.table_name


self.api = apigw.LambdaRestApi(
self, "Api",

# Presentation tier
self.distribution = aws_cloudfront.Distribution(
self, "Distribution",

Here's a brief explanation of the code:

* The `ThreeTierWebAppStack` class extends the `core.Stack` class and is the main entry point for the CDK script.
* The `table` attribute is a DynamoDB table used for data storage.
* The `lambda_function` attribute is a Lambda function that handles the business logic of the application. It has an IAM role that allows it to read and write to the DynamoDB table.
* The `api` attribute is an API Gateway REST API that exposes the Lambda function as a web service.
* The `distribution` attribute is a CloudFront distribution that serves the static content of the application. It uses the API Gateway as the origin and redirects all HTTP traffic to HTTPS.

To use this script, you need to create a `lambda` directory in the same directory as the script and put your Lambda function code in a file named `lambda_function.py` in that directory. You also need to install the AWS CDK and its dependencies using `pip install aws-cdk.core aws-cdk.aws-lambda aws-cdk.aws-lambda_event_sources aws-cdk.aws-logs aws-cdk.aws-dynamodb aws-cdk.aws-apigateway aws-cdk.aws-cloudfront`.

Once you have done that, you can deploy the stack using `cdk deploy`. This will create the DynamoDB table, the Lambda function, the API Gateway, and the CloudFront distribution. The URL of the CloudFront distribution will be printed to the console after the deployment is complete. You can use this URL to access the web application.

Additional considerations

The following are some additional considerations when implementing these models:

  • Different models will produce different results, so you should conduct experiments with different foundation models and different prompts for your use case to achieve the desired results.
  • The analyses provided are not meant to replace human judgement. You should be mindful of potential hallucinations when working with generative AI, and use the analysis only as a tool to assist and speed up code generation.

Clean up

Delete the model endpoints deployed using Amazon SageMaker for Code Llama and Mistral to avoid incurring any additional costs in your account.

Shut down any SageMaker Notebook instances that were created for deploying or running the examples showcased in this blog post to avoid any notebook instance costs associated with the account.
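If you prefer to clean up programmatically, the following sketch deletes an endpoint together with its endpoint configuration and models. The helper name delete_endpoint_resources is hypothetical, and the client is passed in as an argument; the sketch assumes it is a Boto3 SageMaker client with permission to describe and delete these resources.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str) -> None:
    """Delete an endpoint, its endpoint configuration, and its models."""
    # Look up the related resources before anything is deleted.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config.get("ProductionVariants", [])
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
```

For example, you could call delete_endpoint_resources(boto3.client("sagemaker"), "<YOUR_ENDPOINT_NAME>") once for each endpoint that you deployed.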


The combination of exceptional capabilities from foundation models like Code Llama 70B and Mixtral 8x7B and the powerful machine learning platform of SageMaker presents a unique opportunity for developers and data scientists to revolutionize their coding workflows. The cutting-edge capabilities of FMs empower customers to generate high-quality code, infill missing sections, and engage in natural language interactions, all while using the scalability, security, and compliance of AWS.

The examples highlighted in this blog post demonstrate these models’ advanced capabilities in generating complex code for various machine learning tasks, such as natural language processing, reinforcement learning, distributed training, and hyperparameter tuning, all tailored for deployment on SageMaker. Developers and data scientists can now streamline their workflows, accelerate development cycles, and unlock new levels of productivity in the AWS Cloud.

Embrace the future of AI-assisted coding and unlock new levels of productivity with Code Llama 70B and Mixtral 8x7B on Amazon SageMaker. Start your journey today and experience the transformative power of these groundbreaking language models.


  1. Code Llama 70B is now available in Amazon SageMaker JumpStart
  2. Fine-tune Code Llama on Amazon SageMaker JumpStart
  3. Mixtral-8x7B is now available in Amazon SageMaker JumpStart

About the Authors

Shikhar Kwatra is an AI/ML Solutions Architect at Amazon Web Services based in California. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partners in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Jose Navarro is an AI/ML Solutions Architect at AWS based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.


Why Accelerated Data Processing Is Crucial for AI Innovation in Every Industry


Across industries, AI is supercharging innovation with machine-powered computation. In finance, bankers are using AI to detect fraud more quickly and keep accounts safe, telecommunications providers are improving networks to deliver superior service, scientists are developing novel treatments for rare diseases, utility companies are building cleaner, more reliable energy grids and automotive companies are making self-driving cars safer and more accessible.

The backbone of top AI use cases is data. Effective and precise AI models require training on extensive datasets. Enterprises seeking to harness the power of AI must establish a data pipeline that involves extracting data from diverse sources, transforming it into a consistent format and storing it efficiently.

Data scientists work to refine datasets through multiple experiments to fine-tune AI models for optimal performance in real-world applications. These applications, from voice assistants to personalized recommendation systems, require rapid processing of large data volumes to deliver real-time performance.

As AI models become more complex and begin to handle diverse data types such as text, audio, images, and video, the need for rapid data processing becomes more critical. Organizations that continue to rely on legacy CPU-based computing are struggling with hampered innovation and performance due to data bottlenecks, escalating data center costs, and insufficient computing capabilities.

Many businesses are turning to accelerated computing to integrate AI into their operations. This method leverages GPUs, specialized hardware, software, and parallel computing techniques to boost computing performance by as much as 150x and increase energy efficiency by up to 42x.

Leading companies across different sectors are using accelerated data processing to spearhead groundbreaking AI initiatives.

Finance Organizations Detect Fraud in a Fraction of a Second

Financial organizations face a significant challenge in detecting patterns of fraud due to the vast amount of transactional data that requires rapid analysis. Additionally, the scarcity of labeled data for actual instances of fraud poses a difficulty in training AI models. Conventional data science pipelines lack the required acceleration to handle the large data volumes associated with fraud detection. This leads to slower processing times that hinder real-time data analysis and fraud detection capabilities.

To overcome these challenges, American Express, which handles more than 8 billion transactions per year, uses accelerated computing to train and deploy long short-term memory (LSTM) models. These models excel in sequential analysis and detection of anomalies, and can adapt and learn from new data, making them ideal for combating fraud.

Leveraging parallel computing techniques on GPUs, American Express significantly speeds up the training of its LSTM models. GPUs also enable live models to process huge volumes of transactional data and run the high-performance computations needed to detect fraud in real time.

The system operates within two milliseconds of latency to better protect customers and merchants, delivering a 50x improvement over a CPU-based configuration. By combining the accelerated LSTM deep neural network with its existing methods, American Express has improved fraud detection accuracy by up to 6% in specific segments.

Financial companies can also use accelerated computing to reduce data processing costs. Running data-heavy Spark 3 workloads on NVIDIA GPUs, PayPal confirmed the potential to reduce cloud costs by up to 70% for big data processing and AI applications.

By processing data more efficiently, financial institutions can detect fraud in real time, enabling faster decision-making without disrupting transaction flow and minimizing the risk of financial loss.

Telcos Simplify Complex Routing Operations

Telecommunications providers generate immense amounts of data from various sources, including network devices, customer interactions, billing systems, and network performance and maintenance.

Managing national networks that handle hundreds of petabytes of data every day requires complex technician routing to ensure service delivery. To optimize technician dispatch, advanced routing engines perform trillions of computations, taking into account factors like weather, technician skills, customer requests and fleet distribution. Success in these operations depends on meticulous data preparation and sufficient computing power.

AT&T, which operates one of the nation’s largest field dispatch teams to service its customers, is enhancing data-heavy routing operations with NVIDIA cuOpt, which relies on heuristics, metaheuristics and optimizations to calculate complex vehicle routing problems.

In early trials, cuOpt delivered routing solutions in 10 seconds, achieving a 90% reduction in cloud costs and enabling technicians to complete more service calls daily. NVIDIA RAPIDS, a suite of software libraries that enables acceleration of data science and analytics pipelines, further accelerates cuOpt, allowing companies to integrate local search heuristics and metaheuristics like Tabu search for continuous route optimization.
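To make the idea of a metaheuristic such as Tabu search more concrete, here is a minimal CPU-only sketch in plain Python. It is not cuOpt's API or algorithm. It is a toy that reorders a handful of hypothetical stops for a single technician, and the names `route_length`, `tabu_search` and `tabu_tenure` are chosen for illustration only.

```python
import itertools
import math


def route_length(route, dist):
    """Length of a closed tour that returns to its first stop."""
    return sum(dist[route[i]][route[(i + 1) % len(route)]] for i in range(len(route)))


def tabu_search(dist, iterations=200, tabu_tenure=5):
    """Toy tabu search: improve a visiting order by swapping two positions."""
    n = len(dist)
    current = list(range(n))              # start from the naive order 0, 1, 2, ...
    best, best_len = current[:], route_length(current, dist)
    tabu = {}                             # swapped positions (i, j) -> last forbidden iteration

    for it in range(iterations):
        candidate, candidate_len, candidate_move = None, math.inf, None
        # Position 0 is treated as the depot and never moved.
        for i, j in itertools.combinations(range(1, n), 2):
            neighbor = current[:]
            neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
            length = route_length(neighbor, dist)
            is_tabu = tabu.get((i, j), -1) >= it
            # Aspiration rule: a tabu move is allowed only if it beats the best tour so far.
            if is_tabu and length >= best_len:
                continue
            if length < candidate_len:
                candidate, candidate_len, candidate_move = neighbor, length, (i, j)
        if candidate is None:
            break
        # Accept the best admissible neighbor even if it is worse than the current
        # tour; this is what lets tabu search escape local minima.
        current = candidate
        tabu[candidate_move] = it + tabu_tenure
        if candidate_len < best_len:
            best, best_len = candidate[:], candidate_len
    return best, best_len


# Hypothetical stop coordinates; stop 0 is the depot.
stops = [(0, 0), (0, 2), (3, 2), (3, 0), (1, 1)]
dist = [[math.dist(a, b) for b in stops] for a in stops]
best_route, best_len = tabu_search(dist)
print(best_route, round(best_len, 2))
```

Production routing engines add constraints such as time windows, skills and vehicle capacity, and evaluate far more candidate moves in parallel, which is where GPU acceleration pays off.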

AT&T is adopting NVIDIA RAPIDS Accelerator for Apache Spark to enhance the performance of Spark-based AI and data pipelines. This has helped the company boost operational efficiency on everything from training AI models to maintaining network quality to reducing customer churn and improving fraud detection. With RAPIDS Accelerator, AT&T is reducing its cloud computing spend for target workloads while enabling faster performance and reducing its carbon footprint.
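As a rough illustration of what adopting the accelerator can look like, the sketch below assembles a `spark-submit` command line that enables the RAPIDS Accelerator plugin. It is not AT&T's configuration. The jar path, the application name, the GPU amounts and the helper name `rapids_spark_submit_args` are assumptions; the configuration keys follow the plugin's and Spark's documented settings as best understood here and should be checked against the current documentation.

```python
def rapids_spark_submit_args(plugin_jar, app, gpus_per_executor=1):
    """Build a spark-submit argument list that turns on the RAPIDS Accelerator.

    plugin_jar and app are hypothetical paths supplied by the caller.
    """
    conf = {
        # Load the RAPIDS Accelerator SQL plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow supported SQL/DataFrame operations to run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Standard Spark 3 GPU scheduling settings (values are illustrative).
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": "0.25",
    }
    args = ["spark-submit", "--jars", plugin_jar]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


print(" ".join(rapids_spark_submit_args("rapids-4-spark.jar", "etl_job.py")))
```

Because the plugin works at the level of Spark's query plan, existing Spark SQL and DataFrame code typically does not need to be rewritten to benefit from it.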

Accelerated data pipelines and processing will be critical as telcos seek to improve operational efficiency while delivering the highest possible service quality.

Biomedical Researchers Condense Drug Discovery Timelines

As researchers utilize technology to study the roughly 25,000 genes in the human genome to understand their relationship with diseases, there has been an explosion of medical data and peer-reviewed research papers. Biomedical researchers rely on these papers to narrow down the field of study for novel treatments. However, conducting literature reviews of such a massive and expanding body of relevant research has become an impossible task.

AstraZeneca, a leading pharmaceutical company, developed a Biological Insights Knowledge Graph (BIKG) to aid scientists across the drug discovery process, from literature reviews to screen hit rating, target identification and more. This graph integrates public and internal databases with information from scientific literature, modeling between 10 million and 1 billion complex biological relationships.

BIKG has been effectively used for gene ranking, aiding scientists in hypothesizing high-potential targets for novel disease treatments. At NVIDIA GTC, the AstraZeneca team presented a project that successfully identified genes linked to resistance in lung cancer treatments.

To narrow down potential genes, data scientists and biological researchers collaborated to define the criteria and gene features ideal for targeting in treatment development. They trained a machine learning algorithm to search the BIKG databases for genes with the designated features mentioned in literature as treatable. Utilizing NVIDIA RAPIDS for faster computations, the team reduced the initial gene pool from 3,000 to just 40 target genes, a task that previously took months but now takes mere seconds.

By supplementing drug development with accelerated computing and AI, pharmaceutical companies and researchers can finally use the enormous troves of data building up in the medical field to develop novel drugs faster and more safely, ultimately having a life-saving impact.

Utility Companies Build the Future of Clean Energy 

There’s been a significant push to shift to carbon-neutral energy sources in the energy sector. With the cost of harnessing renewable resources such as solar energy falling drastically over the last 10 years, the opportunity to make real progress toward a clean energy future has never been greater.

However, this shift toward integrating clean energy from wind farms, solar farms and home batteries has introduced new complexities in grid management. As energy infrastructure diversifies and two-way power flows must be accommodated, managing the grid has become more data-intensive. New smart grids are now required to handle high-voltage areas for vehicle charging. They must also manage the availability of distributed stored energy sources and adapt to variations in usage across the network.

Utilidata, a prominent grid-edge software company, has collaborated with NVIDIA to develop a distributed AI platform, Karman, for the grid edge using a custom NVIDIA Jetson Orin edge AI module. This custom chip and platform, embedded in electricity meters, transforms each meter into a data collection and control point, capable of handling thousands of data points per second.

Karman processes real-time, high-resolution data from meters at the network’s edge. This enables utility companies to gain detailed insights into grid conditions, predict usage and seamlessly integrate distributed energy resources in seconds, rather than minutes or hours. Additionally, with inference models on edge devices, network operators can anticipate and quickly identify line faults to predict potential outages and conduct preventative maintenance to increase grid reliability.

Through the integration of AI and accelerated data analytics, Karman helps utility providers transform existing infrastructure into efficient smart grids. This allows for tailored, localized electricity distribution to meet fluctuating demand patterns without extensive physical infrastructure upgrades, facilitating a more cost-effective modernization of the grid.

Automakers Enable Safer, More Accessible, Self-Driving Vehicles

As auto companies strive for full self-driving capabilities, vehicles must be able to detect objects and navigate in real time. This requires high-speed data processing tasks, including feeding live data from cameras, lidar, radar and GPS into AI models that make navigation decisions to keep roads safe.

The autonomous driving inference workflow is complex and includes multiple AI models along with necessary preprocessing and postprocessing steps. Traditionally, these steps were handled on the client side using CPUs. However, this can lead to significant bottlenecks in processing speeds, which is an unacceptable drawback for an application where fast processing equates to safety.

To enhance the efficiency of autonomous driving workflows, electric vehicle manufacturer NIO integrated NVIDIA Triton Inference Server into its inference pipeline. NVIDIA Triton is open-source, multi-framework, inference-serving software. By centralizing data processing tasks, NIO reduced latency by 6x in some core areas and increased overall data throughput by up to 5x.

NIO’s GPU-centric approach made it easier to update and deploy new AI models without the need to change anything on the vehicles themselves. Additionally, the company could use multiple AI models at the same time on the same set of images without having to send data back and forth over a network, which saved on data transfer costs and improved performance.

By using accelerated data processing, autonomous vehicle software developers ensure they can reach a high-performance standard to avoid traffic accidents, lower transportation costs and improve mobility for users.

Retailers Improve Demand Forecasting

In the fast-paced retail environment, the ability to process and analyze data quickly is critical to adjusting inventory levels, personalizing customer interactions and optimizing pricing strategies on the fly. The larger a retailer is and the more products it carries, the more complex and compute-intensive its data operations will be.

Walmart, the largest retailer in the world, turned to accelerated computing to significantly improve forecasting accuracy for 500 million item-by-store combinations across 4,500 stores.

As Walmart’s data science team built more robust machine learning algorithms to take on this mammoth forecasting challenge, the existing computing environment began to falter, with jobs failing to complete or generating inaccurate results. The company found that data scientists were having to remove features from algorithms just so they would run to completion.

To improve its forecasting operations, Walmart started using NVIDIA GPUs and RAPIDS. The company now uses a forecasting model with 350 data features to predict sales across all product categories. These features encompass sales data, promotional events, and external factors like weather conditions and major events like the Super Bowl, which influence demand.

Advanced models helped Walmart improve forecast accuracy from 94% to 97% while eliminating an estimated $100 million in fresh produce waste and reducing stockout and markdown scenarios. GPUs also ran models 100x faster, with jobs completing in just four hours, an operation that would've taken several weeks in a CPU environment.

By shifting data-intensive operations to GPUs and accelerated computing, retailers can lower both their cost and their carbon footprint while delivering best-fit choices and lower prices to shoppers.

Public Sector Improves Disaster Preparedness 

Drones and satellites capture huge amounts of aerial image data that public and private organizations use to predict weather patterns, track animal migrations and observe environmental changes. This data is invaluable for research and planning, enabling more informed decision-making in fields like agriculture, disaster management and efforts to combat climate change. However, the value of this imagery can be limited if it lacks specific location metadata.

A federal agency working with NVIDIA needed a way to automatically pinpoint the location of images missing geospatial metadata, which is essential for missions such as search and rescue, responding to natural disasters and monitoring the environment. However, identifying a small area within a larger region using an aerial image without metadata is extremely challenging, akin to locating a needle in a haystack. Algorithms designed to help with geolocation must address variations in image lighting and differences due to images being taken at various times, dates and angles.

To identify non-geotagged aerial images, NVIDIA, Booz Allen and the government agency collaborated on a solution that uses computer vision algorithms to extract information from image pixel data to scale the image similarity search problem.

When attempting to solve this problem, an NVIDIA solutions architect first used a Python-based application. Initially running on CPUs, processing took more than 24 hours. GPUs supercharged this to just minutes, performing thousands of data operations in parallel versus only a handful of operations on a CPU. By shifting the application code to CuPy, an open-source GPU-accelerated library, the application experienced a remarkable 1.8-million-x speedup, returning results in 67 microseconds.
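The kind of array code that moves easily to CuPy can be sketched as follows. This is not the application described above. It is a minimal similarity search over hypothetical image descriptors, written against NumPy so it runs anywhere; because CuPy mirrors much of the NumPy API, the assumption is that changing the import to `import cupy as xp` would run the same functions on a GPU. The name `top_k_similar` and the tiny descriptor vectors are made up for illustration.

```python
import numpy as xp  # assumption: on a GPU machine, "import cupy as xp" is a near drop-in swap


def top_k_similar(query, gallery, k=3):
    """Return indices and cosine scores of the k gallery rows closest to query."""
    q = query / xp.linalg.norm(query)
    g = gallery / xp.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # one matrix-vector product scores every row at once
    order = xp.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]


# Hypothetical descriptors: each row stands in for features extracted from one map tile.
gallery = xp.asarray([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = xp.asarray([2.0, 0.0, 0.0])     # descriptor of the image to locate

indices, scores = top_k_similar(query, gallery, k=2)
print(indices, scores)
```

Expressing the search as whole-array operations rather than Python loops is what lets a GPU library execute thousands of comparisons in parallel.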

With a solution that can process images and the data of large land masses in just minutes, organizations can gain access to the critical information needed to respond more quickly and effectively to emergencies and plan proactively, potentially saving lives and safeguarding the environment.

Accelerate AI Initiatives and Deliver Business Results

Companies using accelerated computing for data processing are advancing AI initiatives and positioning themselves to innovate and perform at higher levels than their peers.

Accelerated computing handles larger datasets more efficiently, enables faster model training and selection of optimal algorithms, and facilitates more precise results for live AI solutions.

Enterprises that use it can achieve superior price-performance ratios compared to traditional CPU-based systems and enhance their ability to deliver outstanding results and experiences to customers, employees and partners.

Learn how accelerated computing helps organizations achieve AI objectives and drive innovation. 
