June 2024 – Page 11

LST-Bench: A new benchmark tool for open table formats in the data lake

This paper was presented at the ACM SIGMOD/Principles of Database Systems Conference (SIGMOD/PODS 2024), the premier forum on large-scale data management and databases.

As organizations grapple with ever-expanding datasets, the adoption of data lakes has become a vital strategy for scalable and cost-effective data management. The success of these systems largely depends on the file formats used to store the data. Traditional formats, while efficient in data compression and organization, falter with frequent updates. Advanced table formats like Delta Lake, Apache Iceberg, and Apache Hudi offer promising solutions with easier data modifications and historical tracking, yet their efficacy lies in their ability to handle continuous updates, a challenge that requires extensive and thorough evaluation.

Our paper, “LST-Bench: Benchmarking Log-Structured Tables in the Cloud,” presented at SIGMOD 2024, introduces an innovative tool designed to evaluate the performance of different table formats in the cloud. LST-Bench builds on the well-established TPC-DS benchmark, which measures how efficiently systems handle large datasets and complex queries, and includes features specifically designed for table formats, simplifying the process of testing them under real-world conditions. Additionally, it automatically conducts tests and collects essential data from both the computational engine and various cloud services, enabling accurate performance evaluation.

Flexible and adaptive testing

Designed for flexibility, LST-Bench adapts to a broad range of scenarios, as illustrated in Figure 1. The framework was developed by incorporating insights from engineers, facilitating the integration of existing workloads like TPC-DS, while promoting reusability. For example, each test session establishes a new connection to the data-processing engine, organizing tasks as a series of statements. This setup permits developers to run multiple tasks either sequentially within a single session or concurrently across various sessions, reflecting real-world application patterns.

Figure 1. Workload components in LST-Bench and their relationships. A task is a sequence of SQL statements, while a session is a sequence of tasks that represents a logical unit of work or a user session. A phase is a group of concurrent sessions that must be completed before the next phase can start. Lastly, a workload is a sequence of phases.
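The hierarchy described in the Figure 1 caption (tasks, sessions, phases, workload) can be sketched as plain data classes. This is only an illustration of the relationships, not LST-Bench's actual configuration format or API; the class names, fields, and SQL statements below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """A sequence of SQL statements."""
    name: str
    statements: List[str]


@dataclass
class Session:
    """A sequence of tasks run over one engine connection."""
    tasks: List[Task]


@dataclass
class Phase:
    """A group of concurrent sessions; all must finish before the next phase starts."""
    name: str
    sessions: List[Session]


@dataclass
class Workload:
    """A sequence of phases executed in order."""
    name: str
    phases: List[Phase] = field(default_factory=list)

    def statement_count(self) -> int:
        # Walk the full hierarchy: phases -> sessions -> tasks -> statements.
        return sum(
            len(task.statements)
            for phase in self.phases
            for session in phase.sessions
            for task in session.tasks
        )


# Hypothetical tasks with placeholder SQL.
load = Task("load", ["CREATE TABLE t (id INT)", "INSERT INTO t VALUES (1)"])
single_user = Task("single_user", ["SELECT COUNT(*) FROM t"])
maintenance = Task("data_maintenance", ["DELETE FROM t WHERE id = 1"])

workload = Workload(
    name="demo",
    phases=[
        Phase("load", [Session([load])]),
        # Two sessions in one phase: a reader and a writer running concurrently.
        Phase("read_write", [Session([single_user]), Session([maintenance])]),
    ],
)
print(workload.statement_count())
```

Modeling a phase as a list of sessions makes the concurrency boundary explicit: everything inside one phase may overlap, while phases themselves run strictly one after another.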

The TPC-DS workload comprises the following foundational tasks:

Load task: Loads data into tables for experimentation.
Single User task: Executes complex queries to test the engine’s upper performance limit.
Data Maintenance task: Handles data insertions and deletions.

LST-Bench introduces the following tasks specific to table formats:

Optimize task: Compacts the data files within a table.
Time Travel task: Enables querying data as it appeared at a specified point in the past.
Parameterized Custom task: Allows for the integration of user-defined code to create dynamic workflows.

These features enable LST-Bench to evaluate aspects of table formats that are not covered by TPC-DS, providing deeper insights into their performance, as shown in Figure 2.

A diagram illustrating various LST-Bench tasks combined to create workloads that provide insights into table formats. The workloads assess the handling of frequent data modifications over time, optimizing tables for multiple modifications of varying sizes, managing simultaneous reading and writing sessions, querying data across different time points, and evaluating the impact of batch size variations on read query performance. — Figure 2. LST-Bench expands on TPC-DS by introducing a flexible workload representation and incorporating extensions that help users gain insights into table formats previously overlooked by the original benchmark.

A degradation rate metric to measure stability

In addition to these workload extensions, LST-Bench introduces new metrics to evaluate table formats both comprehensively and fairly. It retains the traditional metric categories like performance, storage, and compute efficiency, and it adds a new stability metric called degradation rate. This new metric specifically addresses the impact of accumulating small files in the data lake—a common issue arising from frequent, small updates—providing an assessment of the system’s efficiency over time.

The degradation rate is calculated by dividing a workload into different phases. The degradation rate \(S_{DR}\) is defined as follows:

\(S_{DR} = \dfrac{1}{n}\sum\limits_{i=1}^{n}\dfrac{M_{i} - M_{i-1}}{M_{i-1}}\)

Here, \(M_i\) represents the performance or efficiency metric value of the \(i^{th}\) iteration of a workload phase, and \(n\) reflects the total number of iterations of that phase. Intuitively, \(S_{DR}\) is the rate at which a metric grows or shrinks, reflecting cumulative effects of changes in the underlying system’s state. This rate provides insight into how quickly a system degrades over time. A stable system demonstrates a low \(S_{DR}\), indicating minimal degradation.
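As a worked illustration, the degradation rate can be computed in a few lines of Python. This is a minimal sketch, not code from LST-Bench: the function name is hypothetical, and it assumes the input list starts with a baseline measurement M_0 followed by M_1 through M_n, which is one reading of the summation bounds in the formula.

```python
from typing import Sequence


def degradation_rate(metrics: Sequence[float]) -> float:
    """Compute S_DR from metric values M_0, M_1, ..., M_n (assumed ordering)."""
    if len(metrics) < 2:
        raise ValueError("need at least two measurements (M_0 and M_1)")
    n = len(metrics) - 1
    total = 0.0
    for i in range(1, n + 1):
        # Relative change of the metric between consecutive iterations.
        total += (metrics[i] - metrics[i - 1]) / metrics[i - 1]
    return total / n


# Hypothetical query latencies (seconds) that grow by 10% per iteration.
latencies = [100.0, 110.0, 121.0]
print(degradation_rate(latencies))
```

With latencies growing by 10% at each iteration, the result is approximately 0.1, while a phase whose metric stays constant yields 0.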

LST-Bench implementation

LST-Bench features a Java-based client application that runs SQL workloads on various engines. Users can define task, session, and phase libraries to reuse workload components across experiments. They can then reference these libraries in their workload definitions, add new task templates, or create entirely new task libraries to model specific scenarios.

LST-Bench also includes a processing module that consolidates experimental results and calculates metrics to provide insights into table formats and engines. It uses both internal telemetry from LST-Bench and external telemetry from cloud services, such as resource utilization, storage API calls, and network I/O volume. The metrics processor offers multiple visualization options, including notebooks and a web app, to help users analyze performance data effectively.

An illustration depicting the components and execution model of the LST-Bench tool. The Client Application establishes connections with engines via dedicated drivers, while the Metrics Processor gathers telemetry from the Client Application, engines, and other cloud services. This data is aggregated and visualized using either a notebook or web application. — Figure 3. The LST-Bench tool components and execution model.

Implications and looking ahead

LST-Bench integrates seamlessly into the testing workflows of the Microsoft Fabric warehouse, allowing that team to rigorously assess engine performance, evaluate releases, and identify any issues. This leads to a more reliable and optimized user experience on the Microsoft Fabric data analytics platform. Additionally, LST-Bench holds promise as a foundational tool for various Microsoft initiatives. It’s currently instrumental in research projects focused on improving data organization for table formats, with the goal of increasing the performance of customer workloads on Microsoft Fabric. LST-Bench is also being used to evaluate the performance of table formats converted using Apache XTable (Incubating), an open-source tool designed to prevent data silos within data lakes.

LST-Bench is open source, and we welcome contributors to help expand this tool, making it highly effective for organizations to thoroughly evaluate their table formats.

Acknowledgements

We would like to thank Joyce Cahoon and Yiwen Zhu for their valuable discussions on the stability metric, and Jose Medrano and Emma Rose Wirshing for their feedback on LST-Bench and their work on integrating it with the Microsoft Fabric Warehouse.

The post LST-Bench: A new benchmark tool for open table formats in the data lake appeared first on Microsoft Research.

Code generation using Code Llama 70B and Mixtral 8x7B on Amazon SageMaker

In the ever-evolving landscape of machine learning and artificial intelligence (AI), large language models (LLMs) have emerged as powerful tools for a wide range of natural language processing (NLP) tasks, including code generation. Among these cutting-edge models, Code Llama 70B stands out as a true heavyweight, boasting an impressive 70 billion parameters. Developed by Meta and now available on Amazon SageMaker, this state-of-the-art LLM promises to revolutionize the way developers and data scientists approach coding tasks.

What are Code Llama 70B and Mixtral 8x7B?

Code Llama 70B is a variant of the Code Llama foundation model (FM), a fine-tuned version of Meta’s renowned Llama 2 model. This massive language model is specifically designed for code generation and understanding, capable of generating code from natural language prompts or existing code snippets. With its 70 billion parameters, Code Llama 70B offers unparalleled performance and versatility, making it a game-changer in the world of AI-assisted coding.

Mixtral 8x7B is a state-of-the-art sparse mixture of experts (MoE) foundation model released by Mistral AI. It supports multiple use cases such as text summarization, classification, text generation, and code generation. It is an 8x model, which means it contains eight distinct groups of parameters. The model has about 45 billion total parameters and supports a context length of 32,000 tokens. MoE is a type of neural network architecture that consists of multiple “experts,” where each expert is a neural network. In the context of transformer models, MoE replaces some feed-forward layers with sparse MoE layers. These layers have a certain number of experts, and a router network selects which experts process each token at each layer. Because only the selected experts run for a given token, MoE models enable more compute-efficient and faster inference compared to dense models of the same total size.
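To make the routing idea concrete, here is a toy sketch of a sparse MoE layer in plain Python. It is not Mixtral's implementation: the experts are hypothetical scaling functions, the router scores are hard-coded rather than produced by a learned network, and selecting the top two of eight experts per token is used here as an illustrative setting.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(
    token: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Route one token to its top_k experts and mix their outputs."""
    # Pick the experts with the highest router scores.
    order = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    # Renormalize the scores over the chosen experts only.
    weights = softmax([router_logits[i] for i in chosen])
    output = [0.0] * len(token)
    for weight, index in zip(weights, chosen):
        expert_out = experts[index](token)  # only the chosen experts run
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


# Eight toy "experts": expert i scales the token by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
# Hypothetical router scores that favor experts 2 and 5 equally.
router_logits = [0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0, 0.0]
print(moe_layer([1.0, 2.0], router_logits, experts))  # prints [4.5, 9.0]
```

Only two of the eight expert functions are evaluated for the token, which is the source of the compute savings relative to a dense layer with the same total number of parameters.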

Key features and capabilities of Code Llama 70B and Mixtral 8x7B include:

Code generation: These LLMs excel at generating high-quality code across a wide range of programming languages, including Python, Java, C++, and more. They can translate natural language instructions into functional code, streamlining the development process and accelerating project timelines.
Code infilling: In addition to generating new code, they can seamlessly infill missing sections of existing code by providing the prefix and suffix. This feature is particularly valuable for enhancing productivity and reducing the time spent on repetitive coding tasks.
Natural language interaction: The instruct variants of Code Llama 70B and Mixtral 8x7B support natural language interaction, allowing developers to engage in conversational exchanges to develop code-based solutions. This intuitive interface fosters collaboration and enhances the overall coding experience.
Long context support: With the ability to handle context lengths of up to 48 thousand tokens, Code Llama 70B can maintain coherence and consistency over extended code segments or conversations, ensuring relevant and accurate responses. Mixtral 8x7B has a context window of 32 thousand tokens.
Multi-language support: While both of these models excel at generating code, their capabilities extend beyond programming languages. They can also assist with natural language tasks, such as text generation, summarization, and question answering, making them versatile tools for various applications.

Harnessing the power of Code Llama 70B and Mistral models on SageMaker

Amazon SageMaker, a fully managed machine learning service, provides a seamless integration with Code Llama 70B, enabling developers and data scientists to use its capabilities with just a few clicks. Here’s how you can get started:

One-click deployment: Code Llama 70B and Mixtral 8x7B are available in Amazon SageMaker JumpStart, a hub that provides access to pre-trained models and solutions. With a few clicks, you can deploy them and create a private inference endpoint for your coding tasks.
Scalable infrastructure: The SageMaker scalable infrastructure ensures that foundation models can handle even the most demanding workloads, allowing you to generate code efficiently and without delays.
Integrated development environment: SageMaker provides a seamless integrated development environment (IDE) that you can use to interact with these models directly from your coding environment. This integration streamlines the workflow and enhances productivity.
Customization and fine-tuning: While Code Llama 70B and Mixtral 8x7B are powerful out-of-the-box models, you can use SageMaker to fine-tune and customize a model to suit your specific needs, further enhancing its performance and accuracy.
Security and compliance: SageMaker JumpStart employs multiple layers of security, including data encryption, network isolation, VPC deployment, and customizable inference, to ensure the privacy and confidentiality of your data when working with LLMs.

Solution overview

The following figure showcases how code generation can be done using the Llama and Mistral AI Models on SageMaker presented in this blog post.

You first deploy a SageMaker endpoint using an LLM from SageMaker JumpStart. For the examples presented in this article, you deploy either a Code Llama 70B or a Mixtral 8x7B endpoint. After the endpoint has been deployed, you can use it to generate code with the prompts provided in this article and the associated notebook, or with your own prompts. After the code has been generated with the endpoint, you can use a notebook to test the code and its functionality.

Prerequisites

In this section, you sign up for an AWS account and create an AWS Identity and Access Management (IAM) admin user.

If you’re new to SageMaker, we recommend that you read What is Amazon SageMaker?.

Use the following hyperlinks to finish setting up the prerequisites for an AWS account and SageMaker:

Create an AWS Account: This walks you through setting up an AWS account.
When you create an AWS account, you get a single sign-in identity that has complete access to all of the AWS services and resources in the account. This identity is called the AWS account root user.
Signing in to the AWS Management Console using the email address and password that you used to create the account gives you complete access to all of the AWS resources in your account. We strongly recommend that you not use the root user for everyday tasks, even the administrative ones.
Adhere to the security best practices in IAM, and Create an Administrative User and Group. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.
In the console, go to the SageMaker console and open the left navigation pane.
1. Under Admin configurations, choose Domains.
2. Choose Create domain.
3. Choose Set up for single user (Quick setup). Your domain and user profile are created automatically.
Follow the steps in Custom setup to Amazon SageMaker to set up SageMaker for your organization.

With the prerequisites complete, you’re ready to continue.

Code generation scenarios

The Mixtral 8x7B and Code Llama 70B models require an ml.g5.48xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. To deploy an endpoint using SageMaker JumpStart, you might need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint use. You can request service quota increases through the AWS console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.
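If you prefer to check the quota through the API, the following sketch shows one way to look it up with the Service Quotas client in Boto3. The helper names are hypothetical, and the quota name pattern ("<instance type> for endpoint usage") is an assumption that you should confirm in the Service Quotas console for your account.

```python
from typing import Dict, Iterable, Iterator, Optional


def find_endpoint_quota(quotas: Iterable[Dict], instance_type: str) -> Optional[Dict]:
    """Return the quota entry for endpoint usage of instance_type, if present."""
    wanted = f"{instance_type} for endpoint usage"  # assumed quota naming
    for quota in quotas:
        if quota.get("QuotaName") == wanted:
            return quota
    return None


def list_sagemaker_quotas(region_name: str) -> Iterator[Dict]:
    """Yield every SageMaker quota in a Region (requires AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        yield from page["Quotas"]
```

For example, `find_endpoint_quota(list_sagemaker_quotas("us-east-1"), "ml.g5.48xlarge")` returns the matching entry, whose `Value` field holds the current limit. If the limit is 0, you can pass the entry's `QuotaCode` to the client's `request_service_quota_increase` call along with `ServiceCode="sagemaker"` and a `DesiredValue`.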

Code Llama use cases with SageMaker

While Code Llama excels at generating simple functions and scripts, its capabilities extend far beyond that. The models can generate complex code for advanced applications, such as building neural networks for machine learning tasks. Let’s explore an example of using Code Llama to create a neural network on SageMaker. Let us start with deploying the Code Llama Model through SageMaker JumpStart.

Launch SageMaker JumpStart
Sign in to the console, navigate to SageMaker, and launch the SageMaker domain to open SageMaker Studio. Within SageMaker Studio, select JumpStart in the left-hand navigation menu.
Search for Code Llama 70B
In the JumpStart model hub, search for Code Llama 70B in the search bar. You should see the Code Llama 70B model listed under the Models category.
Deploy the Model
Select the Code Llama 70B model, and then choose Deploy. Enter an endpoint name (or keep the default value) and select the target instance type (for example, ml.g5.48xlarge). Choose Deploy to start the deployment process. You can leave the rest of the options as default.

Additional details on deployment can be found in Code Llama 70B is now available in Amazon SageMaker JumpStart.
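The same deployment can also be done programmatically with the SageMaker Python SDK instead of the Studio UI. The following is a minimal sketch: the function name is hypothetical, and the model ID shown is an assumption that you should confirm in the JumpStart model hub before use. Running it requires the `sagemaker` package, AWS credentials, and sufficient quota for the chosen instance type.

```python
def deploy_code_llama(instance_type: str = "ml.g5.48xlarge"):
    """Deploy Code Llama 70B from SageMaker JumpStart and return a predictor."""
    # Imported lazily: requires the `sagemaker` package and AWS credentials.
    from sagemaker.jumpstart.model import JumpStartModel

    # Assumed JumpStart model ID; confirm it in the JumpStart model hub.
    model = JumpStartModel(
        model_id="meta-textgeneration-llama-codellama-70b",
        instance_type=instance_type,
    )
    # Llama-family models require explicitly accepting the end-user license agreement.
    return model.deploy(accept_eula=True)
```

The returned predictor exposes the endpoint name as `predictor.endpoint_name`. When you are finished, calling `predictor.delete_endpoint()` stops the endpoint so that it no longer incurs charges.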

Create an inference endpoint
After the deployment is complete, SageMaker will provide you with an inference endpoint URL. Copy this URL to use later.
Set up your development environment
You can interact with the deployed Code Llama 70B model using Python and the AWS SDK for Python (Boto3). First, make sure you have the required dependencies installed: pip install boto3

Note: This blog post section contains code that was generated with the assistance of Code Llama 70B, powered by Amazon SageMaker.

Generating a transformer model for natural language processing

Let us walk through a code generation example with Code Llama 70B where you will generate a transformer model in Python using the Amazon SageMaker SDK.

Prompt:

<s>[INST]
<<SYS>>You are an expert code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself. Always use Amazon SageMaker SDK for python code generation. Add test case to test the code<</SYS>>

Generate a Python code that defines and trains a Transformer model for text classification on movie dataset. The python code should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]

Response:

Code Llama generates a Python script for training a Transformer model on the sample dataset using TensorFlow and Amazon SageMaker.

Code example:
Create a new Python script (for example, code_llama_inference.py) and add the following code. Replace <YOUR_ENDPOINT_NAME> with the actual inference endpoint name provided by SageMaker JumpStart:

import boto3
import json

# Set up the SageMaker runtime client used to invoke the endpoint
session = boto3.Session()
sagemaker_client = session.client("sagemaker-runtime")

# Set the inference endpoint name
endpoint_name = "<YOUR_ENDPOINT_NAME>"

def query_endpoint(payload):
    # Send the JSON payload to the endpoint and parse the JSON response
    response = sagemaker_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload).encode('utf-8'),
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

def print_completion(prompt: str, response: list) -> None:
    # ANSI escape codes that switch bold text on and off in the terminal
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}{bold}\n> Output{unbold}\n{response[0]['generated_text']}\n")

# Define the input prompt 
prompt = """ 
<s>[INST]
<<SYS>>You are an expert code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself. Always use Amazon SageMaker SDK for python code generation. Add test case to test the code<</SYS>>
 
Generate a Python code that defines and trains a Transformer model for text classification on movie dataset. The python code should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST] """  

# Send the request to the endpoint and decode the response 
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 2000, "temperature": 0.2, "top_p": 0.9}, }  
response = query_endpoint(payload) 
print_completion(prompt, response)

Save the script and run it:

python code_llama_inference.py

The script will send the provided prompt to the Code Llama 70B model deployed on SageMaker, and the model’s response will be printed to the output.

Example output:

Input

<s>[INST]
<<SYS>>
You are an expert AI assistant specializing in generating Python code for natural language processing (NLP) tasks on Amazon SageMaker. Your code should follow best practices, be well-documented, and include examples of usage.
<</SYS>>

Generate a Python script that defines and trains a Transformer model for text classification on movie dataset. The script should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]

> Output

[RESP]
import os
import json
import sagemaker
import boto3
import numpy as np
import pandas as pd
from sagemaker.tensorflow import TensorFlow
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

# Set up SageMaker session
sagemaker_session = sagemaker.Session()
region = boto3.Session().region_name
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-xSum"

# Download dataset
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Prepare dataset
train_data = pd.DataFrame(
{"text": train.data, "label": train.target.astype(np.int32)}
)
test_data = pd.DataFrame({"text": test.data, "label": test.target.astype(np.int32)})

# Split dataset
train_data, validation_data = train_test_split(train_data, test_size=0.2)

# Upload dataset to S3
train_input = sagemaker_session.upload_data(
path=train_data.to_csv(header=False, index=False),
bucket=bucket,
key_prefix=os.path.join(prefix, "train"),
)
validation_input = sagemaker_session.upload_data(
path=validation_data.to_csv(header=False, index=False),
bucket=bucket,
key_prefix=os.path.join(prefix, "validation"),
)
test_input = sagemaker_session.upload_data(
path=test_data.to_csv(header=False, index=False),
bucket=bucket,
key_prefix=os.path.join(prefix, "test"),
)

# Define hyperparameters
hyperparameters = {
"epochs": 1,
"batch_size": 64,
"max_seq_length": 256,
"model_dir": "/opt/ml/model",
"train_dir": "/opt/ml/input/data/train",
"validation_dir": "/opt/ml/input/data/validation",
"test_dir": "/opt/ml/input/data/test",
"train_filename": "train.csv",
"validation_filename": "validation.csv",
"test_filename": "test.csv",
"freeze_bert_layer": True,
"learning_rate": 0.0001,
"epsilon": 1e-08,
"bert_trainable_layers": None,
"bert_layer": -2,
"lambda_regularizer": 0.01,
"gradient_accumulation_steps": 2,
"steps_per_epoch": 100,
"validation_steps": 10,
"test_steps": 10,
"output_dir": "/opt/ml/output",
}

# Define estimator
estimator = TensorFlow(
entry_point="train.py",
source_dir="code",
role=sagemaker.get_execution_role(),
instance_count=1,
instance_type="ml.p3.2xlarge",
framework_version="2.4.1",
py_version="py37",
hyperparameters=hyperparameters,
)

# Train model
estimator.fit(
{
"train": train_input,
"validation": validation_input,
"test": test_input,
},
wait=True,
logs=True,
)

# Deploy model
predictor = estimator.deploy(
initial_instance_count=1, instance_type="ml.p3.2xlarge", wait=True
)

# Predict
predictor.predict(json.dumps({"instances": [test_data.iloc[0]["text"]]}))
<</RESP>>

You can modify the prompt variable to request different code generation tasks or engage in natural language interactions with the model.
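To make swapping prompts easier, the prompt and payload construction can be factored into small helpers. This is a sketch rather than part of the original script: the helper names are hypothetical, and the template simply mirrors the `[INST]` and `<<SYS>>` markers and the parameter names used in the example above.

```python
def build_instruct_prompt(system: str, user: str) -> str:
    """Wrap a system message and a user request in the [INST]/<<SYS>> markers."""
    return (
        "<s>[INST]\n"
        f"<<SYS>>{system.strip()}<</SYS>>\n\n"
        f"{user.strip()}\n"
        "[/INST]"
    )


def build_payload(
    prompt: str,
    max_new_tokens: int = 2000,
    temperature: float = 0.2,
    top_p: float = 0.9,
) -> dict:
    """Build the request body with the same parameter names used above."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }


# Hypothetical request: any system and user text can be substituted here.
prompt = build_instruct_prompt(
    "You are an expert code assistant. Your language of choice is Python.",
    "Generate a Python function that reverses a string.",
)
payload = build_payload(prompt, max_new_tokens=256)
```

The resulting `payload` can be passed directly to `query_endpoint` from the script above.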

This example demonstrates how to deploy and interact with the Code Llama 70B model on SageMaker JumpStart using Python and the AWS SDK. Because the model can make minor errors in its generated output, make sure you run and test the code before relying on it. You can also instruct the model to review its output and refine the response to fix any remaining errors in the code. With this setup, you can use the code generation capabilities of Code Llama 70B within your development workflows, streamlining the coding process and improving productivity. Let’s take a look at some additional examples.
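A lightweight first check on generated code is to confirm that it at least parses before you try to run it. The sketch below uses Python's standard `ast` module; the helper names are hypothetical, and the wrapper patterns it strips (`[RESP]`, `<</RESP>>`, and Markdown fences) are based only on the example outputs in this post, so other responses may need different handling.

```python
import ast
import re

FENCE = "`" * 3  # Markdown code fence marker


def extract_code(generated_text: str) -> str:
    """Strip the [RESP] wrapper and Markdown fences that a response may contain."""
    text = re.sub(r"\[/?RESP\]|<</?RESP>>", "", generated_text)
    pattern = re.escape(FENCE) + r"(?:python)?\n(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, flags=re.DOTALL)
    return (fenced.group(1) if fenced else text).strip()


def syntax_errors(code: str) -> list:
    """Return a one-item list describing the parse error, or an empty list."""
    try:
        ast.parse(code)
    except SyntaxError as err:
        return [f"line {err.lineno}: {err.msg}"]
    return []
```

A clean parse only rules out syntax errors. It does not verify that the API calls in the generated script exist or behave as intended, so running the code in a test environment is still necessary.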

Additional examples and use cases

Let’s walk through some other complex code generation scenarios. In the following sample, we’re running the script to generate a Deep Q reinforcement learning (RL) agent for playing the CartPole-v1 environment.

Generating a reinforcement learning agent

The following prompt was tested on Code Llama 70B to generate a Deep Q RL agent for playing the CartPole-v1 environment.

Prompt:

<s>[INST]
<<SYS>>
You are a skilled AI assistant capable of generating Python code for reinforcement learning tasks on Amazon SageMaker. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>

Could you please generate a Python script that implements a Deep Q-Network (DQN) agent for playing the CartPole-v1 environment? The script should use Amazon SageMaker's TensorFlow estimator and be ready for deployment on SageMaker.
[/INST]

Response: Code Llama generates a Python script for training a DQN agent on the CartPole-v1 environment using TensorFlow and Amazon SageMaker as showcased in our GitHub repository.

Generating a distributed training script

In this scenario, you will generate sample Python code for distributed machine learning training on Amazon SageMaker using Code Llama 70B.

Prompt:

<s>[INST]
<<SYS>>
You are an expert AI assistant skilled in generating Python code for distributed machine learning training on Amazon SageMaker. Your code should be optimized for performance, follow best practices, and include examples of usage.
<</SYS>>

Could you please generate a Python script that performs distributed training of a deep neural network for image classification on the ImageNet dataset? The script should use Amazon SageMaker's PyTorch estimator with distributed data parallelism and be ready for deployment on SageMaker.
[/INST]

Response: Code Llama generates a Python script for distributed training of a deep neural network on the ImageNet dataset using PyTorch and Amazon SageMaker. Additional details are available in our GitHub repository.

Mixtral 8x7B use cases with SageMaker

Compared to traditional LLMs, Mixtral 8x7B offers the advantage of faster decoding at the speed of a smaller, parameter-dense model despite containing more parameters. It also outperforms other open-access models on certain benchmarks and supports a longer context length.

Launch SageMaker JumpStart
Sign in to the console, navigate to SageMaker, and launch the SageMaker domain to open SageMaker Studio. Within SageMaker Studio, select JumpStart in the left-hand navigation menu.
Search for Mixtral 8x7B Instruct
In the JumpStart model hub, search for Mixtral 8x7B Instruct in the search bar. You should see the Mixtral 8x7B Instruct model listed under the Models category.
Deploy the Model
Select the Mixtral 8x7B Instruct model, and then choose Deploy. Enter an endpoint name (or keep the default value) and choose the target instance type (for example, ml.g5.48xlarge). Choose Deploy to start the deployment process. You can leave the rest of the options as default.

Additional details on deployment can be found in Mixtral-8x7B is now available in Amazon SageMaker JumpStart.

Create an inference endpoint
After the deployment is complete, SageMaker will provide you with an inference endpoint URL. Copy this URL to use later.

Generating a hyperparameter tuning script for SageMaker

Hyperparameters are external configuration variables that data scientists use to manage machine learning model training. Sometimes called model hyperparameters, the hyperparameters are manually set before training a model. They’re different from parameters, which are internal parameters automatically derived during the learning process and not set by data scientists. Hyperparameters directly control model structure, function, and performance.

When you build complex machine learning systems like deep learning neural networks, exploring all the possible combinations is impractical. Hyperparameter tuning can accelerate your productivity by trying many variations of a model. It looks for the best model automatically by focusing on the most promising combinations of hyperparameter values within the ranges that you specify. To get good results, you must choose the right ranges to explore.

SageMaker automatic model tuning (AMT) finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create a model that performs the best, as measured by a metric that you choose.
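The core loop behind hyperparameter tuning can be illustrated with a toy example in plain Python. This sketch uses simple random search as a stand-in and is not the algorithm AMT runs; the objective function is a hypothetical error surface that replaces a real training job, and the `alpha` and `eta` ranges are chosen only for illustration.

```python
import random
from typing import Callable, Dict, Tuple


def random_search(
    objective: Callable[[Dict[str, float]], float],
    ranges: Dict[str, Tuple[float, float]],
    max_jobs: int = 10,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Try max_jobs random configurations and return the one with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(max_jobs):
        # Sample each hyperparameter uniformly from its allowed range.
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(params)  # stands in for one training job
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params: Dict[str, float]) -> float:
    # Hypothetical error surface with its minimum at alpha=0.5, eta=0.1.
    return (params["alpha"] - 0.5) ** 2 + (params["eta"] - 0.1) ** 2


ranges = {"alpha": (0.1, 1.0), "eta": (0.01, 0.3)}
best_params, best_score = random_search(toy_validation_error, ranges, max_jobs=20)
print(best_params, best_score)
```

The `max_jobs` budget and the chosen ranges determine how well the search can do, which is why selecting sensible ranges matters for any tuning strategy.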

Note: This blog post section contains code that was generated with the assistance of the Mixtral 8x7B model, powered by Amazon SageMaker.

Prompt:

<s>[INST]
<<SYS>>
You are an AI assistant specializing in generating Python code for hyperparameter tuning of machine learning models on Amazon SageMaker. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>

Generate a Python script that performs hyperparameter tuning of an XGBoost model for regression on the Boston Housing dataset. The script should use Amazon SageMaker's XGBoost estimator and be ready for deployment on SageMaker. It should optimize for validation:accuracy, and optimize the hyperparameters alpha and eta.

[/INST]

Response:

Here's a Python script that performs hyperparameter tuning of an XGBoost model for regression on the Boston Housing dataset using Amazon SageMaker. This script uses the XGBoost estimator and optimizes for validation:accuracy while tuning the hyperparameters `alpha` and `eta`.

```python
import boto3
import numpy as np
import pandas as pd
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# Set up SageMaker session and specify the IAM role
sagemaker_session = boto3.Session(region_name='us-west-2')
role = sagemaker_session.get_execution_role()

# Specify the SageMaker XGBoost image version
xgboost_image = 'xgboost:1.3-1'

# Define the hyperparameter tuning configuration
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0.1, 1.0),
    'eta': ContinuousParameter(0.01, 0.3)
}

objective_metric_name = 'validation:accuracy'
metric_definitions = [
    {
        'Name': 'validation:accuracy',
        'Regex': 'validation-accuracy: ([0-9\.]+)'
    }
]

tuner = HyperparameterTuner(
    XGBoost(
        role=role,
        train_instance_count=1,
        train_instance_type='ml.m5.large',
        sagemaker_session=sagemaker_session,
        hyperparameters={'objective': 'reg:squarederror', 'num_boost_round': 100},
        base_job_name='xgboost-tuner-boston'
    ),
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=10,
    max_parallel_jobs=3
)

# Load the Boston Housing dataset
boston_data = pd.read_csv('boston_housing.csv')
train_data = boston_data.sample(frac=0.8, random_state=42)
test_data = boston_data.drop(train_data.index)

train_input_path = sagemaker_session.upload_data(path=train_data, key_prefix='xgboost-tuner/input/train')
test_input_path = sagemaker_session.upload_data(path=test_data, key_prefix='xgboost-tuner/input/test')

# Set up the training and validation datasets
train_channel = sagemaker_session.input_channel(name='train', path=train_input_path)
test_channel = sagemaker_session.input_channel(name='test', path=test_input_path)

# Define the SageMaker XGBoost estimator for training
xgboost_estimator = XGBoost(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',
    sagemaker_session=sagemaker_session,
    hyperparameters={'objective': 'reg:squarederror', 'num_boost_round': 100},
    base_job_name='xgboost-tuner-boston'
)

# Perform hyperparameter tuning
tuner.fit({'train': train_channel, 'test': test_channel}, include_cls_metadata=False)

# Get the best hyperparameters and create a new XGBoost estimator
best_training_job = tuner.best_training_job()
best_hyperparameters = tuner.best_hyperparameters
print(f'Best hyperparameters: {best_hyperparameters}')

new_xgboost_estimator = XGBoost(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',
    sagemaker_session=sagemaker_session,
    hyperparameters=best_hyperparameters,
    base_job_name='xgboost-tuner-boston'
)

# Train the model with the best hyperparameters
new_xgboost_estimator.fit({'train': train_channel, 'test': test_channel}, include_cls_metadata=False)

# Save the trained model for deployment
model_data = new_xgboost_estimator.create_model()
model_name = 'xgboost-tuner-boston-model'
sagemaker_session.create_model(model_name=model_name, model_data=model_data)

# Deploy the model
predictor = sagemaker_session.create_predictor(model_name,
                                                base_job_name='xgboost-tuner-boston',
                                                deploy_prebuilt_image=True,
                                                entry_point='inference.py',
                                                source_dir='xgboost-tuner',
                                                instance_type='ml.m5.large',
                                                initial_instance_count=1)
```

Replace `'boston_housing.csv'` with the path to your Boston Housing dataset. This script assumes that the dataset has columns named `'CRIM'`, `'ZN'`, `'INDUS'`, `'CHAS'`, `'NOX'`, `'RM'`, `'AGE'`, `'DIS'`, `'RAD'`, `'TAX'`, `'PTRATIO'`, `'B'`, `'LSTAT'`, and `'MEDV'`.

To run this script, you need to have the AWS CLI and Boto3 installed. Make sure to configure your AWS credentials using `aws configure`.

The script also requires a file named `inference.py` in the `xgboost-tuner` directory for deployment. Here's a simple example of `inference.py`:

```python
import json
import xgboost as xgb
import numpy as np

def model_fn(model_dir):
    model = xgb.Booster(model_file=os.path.join(model_dir, 'model.joblib'))
    return model

def input_fn(request_body, content_type):
    if content_type == 'text/csv':
        features = np.loadtxt(pd.compat.StringIO(request_body), delimiter=',')
        return {'features': features.reshape(1, -1)}
    raise ValueError("Request body must be csv-formatted.")

def predict_fn(input_data, model):
    features = input_data['features']
    prediction = model.predict(features)
    return {'prediction': float(prediction)}
```

Make sure to install the required libraries (`boto3`, `numpy`, `pandas`, `xgboost`, and `sagemaker`) before running the script.

Code Transformation: Java to Python

There are instances where users need to convert code written in one programming language to another. This is known as a cross-language transformation task, and foundation models can help automate the process.

Prompt:

<s>[INST]
<<SYS>>
You are a programmer specialized in Java and Python languages. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>
Convert the following Java code to its equivalent in Python:

<java>
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        ArrayList<String> cars = new ArrayList<String>();
        cars.add("Volvo");
        cars.add("BMW");
        cars.add("Ford");
        cars.add("Mazda");
        for (int i = 0; i < cars.size(); i++) {
            System.out.println(cars.get(i));
        }
    }
}
</java>
[/INST]

Response:

<python>
import ArrayList from java.util import ArrayList

def main():
    # Create an ArrayList of strings
    cars = ArrayList[str]()

    # Add elements to the ArrayList
    cars.add("Volvo")
    cars.add("BMW")
    cars.add("Ford")
    cars.add("Mazda")

    # Iterate over the ArrayList and print each element
    for i in range(cars.size()):
        print(cars.get(i))

if __name__ == "__main__":
    main()
</python>

Note: This Python code uses the Java ArrayList class, which is part of the JPype library that allows Python programs to call Java code. 
The code above will only work in an environment that has JPype installed and configured.

For pure Python solution, here is the equivalent code:

<python>
def main():
    # Create a list of strings
    cars = ["Volvo", "BMW", "Ford", "Mazda"]

    # Iterate over the list and print each element
    for car in cars:
        print(car)

if __name__ == "__main__":
    main()
</python>

This Python code uses a built-in list data structure instead of the Java ArrayList class. The code above is more idiomatic and efficient in Python.

AWS CDK code for a three-tier web application

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework for defining cloud infrastructure as code with modern programming languages and deploying it through AWS CloudFormation.

The three-tier architecture pattern provides a general framework to ensure decoupled and independently scalable application components can be separately developed, managed, and maintained (often by distinct teams). A three-tier architecture is the most popular implementation of a multi-tier architecture and consists of a single presentation tier, logic tier, and data tier:

Presentation tier: Component that the user directly interacts with (for example, webpages and mobile app UIs).
Logic tier: Code required to translate user actions to application functionality (for example, CRUD database operations and data processing).
Data tier: Storage media (for example, databases, object stores, caches, and file systems) that hold the data relevant to the application.
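
To make the separation of tiers concrete, here is a minimal sketch of how a logic tier might sit between the other two. The `InMemoryStore` class and the `handle_request` function are hypothetical names used only for illustration; a real application would replace the in-memory store with a database client and the dictionary-based request with an HTTP event.

```python
from typing import Any, Dict, Optional


class InMemoryStore:
    """Stand-in for the data tier (hypothetical; a real app would use a database)."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, Any]] = {}

    def put(self, item_id: str, item: Dict[str, Any]) -> None:
        self._items[item_id] = item

    def get(self, item_id: str) -> Optional[Dict[str, Any]]:
        return self._items.get(item_id)

    def delete(self, item_id: str) -> bool:
        # Returns True only if the item existed before deletion.
        return self._items.pop(item_id, None) is not None


def handle_request(store: InMemoryStore, request: Dict[str, Any]) -> Dict[str, Any]:
    """Logic tier: translate a user action into a data-tier operation."""
    action = request.get("action")
    item_id = request.get("id", "")
    if action == "create":
        store.put(item_id, request.get("item", {}))
        return {"status": 201, "body": {"id": item_id}}
    if action == "read":
        item = store.get(item_id)
        if item is None:
            return {"status": 404, "body": {"error": "not found"}}
        return {"status": 200, "body": item}
    if action == "delete":
        found = store.delete(item_id)
        return {"status": 204 if found else 404, "body": {}}
    return {"status": 400, "body": {"error": f"unsupported action: {action}"}}


# The presentation tier would issue requests like these (for example, from a browser).
store = InMemoryStore()
created = handle_request(store, {"action": "create", "id": "42", "item": {"name": "Volvo"}})
fetched = handle_request(store, {"action": "read", "id": "42"})
missing = handle_request(store, {"action": "read", "id": "unknown"})
```

Because the logic tier only talks to the store through `put`, `get`, and `delete`, either side can be swapped out or scaled without touching the other, which is the point of the pattern.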

Prompt:

<s><<SYS>>
You are a programmer specialized in AWS, and programming in Python and the AWS Cloud Development Kit. Your code should be efficient, well-documented, and include examples of usage.
<</SYS>>[INST]
Create a CDK script in Python for configuring and deploying a three tier web application in AWS. The architecture should consist of these three tiers:

- Presentation tier: The front-end application is all static content (HTML, CSS, JavaScript and images) which are generated by React utilities like create-react-app. Amazon CloudFront hosts all these objects. The web application, when used, downloads all the resources to the browser and starts to run from there. The web application connects to the backend calling the APIs.

- Logic tier: Logic layer is built using Lambda functions fronted by API Gateway REST APIs. This architecture shows multiple exposed services. There are multiple different Lambda functions each handling a different aspect of the application. The Lambda functions are behind API Gateway and accessible using API URL paths.

- Data tier: DynamoDB is used for the data storage.
[/INST]

Response:

Sure, here's an example CDK script in Python that configures and deploys a three-tier web application in AWS:
```python
from aws_cdk import (
    aws_apigateway as apigw,
    aws_lambda as _lambda,
    aws_lambda_event_sources as lambda_event_sources,
    aws_logs as logs,
    aws_dynamodb as ddb,
    core
)

class ThreeTierWebAppStack(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Data tier
        self.table = ddb.Table(
            self, "DataTable",
            partition_key=ddb.Attribute(name="id", type=ddb.AttributeType.STRING),
            billing_mode=ddb.BillingMode.PAY_PER_REQUEST
        )

        # Logic tier
        self.lambda_function = _lambda.Function(
            self, "LambdaFunction",
            runtime=_lambda.Runtime.PYTHON_3_8,
            code=_lambda.Code.from_asset("lambda"),
            handler="lambda_function.lambda_handler",
            log_retention=logs.RetentionDays.ONE_DAY,
            environment={
                "TABLE_NAME": self.table.table_name
            }
        )

        self.lambda_function.add_to_role_policy(
            core.PolicyStatement(
                actions=[
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:UpdateItem",
                    "dynamodb:DeleteItem"
                ],
                resources=[self.table.table_arn]
            )
        )

        self.api = apigw.LambdaRestApi(
            self, "Api",
            handler=self.lambda_function,
            proxy=False
        )

        # Presentation tier
        self.distribution = aws_cloudfront.Distribution(
            self, "Distribution",
            default_root_object="index.html",
            default_behavior_options=aws_cloudfront.BehaviorOptions(
                origin=aws_cloudfront.Origin(
                    domain_name=self.api.url,
                    origin_path="/integration/v1"
                ),
                viewer_protocol_policy=aws_cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS
            )
        )
```
Here's a brief explanation of the code:

* The `ThreeTierWebAppStack` class extends the `core.Stack` class and is the main entry point for the CDK script.
* The `table` attribute is a DynamoDB table used for data storage.
* The `lambda_function` attribute is a Lambda function that handles the business logic of the application. It has an IAM role that allows it to read and write to the DynamoDB table.
* The `api` attribute is an API Gateway REST API that exposes the Lambda function as a web service.
* The `distribution` attribute is a CloudFront distribution that serves the static content of the application. It uses the API Gateway as the origin and redirects all HTTP traffic to HTTPS.

To use this script, you need to create a `lambda` directory in the same directory as the script and put your Lambda function code in a file named `lambda_function.py` in that directory. You also need to install the AWS CDK and its dependencies using `pip install aws-cdk.core aws-cdk.aws-lambda aws-cdk.aws-lambda_event_sources aws-cdk.aws-logs aws-cdk.aws-dynamodb aws-cdk.aws-apigateway aws-cdk.aws-cloudfront`.

Once you have done that, you can deploy the stack using `cdk deploy`. This will create the DynamoDB table, the Lambda function, the API Gateway, and the CloudFront distribution. The URL of the CloudFront distribution will be printed to the console after the deployment is complete. You can use this URL to access the web application.

Additional considerations

The following are some additional considerations when implementing these models:

Different models will produce different results, so you should conduct experiments with different foundation models and different prompts for your use case to achieve the desired results.
The analyses provided are not meant to replace human judgement. You should be mindful of potential hallucinations when working with generative AI, and use the analysis only as a tool to assist and speed up code generation.
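
One inexpensive guard against hallucinated output is to check that generated Python at least parses before a person reviews it. The following is a minimal sketch using the standard library `ast` module; `check_generated_python` is a hypothetical helper name, and the check catches only syntax errors, not wrong or nonexistent APIs.

```python
import ast


def check_generated_python(source: str) -> str:
    """Return "ok" if the source parses as Python, else a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error on line {exc.lineno}: {exc.msg}"
    return "ok"


# A snippet that parses, and the import line from the Java-to-Python response above.
good = "cars = ['Volvo', 'BMW']\nfor car in cars:\n    print(car)\n"
bad = "import ArrayList from java.util import ArrayList\n"

good_result = check_generated_python(good)
bad_result = check_generated_python(bad)
```

A check like this can filter obviously broken responses, but code that parses can still call functions that do not exist, so human review and tests remain necessary.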

Clean up

Delete the model endpoints deployed using Amazon SageMaker for Code Llama and Mixtral to avoid incurring any additional costs in your account.

Shut down any SageMaker Notebook instances that were created for deploying or running the examples showcased in this blog post to avoid any notebook instance costs associated with the account.
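
If you prefer to script the cleanup, the following sketch shows one way to do it with the Boto3 SageMaker client, which provides `delete_endpoint` and `stop_notebook_instance`. The helper function names and the resource names in the usage comment are placeholders rather than values from this post; substitute the names from your own account.

```python
from typing import Any, List


def delete_endpoints(sm_client: Any, endpoint_names: List[str]) -> List[str]:
    """Delete each named SageMaker endpoint and return the names that were deleted."""
    deleted = []
    for name in endpoint_names:
        sm_client.delete_endpoint(EndpointName=name)
        deleted.append(name)
    return deleted


def stop_notebooks(sm_client: Any, notebook_names: List[str]) -> List[str]:
    """Stop each named notebook instance and return the names that were stopped."""
    stopped = []
    for name in notebook_names:
        sm_client.stop_notebook_instance(NotebookInstanceName=name)
        stopped.append(name)
    return stopped


# Usage with real resources (placeholder names, not taken from this post):
#   import boto3
#   sm = boto3.client("sagemaker")
#   delete_endpoints(sm, ["my-code-llama-endpoint", "my-mixtral-endpoint"])
#   stop_notebooks(sm, ["my-notebook-instance"])
```

The client is passed in as an argument so the helpers can be exercised with a stub before being pointed at a real account.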

Conclusion

The combination of exceptional capabilities from foundation models like Code Llama 70B and Mixtral 8x7B and the powerful machine learning platform of SageMaker presents a unique opportunity for developers and data scientists to revolutionize their coding workflows. The cutting-edge capabilities of FMs empower customers to generate high-quality code, infill missing sections, and engage in natural language interactions, all while using the scalability, security, and compliance of AWS.

The examples highlighted in this blog post demonstrate these models’ advanced capabilities in generating complex code for various machine learning tasks, such as natural language processing, reinforcement learning, distributed training, and hyperparameter tuning, all tailored for deployment on SageMaker. Developers and data scientists can now streamline their workflows, accelerate development cycles, and unlock new levels of productivity in the AWS Cloud.

Embrace the future of AI-assisted coding and unlock new levels of productivity with Code Llama 70B and Mixtral 8x7B on Amazon SageMaker. Start your journey today and experience the transformative power of these groundbreaking language models.

About the Authors

Shikhar Kwatra is an AI/ML Solutions Architect at Amazon Web Services based in California. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partners in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Jose Navarro is an AI/ML Solutions Architect at AWS based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production. In his spare time, he loves to exercise, spend quality time with friends and family, and catch up on AI news and papers.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Apple Machine Learning Research

Introducing Apple’s On-Device and Server Foundation Models

Apple Machine Learning Research

A quick guide to Amazon’s 30+ papers at NAACL 2024

Although work involving large language models predominates, classical and more-general techniques remain well represented.

Why Accelerated Data Processing Is Crucial for AI Innovation in Every Industry

Across industries, AI is supercharging innovation with machine-powered computation. In finance, bankers are using AI to detect fraud more quickly and keep accounts safe, telecommunications providers are improving networks to deliver superior service, scientists are developing novel treatments for rare diseases, utility companies are building cleaner, more reliable energy grids and automotive companies are making self-driving cars safer and more accessible.

The backbone of top AI use cases is data. Effective and precise AI models require training on extensive datasets. Enterprises seeking to harness the power of AI must establish a data pipeline that involves extracting data from diverse sources, transforming it into a consistent format and storing it efficiently.

Data scientists work to refine datasets through multiple experiments to fine-tune AI models for optimal performance in real-world applications. These applications, from voice assistants to personalized recommendation systems, require rapid processing of large data volumes to deliver real-time performance.

As AI models become more complex and begin to handle diverse data types such as text, audio, images, and video, the need for rapid data processing becomes more critical. Organizations that continue to rely on legacy CPU-based computing are struggling with hampered innovation and performance due to data bottlenecks, escalating data center costs, and insufficient computing capabilities.

Many businesses are turning to accelerated computing to integrate AI into their operations. This method leverages GPUs, specialized hardware, software, and parallel computing techniques to boost computing performance by as much as 150x and increase energy efficiency by up to 42x.

Leading companies across different sectors are using accelerated data processing to spearhead groundbreaking AI initiatives.

Finance Organizations Detect Fraud in a Fraction of a Second

Financial organizations face a significant challenge in detecting patterns of fraud due to the vast amount of transactional data that requires rapid analysis. Additionally, the scarcity of labeled data for actual instances of fraud poses a difficulty in training AI models. Conventional data science pipelines lack the required acceleration to handle the large data volumes associated with fraud detection. This leads to slower processing times that hinder real-time data analysis and fraud detection capabilities.

To overcome these challenges, American Express, which handles more than 8 billion transactions per year, uses accelerated computing to train and deploy long short-term memory (LSTM) models. These models excel in sequential analysis and detection of anomalies, and can adapt and learn from new data, making them ideal for combating fraud.

Leveraging parallel computing techniques on GPUs, American Express significantly speeds up the training of its LSTM models. GPUs also enable live models to process huge volumes of transactional data to make high-performance computations to detect fraud in real time.

The system operates within two milliseconds of latency to better protect customers and merchants, delivering a 50x improvement over a CPU-based configuration. By combining the accelerated LSTM deep neural network with its existing methods, American Express has improved fraud detection accuracy by up to 6% in specific segments.

Financial companies can also use accelerated computing to reduce data processing costs. Running data-heavy Spark3 workloads on NVIDIA GPUs, PayPal confirmed the potential to reduce cloud costs by up to 70% for big data processing and AI applications.

By processing data more efficiently, financial institutions can detect fraud in real time, enabling faster decision-making without disrupting transaction flow and minimizing the risk of financial loss.

Telcos Simplify Complex Routing Operations

Telecommunications providers generate immense amounts of data from various sources, including network devices, customer interactions, billing systems, and network performance and maintenance.

Managing national networks that handle hundreds of petabytes of data every day requires complex technician routing to ensure service delivery. To optimize technician dispatch, advanced routing engines perform trillions of computations, taking into account factors like weather, technician skills, customer requests and fleet distribution. Success in these operations depends on meticulous data preparation and sufficient computing power.

AT&T, which operates one of the nation’s largest field dispatch teams to service its customers, is enhancing data-heavy routing operations with NVIDIA cuOpt, which relies on heuristics, metaheuristics and optimizations to calculate complex vehicle routing problems.

In early trials, cuOpt delivered routing solutions in 10 seconds, achieving a 90% reduction in cloud costs and enabling technicians to complete more service calls daily. NVIDIA RAPIDS, a suite of software libraries that enables acceleration of data science and analytics pipelines, further accelerates cuOpt, allowing companies to integrate local search heuristics and metaheuristics like Tabu search for continuous route optimization.

AT&T is adopting NVIDIA RAPIDS Accelerator for Apache Spark to enhance the performance of Spark-based AI and data pipelines. This has helped the company boost operational efficiency on everything from training AI models to maintaining network quality to reducing customer churn and improving fraud detection. With RAPIDS Accelerator, AT&T is reducing its cloud computing spend for target workloads while enabling faster performance and reducing its carbon footprint.

Accelerated data pipelines and processing will be critical as telcos seek to improve operational efficiency while delivering the highest possible service quality.

Biomedical Researchers Condense Drug Discovery Timelines

As researchers utilize technology to study the roughly 25,000 genes in the human genome to understand their relationship with diseases, there has been an explosion of medical data and peer-reviewed research papers. Biomedical researchers rely on these papers to narrow down the field of study for novel treatments. However, conducting literature reviews of such a massive and expanding body of relevant research has become an impossible task.

AstraZeneca, a leading pharmaceutical company, developed a Biological Insights Knowledge Graph (BIKG) to aid scientists across the drug discovery process, from literature reviews to screen hit rating, target identification and more. This graph integrates public and internal databases with information from scientific literature, modeling between 10 million and 1 billion complex biological relationships.

BIKG has been effectively used for gene ranking, aiding scientists in hypothesizing high-potential targets for novel disease treatments. At NVIDIA GTC, the AstraZeneca team presented a project that successfully identified genes linked to resistance in lung cancer treatments.

To narrow down potential genes, data scientists and biological researchers collaborated to define the criteria and gene features ideal for targeting in treatment development. They trained a machine learning algorithm to search the BIKG databases for genes with the designated features mentioned in literature as treatable. Utilizing NVIDIA RAPIDS for faster computations, the team reduced the initial gene pool from 3,000 to just 40 target genes, a task that previously took months but now takes mere seconds.

By supplementing drug development with accelerated computing and AI, pharmaceutical companies and researchers can finally use the enormous troves of data building up in the medical field to develop novel drugs faster and more safely, ultimately having a life-saving impact.

Utility Companies Build the Future of Clean Energy

There’s been a significant push to shift to carbon-neutral energy sources in the energy sector. With the cost of harnessing renewable resources such as solar energy falling drastically over the last 10 years, the opportunity to make real progress toward a clean energy future has never been greater.

However, this shift toward integrating clean energy from wind farms, solar farms and home batteries has introduced new complexities in grid management. As energy infrastructure diversifies and two-way power flows must be accommodated, managing the grid has become more data-intensive. New smart grids are now required to handle high-voltage areas for vehicle charging. They must also manage the availability of distributed stored energy sources and adapt to variations in usage across the network.

Utilidata, a prominent grid-edge software company, has collaborated with NVIDIA to develop a distributed AI platform, Karman, for the grid edge using a custom NVIDIA Jetson Orin edge AI module. This custom chip and platform, embedded in electricity meters, transforms each meter into a data collection and control point, capable of handling thousands of data points per second.

Karman processes real-time, high-resolution data from meters at the network’s edge. This enables utility companies to gain detailed insights into grid conditions, predict usage and seamlessly integrate distributed energy resources in seconds, rather than minutes or hours. Additionally, with inference models on edge devices, network operators can anticipate and quickly identify line faults to predict potential outages and conduct preventative maintenance to increase grid reliability.

Through the integration of AI and accelerated data analytics, Karman helps utility providers transform existing infrastructure into efficient smart grids. This allows for tailored, localized electricity distribution to meet fluctuating demand patterns without extensive physical infrastructure upgrades, facilitating a more cost-effective modernization of the grid.

Automakers Enable Safer, More Accessible, Self-Driving Vehicles

As auto companies strive for full self-driving capabilities, vehicles must be able to detect objects and navigate in real time. This requires high-speed data processing tasks, including feeding live data from cameras, lidar, radar and GPS into AI models that make navigation decisions to keep roads safe.

The autonomous driving inference workflow is complex and includes multiple AI models along with necessary preprocessing and postprocessing steps. Traditionally, these steps were handled on the client side using CPUs. However, this can lead to significant bottlenecks in processing speeds, which is an unacceptable drawback for an application where fast processing equates to safety.

To enhance the efficiency of autonomous driving workflows, electric vehicle manufacturer NIO integrated NVIDIA Triton Inference Server into its inference pipeline. NVIDIA Triton is open-source, multi-framework, inference-serving software. By centralizing data processing tasks, NIO reduced latency by 6x in some core areas and increased overall data throughput by up to 5x.

NIO’s GPU-centric approach made it easier to update and deploy new AI models without the need to change anything on the vehicles themselves. Additionally, the company could use multiple AI models at the same time on the same set of images without having to send data back and forth over a network, which saved on data transfer costs and improved performance.

By using accelerated data processing, autonomous vehicle software developers ensure they can reach a high-performance standard to avoid traffic accidents, lower transportation costs and improve mobility for users.

Retailers Improve Demand Forecasting

In the fast-paced retail environment, the ability to process and analyze data quickly is critical to adjusting inventory levels, personalizing customer interactions and optimizing pricing strategies on the fly. The larger a retailer is and the more products it carries, the more complex and compute-intensive its data operations will be.

Walmart, the largest retailer in the world, turned to accelerated computing to significantly improve forecasting accuracy for 500 million item-by-store combinations across 4,500 stores.

As Walmart’s data science team built more robust machine learning algorithms to take on this mammoth forecasting challenge, the existing computing environment began to falter, with jobs failing to complete or generating inaccurate results. The company found that data scientists were having to remove features from algorithms just so they would run to completion.

To improve its forecasting operations, Walmart started using NVIDIA GPUs and RAPIDS. The company now uses a forecasting model with 350 data features to predict sales across all product categories. These features encompass sales data, promotional events, and external factors like weather conditions and major events like the Super Bowl, which influence demand.

Advanced models helped Walmart improve forecast accuracy from 94% to 97% while eliminating an estimated $100 million in fresh produce waste and reducing stockout and markdown scenarios. GPUs also ran models 100x faster, with jobs completing in just four hours, an operation that would have taken several weeks in a CPU environment.

By shifting data-intensive operations to GPUs and accelerated computing, retailers can lower both their cost and their carbon footprint while delivering best-fit choices and lower prices to shoppers.

Public Sector Improves Disaster Preparedness

Drones and satellites capture huge amounts of aerial image data that public and private organizations use to predict weather patterns, track animal migrations and observe environmental changes. This data is invaluable for research and planning, enabling more informed decision-making in fields like agriculture, disaster management and efforts to combat climate change. However, the value of this imagery can be limited if it lacks specific location metadata.

A federal agency working with NVIDIA needed a way to automatically pinpoint the location of images missing geospatial metadata, which is essential for missions such as search and rescue, responding to natural disasters and monitoring the environment. However, identifying a small area within a larger region using an aerial image without metadata is extremely challenging, akin to locating a needle in a haystack. Algorithms designed to help with geolocation must address variations in image lighting and differences due to images being taken at various times, dates and angles.

To identify non-geotagged aerial images, NVIDIA, Booz Allen and the government agency collaborated on a solution that uses computer vision algorithms to extract information from image pixel data to scale the image similarity search problem.

When attempting to solve this problem, an NVIDIA solutions architect first used a Python-based application. Initially running on CPUs, processing took more than 24 hours. GPUs supercharged this to just minutes, performing thousands of data operations in parallel versus only a handful of operations on a CPU. By shifting the application code to CuPy, an open-sourced GPU-accelerated library, the application experienced a remarkable 1.8-million-x speedup, returning results in 67 microseconds.

With a solution that can process images and the data of large land masses in just minutes, organizations can gain access to the critical information needed to respond more quickly and effectively to emergencies and plan proactively, potentially saving lives and safeguarding the environment.

Accelerate AI Initiatives and Deliver Business Results

Companies using accelerated computing for data processing are advancing AI initiatives and positioning themselves to innovate and perform at higher levels than their peers.

Accelerated computing handles larger datasets more efficiently, enables faster model training and selection of optimal algorithms, and facilitates more precise results for live AI solutions.

Enterprises that use it can achieve superior price-performance ratios compared to traditional CPU-based systems and enhance their ability to deliver outstanding results and experiences to customers, employees and partners.

Learn how accelerated computing helps organizations achieve AI objectives and drive innovation.

Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart

Today, we are excited to announce that the Jina Embeddings v2 model, developed by Jina AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running model inference. This state-of-the-art model supports an impressive 8,192-token context length. You can deploy this model with SageMaker JumpStart, a machine learning (ML) hub with foundation models, built-in algorithms, and pre-built ML solutions that you can deploy with just a few clicks.

Text embedding refers to the process of transforming text into numerical representations that reside in a high-dimensional vector space. Text embeddings have a broad range of applications in enterprise artificial intelligence (AI), including the following:

Multimodal search for ecommerce
Content personalization
Recommender systems
Data analytics
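
Once text has been mapped into a vector space, comparing two texts reduces to comparing two vectors, most commonly with cosine similarity. The following sketch illustrates the idea with hand-written three-dimensional vectors; they are toy values chosen for illustration, not output from a Jina Embeddings model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
query = [1.0, 0.0, 1.0]
similar_doc = [2.0, 0.0, 2.0]
unrelated_doc = [0.0, 1.0, 0.0]

score_similar = cosine_similarity(query, similar_doc)
score_unrelated = cosine_similarity(query, unrelated_doc)
```

The similar document points in the same direction as the query and scores close to 1, while the unrelated document is orthogonal and scores 0. Vector length does not affect the score, only direction.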

Jina Embeddings v2 is a state-of-the-art collection of text embedding models, trained by Berlin-based Jina AI, that boast high performance on several public benchmarks.

In this post, we walk through how to discover and deploy the jina-embeddings-v2 model as part of a Retrieval Augmented Generation (RAG)-based question answering system in SageMaker JumpStart. You can use this tutorial as a starting point for a variety of chatbot-based solutions for customer service, internal support, and question answering systems based on internal and private documents.

What is RAG?

RAG is the process of optimizing the output of a large language model (LLM) so it references an authoritative knowledge base outside of its training data sources before generating a response.

LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

What does Jina Embeddings v2 bring to RAG applications?

A RAG system uses a vector database to serve as a knowledge retriever. It must extract a query from a user’s prompt and send it to a vector database to reliably find as much semantic information as possible. The following diagram illustrates the architecture of a RAG application with Jina AI and Amazon SageMaker.

Jina Embeddings v2 is the preferred choice for experienced ML scientists for the following reasons:

State-of-the-art performance – We have shown on various text embedding benchmarks that Jina Embeddings v2 models excel on tasks such as classification, reranking, summarization, and retrieval. Some of the benchmarks demonstrating their performance are MTEB, an independent study of combining embedding models with reranking models, and the LoCo benchmark by a Stanford University group.
Long input-context length – Jina Embeddings v2 models support 8,192 input tokens. This makes the models especially powerful at tasks such as clustering for long documents like legal text or product documentation.
Support for bilingual text input – Recent research shows that multilingual models without specific language training show strong biases towards English grammatical structures in embeddings. Jina AI’s bilingual embedding models include jina-embeddings-v2-base-de, jina-embeddings-v2-base-zh, jina-embeddings-v2-base-es, and jina-embeddings-v2-base-code. They were trained to encode texts in a combination of English-German, English-Chinese, English-Spanish, and English-Code, respectively, allowing the use of either language as the query or target document in retrieval applications.
Cost-effectiveness of operating – Jina Embeddings v2 provides high performance on information retrieval tasks with relatively small models and compact embedding vectors. For example, jina-embeddings-v2-base-de has a size of 322 MB with a performance score of 60.1%. Smaller vectors also mean substantial cost savings when storing them in a vector database.

What is SageMaker JumpStart?

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. Developers can deploy foundation models to dedicated SageMaker instances within a network-isolated environment, and customize models using SageMaker for model training and deployment.

You can now discover and deploy a Jina Embeddings v2 model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines and Amazon SageMaker Debugger. With SageMaker JumpStart, the model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.

Jina Embeddings models are available in AWS Marketplace so you can integrate them directly into your deployments when working in SageMaker.

AWS Marketplace enables you to find third-party software, data, and services that run on AWS and manage them from a centralized location. AWS Marketplace includes thousands of software listings and simplifies software licensing and procurement with flexible pricing options and multiple deployment methods.

Solution overview

We’ve prepared a notebook that constructs and runs a RAG question answering system using Jina Embeddings and the Mistral 7B-Instruct LLM in SageMaker JumpStart.

In the following sections, we give you an overview of the main steps needed to bring a RAG application to life using generative AI models on SageMaker JumpStart. Although we omit some of the boilerplate code and installation steps in this post for reasons of readability, you can access the full Python notebook to run yourself.

Connecting to a Jina Embeddings v2 endpoint

To start using Jina Embeddings v2 models, complete the following steps:

In SageMaker Studio, choose JumpStart in the navigation pane.
Search for “jina” and you will see the provider page link and models available from Jina AI.
Choose Jina Embeddings v2 Base – en, which is Jina AI’s English language embeddings model.
Choose Deploy.
In the dialog that appears, choose Subscribe, which will redirect you to the model’s AWS Marketplace listing, where you can subscribe to the model after accepting the terms of usage.
After subscribing, return to SageMaker Studio and choose Deploy.
You will be redirected to the endpoint configuration page, where you can select the instance most suitable for your use case and provide a name for the endpoint.
Choose Deploy.

After you create the endpoint, you can connect to it with the following code snippet:

from jina_sagemaker import Client
 
client = Client(region_name=region)
# Use the same name that you gave the JumpStart endpoint in the previous step (here, my-jina-embeddings-endpoint).
endpoint_name = "my-jina-embeddings-endpoint"
 
client.connect_to_endpoint(endpoint_name=endpoint_name)

Preparing a dataset for indexing

In this post, we use a public dataset from Kaggle (CC0: Public Domain) that contains audio transcripts from the popular YouTube channel Kurzgesagt – In a Nutshell, which has over 20 million subscribers.

Each row in this dataset contains the title of a video, its URL, and the corresponding text transcript.

Enter the following code:

df.head()

Because the transcripts of these videos can be quite long (each video runs around 10 minutes), you can chunk each transcript before indexing it. That way, retrieval returns only the content relevant to a user’s question and not unrelated parts of the transcript:

def chunk_text(text, max_words=1024):
    """
    Divide text into chunks where each chunk contains the maximum number of full sentences under `max_words`.
    """
    sentences = text.split('.')
    chunk = []
    word_count = 0

    for sentence in sentences:
        # Drop surrounding whitespace and skip empty fragments (for example, after a trailing period)
        sentence = sentence.strip()
        if not sentence:
            continue

        words_in_sentence = len(sentence.split())
        if word_count + words_in_sentence <= max_words:
            chunk.append(sentence)
            word_count += words_in_sentence
        else:
            # Yield the current chunk and start a new one
            if chunk:
                yield '. '.join(chunk) + '.'
            chunk = [sentence]
            word_count = words_in_sentence

    # Yield the last chunk if it's not empty, joined the same way as the other chunks
    if chunk:
        yield '. '.join(chunk) + '.'

The parameter max_words defines the maximum number of full words that can be in a chunk of indexed text. Many chunking strategies exist in academic and non-peer-reviewed literature that are more sophisticated than a simple word limit. However, for the purpose of simplicity, we use this technique in this post.

Index text embeddings for vector search

After you chunk the transcript text, you obtain embeddings for each chunk and link each chunk back to the original transcript and video title:

import numpy as np
from tqdm import tqdm

# Register progress_apply on pandas objects
tqdm.pandas()

def generate_embeddings(text_df):
    """
    Generate an embedding for each chunk created in the previous step.
    """

    chunks = list(chunk_text(text_df['Text']))
    embeddings = []

    for chunk in chunks:
        # Embed one chunk per request and keep the returned vector
        response = client.embed(texts=[chunk])
        chunk_embedding = response[0]['embedding']
        embeddings.append(np.array(chunk_embedding))

    text_df['chunks'] = chunks
    text_df['embeddings'] = embeddings
    return text_df

print("Embedding text chunks ...")
df = df.progress_apply(generate_embeddings, axis=1)

The dataframe df contains a column titled embeddings that can be put into any vector database of your choice. Embeddings can then be retrieved from the vector database using a function such as find_most_similar_transcript_segment(query, n), which will retrieve the n closest documents to the given input query by a user.
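The post does not show find_most_similar_transcript_segment, so the following is only a minimal in-memory sketch of what such a function could look like, assuming a brute-force cosine similarity search instead of a vector database. Unlike the signature mentioned above, this hypothetical version takes an already computed query embedding plus the per-row chunk and embedding lists, so it does not depend on a running endpoint. It returns (chunk, row position) pairs, which matches how the search results are indexed later in this post.

```python
import numpy as np

def find_most_similar_transcript_segment(query_embedding, chunks_per_row, embeddings_per_row, n=3):
    """Return the n (chunk_text, row_position) pairs closest to the query by cosine similarity.

    Hypothetical helper: chunks_per_row and embeddings_per_row stand in for the
    'chunks' and 'embeddings' columns of the dataframe built in the previous step.
    """
    q = np.asarray(query_embedding, dtype=np.float32)
    q = q / np.linalg.norm(q)

    scored = []
    for row_pos, (chunks, embeddings) in enumerate(zip(chunks_per_row, embeddings_per_row)):
        for chunk, emb in zip(chunks, embeddings):
            e = np.asarray(emb, dtype=np.float32)
            score = float(q @ e / np.linalg.norm(e))  # cosine similarity
            scored.append((score, chunk, row_pos))

    # Highest similarity first
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk, row_pos) for _, chunk, row_pos in scored[:n]]

# Tiny illustration with made-up 2-dimensional embeddings
demo_chunks = [["a", "b"], ["c"]]
demo_embeddings = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
demo_results = find_most_similar_transcript_segment([1.0, 0.1], demo_chunks, demo_embeddings, n=2)
```

With the dataframe from the previous step, you would pass df['chunks'] and df['embeddings'], and obtain the query embedding from the same endpoint that embedded the chunks. A linear scan like this is only reasonable for small collections; a vector database replaces the loop with an index lookup.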

Prompt a generative LLM endpoint

For question answering based on an LLM, you can use the Mistral 7B-Instruct model on SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel
from string import Template

# Define the LLM to be used and deploy it through JumpStart.
jumpstart_model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct", role=role)
model_predictor = jumpstart_model.deploy()

# Define the prompt template to be passed to the LLM
prompt_template = Template("""
  <s>[INST] Answer the question below only using the given context.
  The question from the user is based on transcripts of videos from a YouTube
    channel.
  The context is presented as a ranked list of information in the form of
    (video-title, transcript-segment), that is relevant for answering the
    user's question.
  The answer should only use the presented context. If the question cannot be
    answered based on the context, say so.
 
  Context:
  1. Video-title: $title_1, transcript-segment: $segment_1
  2. Video-title: $title_2, transcript-segment: $segment_2
  3. Video-title: $title_3, transcript-segment: $segment_3
 
  Question: $question
 
  Answer: [/INST]
""")

Query the LLM

Now, for a query sent by a user, you first find the semantically closest n chunks of transcripts from any video of Kurzgesagt (using vector distance between embeddings of chunks and the users’ query), and provide those chunks as context to the LLM for answering the users’ query:

# Define the query and insert it into the prompt template together with the context to be used to answer the question
question = "Can climate change be reversed by individuals' actions?"
search_results = find_most_similar_transcript_segment(question)
 
prompt_for_llm = prompt_template.substitute(
    question = question,
    title_1 = df.iloc[search_results[0][1]]["Title"].strip(),
    segment_1 = search_results[0][0],
    title_2 = df.iloc[search_results[1][1]]["Title"].strip(),
    segment_2 = search_results[1][0],
    title_3 = df.iloc[search_results[2][1]]["Title"].strip(),
    segment_3 = search_results[2][0]
)

# Generate the answer to the question passed in the prompt
payload = {"inputs": prompt_for_llm}
model_predictor.predict(payload)

Based on the preceding question, the LLM might respond with an answer such as the following:

Based on the provided context, it does not seem that individuals can solve climate change solely through their personal actions. While personal actions such as using renewable energy sources and reducing consumption can contribute to mitigating climate change, the context suggests that larger systemic changes are necessary to address the issue fully.

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

model_predictor.delete_model()
model_predictor.delete_endpoint()

Conclusion

By taking advantage of the features of Jina Embeddings v2 to develop RAG applications, together with the streamlined access to state-of-the-art models on SageMaker JumpStart, developers and businesses are now empowered to create sophisticated AI solutions with ease.

Jina Embeddings v2’s extended context length, support for bilingual documents, and small model size enable enterprises to quickly build natural language processing use cases based on their internal datasets without relying on external APIs.

Get started with SageMaker JumpStart today, and refer to the GitHub repository for the complete code to run this sample.

Connect with Jina AI

Jina AI remains committed to leadership in bringing affordable and accessible AI embeddings technology to the world. Our state-of-the-art text embedding models support English and Chinese and soon will support German, with other languages to follow.

For more information about Jina AI’s offerings, check out the Jina AI website or join our community on Discord.

About the Authors

Francesco Kruk is a Product Management intern at Jina AI and is completing his Master’s at ETH Zurich in Management, Technology, and Economics. With a strong business background and his knowledge in machine learning, Francesco helps customers implement RAG solutions using Jina Embeddings in an impactful way.

Saahil Ognawala is Head of Product at Jina AI based in Munich, Germany. He leads the development of search foundation models and collaborates with clients worldwide to enable quick and efficient deployment of state-of-the-art generative AI products. With an academic background in machine learning, Saahil is now interested in scaled applications of generative AI in the knowledge economy.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Here Comes a New Challenger: ‘Street Fighter 6’ Joins GeForce NOW

Capcom’s latest entry in the iconic Street Fighter series, Street Fighter 6, punches its way into the cloud this GFN Thursday. The game, along with Ubisoft’s XDefiant, leads six new games joining the GeForce NOW library.

A new reward makes its way to the cloud gaming service’s Ultimate and Priority members. For a limited time, GeForce NOW members who are new to Xbox PC Game Pass can get three months of Microsoft’s subscription service free, just by opting into the GeForce NOW Rewards program.

Plus, make sure to follow @NVIDIAGFN on X to see picturesque in-game locations from where members are sending their #GreetingsfromGFN.

Get Ready to Rumble

Unleash the ultimate Hadoken with Street Fighter 6 on GeForce NOW. The renowned 2D fighting game returns with intense battles, special moves, combos and Super Art attacks to defeat opponents. With a roster of 22 iconic fighters, including classic World Warriors like Ryu, Chun-Li, Guile and Akuma, plus all-new characters like Kimberly, Jamie, Marisa and Manon, there’s no better time to hit the streets.

The newest installment introduces innovative features and enhanced visuals across three distinct game modes — Fighting Ground, World Tour and Battle Hub — for gamers to level up and put their skills to the test. The game’s blend of classic mechanics and fresh enhancements is captivating longtime fans and newcomers alike.

Become a World Warrior in the cloud with a GeForce NOW Ultimate membership and stream all the fighting glory at up to stunning 4K resolution. Witness every punch, kick and Hadoken with others by hopping online for some head-to-head competition.

Defying Gravity

XDefiant on GeForce NOW — *Discover which faction will reign supreme in “XDefiant.”*

XDefiant, a free-to-play first-person shooter, combines intense gunplay with strategic team dynamics. Set in a world where factions inspired by iconic Ubisoft franchises clash, the game enables players to customize their loadouts and engage in fast-paced battles. Choose stealthy tactics or all-out aggression for a diverse and thrilling multiplayer experience.

Prepare for adrenaline-fueled firefights and tactical showdowns at up to 240 frames per second with an Ultimate membership. Every frame counts in the fight against other factions.

Get in the Pass Lane

PC Game Pass member reward on GeForce NOW — *It’s rewarding to be a GeForce NOW member.*

Get ready for a summer of gaming. GeForce NOW Ultimate and Priority members new to PC Game Pass and part of the GeForce NOW Rewards program can now receive three free months of Microsoft’s service.

With PC Game Pass and GeForce NOW, members can play high-quality Xbox PC titles with the power of an NVIDIA GeForce RTX server in the cloud. Jump into the action in iconic franchises like Starfield, Forza Motorsport and Remnant II with support for more titles added every GFN Thursday.

This special offer is available for a limited time, and only for GeForce NOW members new to PC Game Pass.

Mischief Managed

Sneak out on GeForce NOW — *Hide and seek on an epic scale.*

Get into all kinds of mischief and fun in Sneak Out from Kinguin Studios. Enter the Haunted Castle and prepare to hunt, hide or prank, causing all kinds of hilarious mayhem while trying to win a deadly game of hide and seek.

Check out the list of new games this week:

Killer Klowns from Outer Space: The Game (New release on Steam, June 4)
Autopsy Simulator (New release on Steam, June 6)
Chornobyl Liquidators (New release on Steam, June 6)
Sneak Out (New release on Steam, June 6)
Farm Together 2 (Steam)
Street Fighter 6 (Steam)
XDefiant (Ubisoft)

What are you planning to play this weekend? Let us know on X or in the comments below.

NotebookLM goes global with Slides support and better ways to fact-check

NotebookLM, our AI-powered research and writing assistant, is getting new features and expanding to over 200 countries and territories.

INT4 Decoding GQA CUDA Optimizations for LLM Inference

An efficient decoding Grouped-Query Attention with low-precision KV cache

Introduction

Generative AI has taken the world by storm with its ability to generate content like humans. Many of these generative AI tools are powered by large language models (LLMs), like Meta Llama models and OpenAI’s ChatGPT. One of the main challenges of LLMs is supporting large “context lengths” (also known as “sequence lengths”). The context length refers to the number of tokens that the model uses to understand the input context and generate responses. Longer context lengths generally translate into higher precision and quality in the responses. However, long context lengths are compute and memory intensive. This is mainly due to the following reasons:

The computational complexity of attention layers increases proportionally with the context length (the growth rate depends on the attention algorithm). As a result, when using long context lengths, the attention layers can become a bottleneck, particularly during the prefill phase where attentions are compute bound.
The KV cache size grows linearly with the context length, which puts more pressure on memory capacity and further slows down the already memory-bound attention decoding. Moreover, since memory capacity is limited, the batch size must shrink as the KV cache grows, which generally results in a drop in throughput.
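To make the linear growth concrete, here is a small back-of-the-envelope calculation. The configuration values (8 KV heads, 32 layers, batch size 32) are hypothetical and chosen only for illustration, and the INT4 figure ignores the small overhead of storing quantization scales and offsets.

```python
def kv_cache_bytes(batch_size, context_len, num_kv_heads, head_dim, bytes_per_element, num_layers):
    """Bytes needed to hold the K and V caches for every layer."""
    per_tensor = batch_size * context_len * num_kv_heads * head_dim * bytes_per_element
    return int(2 * per_tensor * num_layers)  # factor 2: one K tensor and one V tensor

GIB = 2**30

# Hypothetical model and serving configuration (not taken from the post)
config = dict(batch_size=32, context_len=8192, num_kv_heads=8, head_dim=128, num_layers=32)

bf16 = kv_cache_bytes(bytes_per_element=2, **config)    # BF16: 2 bytes per element
int4 = kv_cache_bytes(bytes_per_element=0.5, **config)  # INT4: half a byte per element

print(f"BF16: {bf16 / GIB:.0f} GiB, INT4: {int4 / GIB:.0f} GiB")
```

Under these assumptions the BF16 cache needs 32 GiB and the INT4 cache 8 GiB, and doubling the context length doubles both.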

The computational complexity growth is difficult to solve compared to the other problem mentioned above. One way to address the KV cache size growth problem is to use a low-precision KV cache. From our experiments, group-wise INT4 quantization provides comparable results in terms of accuracy compared to BF16 KV cache during the decode phase in Meta Llama 2 inference. However, we did not observe any latency improvement, despite reading 4x less data in the attention decoding layers. This means that the INT4 attention is 4x less efficient at utilizing precious HBM bandwidth than BF16 attention.

In this note, we discuss the CUDA optimizations that we applied to INT4 GQA (grouped-query attention – the attention layer that we use in the LLM inference phase) to improve its performance by up to 1.8x on the NVIDIA A100 GPU and 1.9x on the NVIDIA H100 GPU.

The optimized CUDA INT4 GQA outperformed INT4 Flash-Decoding GQA (the best performing INT4 GQA that we used in the experiment mentioned above) by 1.4x-1.7x on A100 and 1.09x-1.3x on H100.
The optimized CUDA INT4 GQA performs better than BF16 Flash-Decoding GQA by 1.5x-1.7x on A100 and 1.4x-1.7x on H100.

Background

GQA for LLM Inference

Grouped-Query Attention (GQA) is a variant of multi-head attention (MHA) where each KV cache head is shared across a group of query heads. Our LLM inference adopts GQA as an attention layer in both the prefill and decode phases in order to reduce the capacity requirement for the KV cache. We use multiple GPUs in inference where the KV cache and query heads are distributed across GPUs. Each GPU runs an attention layer with a single KV head and a group of Q heads. Therefore, when viewed from a single GPU perspective, the GQA component can also be described as MQA (Multi-Query Attention).

The simplified workflow of decoding GQA is illustrated in Figure 1. GQA takes three main inputs: input query (denoted Q), K cache (denoted K), and V cache (denoted V). Our current GQA inference uses BF16 for Q, K, and V.

Q is a 4D BF16 tensor of shape (B, 1, H_Q, D)
K is a 4D BF16 tensor of shape (B, T_max, H_KV, D)
V is a 4D BF16 tensor of shape (B, T_max, H_KV, D)

where

B is the batch size (the number of input prompts)
H_Q is the number of query heads
H_KV is the number of KV heads (H_Q must be divisible by H_KV)
T_max is the maximum context length
D is the head dimension (fixed to 128)

GQA is simply bmm(softmax(bmm(Q, K^T) / sqrt(D)), V). This yields a single output tensor (denoted as O), which is a 4D BF16 tensor that has the same shape as Q. Note that matrix multiplications are performed using BF16; however, accumulation and softmax are carried out in FP32. We call this “BF16 GQA” as the KV cache is BF16.
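As a readability aid, the formula above can be written as a short NumPy reference for the decode step. This is only a sketch: it assumes that consecutive query heads share a KV head, it treats all T_max cached positions as valid (no masking of unused slots), and it uses FP32 throughout because NumPy has no BF16 type.

```python
import math
import numpy as np

def gqa_decode_reference(q, k, v):
    """q: (B, 1, H_Q, D); k, v: (B, T, H_KV, D). Returns O with the same shape as q."""
    B, one, H_Q, D = q.shape
    _, T, H_KV, _ = k.shape
    assert one == 1 and H_Q % H_KV == 0
    G = H_Q // H_KV  # query heads per KV head

    # Group query heads by the KV head they share: (B, H_KV, G, D)
    qg = q.reshape(B, H_KV, G, D).astype(np.float32)
    kf = k.astype(np.float32)
    vf = v.astype(np.float32)

    # scores[b, h, g, t] = <q[b, h, g, :], k[b, t, h, :]> / sqrt(D)
    scores = np.einsum("bhgd,bthd->bhgt", qg, kf) / math.sqrt(D)

    # Numerically stable softmax over the token axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    p = p / p.sum(axis=-1, keepdims=True)

    # Weighted sum of V rows, then restore the (B, 1, H_Q, D) layout
    out = np.einsum("bhgt,bthd->bhgd", p, vf)
    return out.reshape(B, 1, H_Q, D)
```

A quick sanity check is the case T = 1: the softmax weight is 1, so every query head simply returns the V row of its KV head.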

Figure 1 The simplified workflow of BF16 GQA for LLM inference

INT4 GQA

To further reduce the size of the KV cache, we explore the possibility of using INT4 for KV cache instead of BF16. We estimate the potential performance improvement by calculating the computational intensity (CI) of INT4 GQA and comparing it to that of BF16 GQA, as CI represents FLOPS per byte. We compute the CI for QK^T and PV (as shown in Equation 1) as they take KV cache as an operand. Note that we disregard the Q load as it is negligible compared to the KV cache. We also ignore any intermediate data loads/stores that are not on global memory. Thus, the CI only takes into account the computation FLOPS and KV cache loads.

Equation (1)

Assuming that H_Q = 8 and H_KV = 1, CI for BF16 KV cache is 8 while CI for INT4 KV cache is 32. The CIs indicate that both BF16 and INT4 GQAs are memory bound (the peak CIs for BF16 tensor cores for A100 and H100 are 312 TF / 2 TB/s = 141 and 990 TF / 3.35 TB/s = 269; note that these TF numbers are without sparsity). Moreover, with INT4 KV cache, we should expect up to 4x performance improvement compared to BF16 GQA.
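Equation 1 is not reproduced in this text, so the following is a hedged reconstruction of the arithmetic rather than the authors' formula. It assumes 2 FLOPs per multiply-accumulate and counts only KV cache bytes, which gives a CI of 2 * H_Q / (H_KV * bytes per KV element) for both QK^T and PV. The quoted ridge points also appear to match if TB/s is read as 2^40 bytes per second; that reading is an observation from the arithmetic, not something the source states.

```python
def computational_intensity(num_q_heads, num_kv_heads, bytes_per_kv_element):
    """FLOPs per byte of KV cache read, under the assumptions stated above."""
    # Each KV element is used by H_Q / H_KV query heads, one multiply-add (2 FLOPs) each
    flops_per_kv_element = 2 * num_q_heads / num_kv_heads
    return flops_per_kv_element / bytes_per_kv_element

ci_bf16 = computational_intensity(8, 1, 2.0)  # BF16: 2 bytes per element
ci_int4 = computational_intensity(8, 1, 0.5)  # INT4: half a byte per element

# Ridge points (peak FLOPS / peak bandwidth), assuming 1 TB = 2**40 bytes
TIB = 2**40
ridge_a100 = 312e12 / (2 * TIB)
ridge_h100 = 990e12 / (3.35 * TIB)

print(ci_bf16, ci_int4, ridge_a100, ridge_h100)
```

Both CIs (8 and 32) sit far below either ridge point, which is why both kernels are classified as memory bound, and the 4x ratio between them is the source of the expected speedup.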

To enable INT4 KV cache support in GQA, we can dequantize the KV cache from INT4 to BF16 before passing it to the BF16 GQA operator. However, since KV cache is typically large, copying it from/to global memory can be costly. Moreover, decoding GQA is a memory bound operation (the memory unit is utilized much more heavily than the compute unit). Figure 2 shows the NCU profile of the FMHA CUTLASS BF16 GQA kernel in xFormers, which is one of the state of the art implementations of GQA. From the figure, it is obvious that memory is a bottleneck.

Figure 2 The NCU profile of the FMHA CUTLASS BF16 kernel in xFormers

A more efficient alternative is to fuse INT4 dequantization with the GQA operation (shown in Figure 3). In other words, GQA reads the INT4 KV cache directly and performs the INT4 to BF16 conversion within the kernel. This change can potentially reduce the amount of global memory reads required for the KV cache, which could lead to a decrease in latency. We call this “INT4 GQA.”
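To illustrate what the kernel has to undo on the fly, here is a generic row-wise INT4 quantize/dequantize pair in NumPy. It is a sketch under assumptions: it uses a plain asymmetric scheme with one scale and offset per row and packs two 4-bit values per byte in natural order, whereas the actual kernel uses its own permuted layout, which is not specified here.

```python
import numpy as np

def quantize_int4_rowwise(x):
    """Quantize each row (last axis) to 4-bit codes with one scale and offset per row."""
    x = x.astype(np.float32)
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0  # 16 levels: codes 0..15
    codes = np.clip(np.rint((x - lo) / scale), 0, 15).astype(np.uint8)
    # Pack two 4-bit codes into one byte (even index in the low nibble)
    packed = (codes[..., 0::2] | (codes[..., 1::2] << 4)).astype(np.uint8)
    return packed, scale, lo

def dequantize_int4_rowwise(packed, scale, lo):
    """Unpack the nibbles and map codes back to floating point."""
    codes = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.float32)
    codes[..., 0::2] = packed & 0x0F
    codes[..., 1::2] = packed >> 4
    return codes * scale + lo
```

A row of D = 128 values shrinks from 256 bytes in BF16 to 64 bytes of packed codes plus the per-row scale and offset, and the reconstruction error per element is at most half a quantization step.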

Figure 3 The workflow of fused INT4 GQA

We list the state-of-the-art GQA implementations along with their features in Table 1.

Table 1 State of the art GQA implementations

Implementation	Abbreviation	BF16 GQA	Fused INT4 GQA
Flash-Decoding (Triton implementation)	FD	Yes	Yes
Flash Attention (v2.3.3)	FA	Yes	No
CUDA baseline	CU	Yes	Yes

All implementations, except for CU, support both split-K and non split-K. CU only has the split-K implementation. Only FA has a heuristic in the backend to determine whether to run the split-K or non split-K kernel. For other implementations, users must explicitly choose which version to run. In this note, we focus on long context lengths (in our experiments, we use a context length of 8192) and therefore opt for the split-K version wherever possible.

As the baseline, we measured the performance of the state of the art GQA implementations on NVIDIA A100 and H100 GPUs. The latency (time in microseconds) and achieved bandwidth (GB/s) are reported in Table 2. Note that we ran a range of split-Ks (from 2 to 128 splits) and reported the best performance for each implementation. For all experiments, we use a context length of 8192. For INT4 GQA, we used row-wise quantization (i.e., num quantized groups = 1).

Table 2 Baseline GQA performance

On A100

Time (us)	BF16 GQA			INT4 GQA
Batch size	FD	FA	CU	FD	FA	CU
32	139	133	183	137	–	143
64	245	229	335	234	–	257
128	433	555	596	432	–	455
256	826	977	1127	815	–	866
512	1607	1670	2194	1581	–	1659

Effective Bandwidth (GB/s)	BF16 GQA			INT4 GQA
Batch size	FD	FA	CU	FD	FA	CU
32	965	1012	736	262	–	250
64	1097	1175	802	305	–	278
128	1240	968	901	331	–	314
256	1301	1100	954	351	–	331
512	1338	1287	980	362	–	345

On H100

Time (us)	BF16 GQA			INT4 GQA
Batch size	FD	FA	CU	FD	FA	CU
32	91	90	114	70	–	96
64	148	146	200	113	–	162
128	271	298	361	205	–	294
256	515	499	658	389	–	558
512	1000	1011	1260	756	–	1066

Effective Bandwidth (GB/s)	BF16 GQA			INT4 GQA
Batch size	FD	FA	CU	FD	FA	CU
32	1481	1496	1178	511	–	371
64	1815	1840	1345	631	–	443
128	1982	1802	1487	699	–	487
256	2087	2156	1634	736	–	513
512	2150	2127	1706	757	–	537

First, let’s discuss the BF16 GQA performance: CU ranks last in terms of performance among all implementations. FD and FA have comparable performance. When the batch size is less than or equal to 64, FA utilizes the split-K kernel and performs slightly better than FD. However, when the batch size is greater than 64, FD performs better.

The same trend holds true for INT4 GQAs. However, we did not measure the performance of FA as it does not support INT4 KV cache. FD outperforms CU for all cases.

When comparing the latencies of FD between BF16 and INT4 GQAs, we find that they are almost identical. This suggests that INT4 GQA is highly inefficient, which is further confirmed by the significantly lower achieved bandwidth for INT4 GQA compared to BF16 GQA. The same trend is also true when looking at the performance of CU.

CUDA with Tensor Cores INT4 GQA Implementation

In this section, we briefly describe our baseline implementation, which is CUDA with tensor cores INT4 GQA (CU). Each thread block processes only one KV head and a group of query heads from one input prompt. Therefore, each thread block performs mm(softmax(mm(Q, K^T) / sqrt(D)), V); note that this is mm, not bmm. Moreover, since this is a split-K implementation, tokens in the KV cache are split among different thread blocks. Note that each thread block contains 4 warps (each warp contains 32 threads for NVIDIA A100 and H100 GPUs). Work in each thread block is split among warps. Within each warp, we use the WMMA API to compute matrix multiplication on tensor cores. Figure 4 demonstrates the work partitioning in CU.
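The text does not describe how partial results from different splits are combined, so the following NumPy sketch shows one standard way to do it for a single KV head, in the style of Flash-Decoding: each split keeps its local maximum, its sum of exponentials, and its unnormalized output, and a final step rescales and merges them. It is an illustration of the split-K idea under that assumption, not the CU kernel's actual reduction.

```python
import math
import numpy as np

def attention_full(q, k, v):
    """Reference: q is (G, D) for one group of query heads; k and v are (T, D)."""
    s = q @ k.T / math.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def attention_split_k(q, k, v, num_splits):
    """Same result, computed over num_splits token ranges and then merged."""
    partials = []
    for k_part, v_part in zip(np.array_split(k, num_splits), np.array_split(v, num_splits)):
        s = q @ k_part.T / math.sqrt(q.shape[-1])
        m = s.max(axis=-1, keepdims=True)  # local max per query head
        p = np.exp(s - m)
        partials.append((m, p.sum(axis=-1, keepdims=True), p @ v_part))

    # Merge: rescale every partial result to the global max before summing
    m_all = np.max([m for m, _, _ in partials], axis=0)
    denom = sum(l * np.exp(m - m_all) for m, l, _ in partials)
    numer = sum(o * np.exp(m - m_all) for m, _, o in partials)
    return numer / denom
```

Because each split only needs its own slice of K and V, the splits can run in separate thread blocks, at the cost of the extra merge step. The sketch assumes num_splits does not exceed the number of tokens.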

Figure 4 CU work partitioning

Optimizing CUDA with Tensor Cores Kernel of INT4 GQA

In this note, we discuss the optimizations that we have applied to the CUDA with tensor cores implementation of INT4 GQA (CU). The ideal goal is to improve the INT4 GQA performance by 4 times based on the CI analysis in the previous section. Note that the query size is negligible compared to the KV cache size when the context length is long.

In our analysis, we used NVIDIA Nsight Compute (NCU) as the main profiler. Our general bottleneck elimination approach is to minimize the stall cycles. We applied 10 optimizations to INT4 GQA, three of which are specific to NVIDIA A100/H100 GPUs. These optimizations are well-known CUDA optimization techniques which can be generalized to many applications.

It is worth noting that we chose to optimize the CUDA implementation rather than the Flash-Decoding implementation (FD), which is Triton based, because CUDA gives us better control over how the low-level instructions are generated. Many optimization techniques that we apply, such as operating on tensor core fragments directly (Optimizations 7-9), cannot be done through Triton since it does not expose low-level details to developers. However, these optimizations can be integrated into the compiler-based solution to make them available to a broader set of operators, which is indeed a part of our future plan.

Optimization 1: Unroll `K` Loads

Problem Analysis:

The NCU profile shows that during K loading, there are only 2 global loads followed by memory stalls at dequantize_permuted_int4. The memory stalls are long scoreboard stalls, which indicate waits for global memory access. This suggests that the kernel does not issue sufficient memory loads to hide the global load latency. The kernel issues data loading and then waits to consume the data immediately, causing the global load latency to be exposed. The stalls are shown in Figure 5.

Figure 5 K loading before unrolling (the numbers that the arrows point to are stall cycles caused by global memory wait)

Solution:

In the baseline implementation, we use uint32_t to load 8 INT4 K values in a single load and we perform 2 uint32_t loads in each iteration, which is 16 INT4 K values. To allow for a better global load latency hiding, we issue 8 uint32_t loads instead of two before consuming the K values in dequantize_permuted_int4. This allows the compiler to unroll the loads as well as reorder the instructions to hide the global load latency better. Figure 6 shows the NCU profile of K loading after unrolling. Comparing Figure 5 and Figure 6, we effectively reduce the stall cycles by unrolling the K loads.

Figure 6 K loading after unrolling (the numbers that the arrows point to are stall cycles caused by global memory wait)

Results:

Table 3 Performance of Optimization 1 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up	
	FD	CU baseline	CU Opt 1	FD	CU baseline	CU Opt 1	vs FD	vs CU baseline
32	137	143	134	262	250	267	1.02	1.07
64	234	257	237	305	278	302	0.99	1.09
128	432	455	422	331	314	339	1.02	1.08
256	815	866	806	351	331	355	1.01	1.07
512	1581	1659	1550	362	345	369	1.02	1.07

Optimization 2: Improve `P` Type Casting (FP32->BF16)

Problem Analysis:

Since the output of softmax(bmm(Q, K^T) / sqrt(D)) is FP32 (denoted as P in Figure 3), the kernel has to convert P from FP32 to BF16 before feeding it to the next bmm computation. The kernel performs the FP32 to BF16 conversion of P by copying the FP32 data from one location in shared memory to another location in shared memory. This causes stalls during the shared memory access (shown in Figure 7), which might be caused by (1) the shared memory indirection; and (2) shared memory bank conflicts, since each thread accesses a 16-bit element (because of this, two threads can access the same memory bank simultaneously).

Figure 7 P type casting before Optimization 2 (the number that the arrow points to is stall cycles caused by shared memory wait)

Solution:

We use all threads in the thread block to do in-place type conversion. Each thread operates on two consecutive elements in order to avoid the shared memory bank conflict when storing BF16. All threads work on the same head (h) at the same time to guarantee correctness of the conversion. The in-place conversion steps are as follows:

Each thread loads 2 FP32 token elements from the same head from the shared memory into registers
Call __syncthreads() to make sure that every thread finishes reading the data
Each thread converts its data to 2 BF16 token elements and then stores the results to the same shared memory

Some optimizations that we apply to the implementation:

Use vector types (especially nv_bfloat2)
Unroll data loading/storing, i.e., performing multiple loads before calling __syncthreads() and performing multiple stores after __syncthreads()
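
The in-place conversion steps above can be modeled on the host to see why the barrier matters. The following is a minimal Python sketch, not the kernel code: the `bytearray` stands in for shared memory, the helper names are hypothetical, and round-to-nearest-even for finite values is an assumption about the conversion mode. Because the BF16 output of one thread lands on bytes that another thread still needs to read as FP32, every load must complete before any store.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round a finite FP32 value to BF16 (round-to-nearest-even); returns the 16 raw bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def convert_in_place(smem: bytearray, n: int, use_barrier: bool) -> list:
    """Convert n FP32 values at the start of smem to n BF16 values, in place.

    Each simulated thread owns two consecutive elements.
    """
    threads = range(n // 2)
    if use_barrier:
        # Step 1: every thread loads its two FP32 elements into "registers".
        regs = {t: struct.unpack_from("<2f", smem, 8 * t) for t in threads}
        # Step 2: the barrier (__syncthreads() in the kernel) sits here.
        # Step 3: every thread stores its two BF16 elements to the same buffer.
        for t in threads:
            a, b = regs[t]
            struct.pack_into("<2H", smem, 4 * t, fp32_to_bf16_bits(a), fp32_to_bf16_bits(b))
    else:
        # No barrier: one possible interleaving in which later threads store
        # before earlier threads have loaded, clobbering unread FP32 data.
        for t in reversed(threads):
            a, b = struct.unpack_from("<2f", smem, 8 * t)
            struct.pack_into("<2H", smem, 4 * t, fp32_to_bf16_bits(a), fp32_to_bf16_bits(b))
    return list(struct.unpack_from(f"<{n}H", smem, 0))


values = [float(i) for i in range(1, 9)]
expected = [fp32_to_bf16_bits(v) for v in values]
with_barrier = convert_in_place(bytearray(struct.pack("<8f", *values)), 8, True)
without_barrier = convert_in_place(bytearray(struct.pack("<8f", *values)), 8, False)
print(with_barrier == expected, without_barrier == expected)
```

The unrolling described above corresponds to step 1 holding several pairs per thread in registers before the barrier, which is also where the extra register pressure comes from.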

After this optimization, long stalls are not observed during P type casting as shown in Figure 8.

Figure 8 P type casting after Optimization 2 (the numbers that the arrow points to are stall cycles caused by shared memory wait)

Caveats:

Since we unroll data loading/storing by using registers as intermediate storage, the number of registers per thread increases, which reduces occupancy.

Results:

Table 4 Performance of Optimization 2 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 2	FD	Baseline	Opt 2	vs FD	vs CU baseline
32	137	143	126	262	250	285	1.09	1.14
64	234	257	221	305	278	324	1.06	1.16
128	432	455	395	331	314	362	1.09	1.15
256	815	866	749	351	331	382	1.09	1.16
512	1581	1659	1435	362	345	399	1.10	1.16

Optimization 3: Remove Local Memory Usage for max `QK^T` computation

Problem Analysis:

During the softmax computation, the kernel has to compute max QK^T for each head. It uses temporary “thread-local” storage for the per-thread max QK^T results (one float value for each head). Depending on the compiler, the thread-local storage can be allocated in registers (on chip) or in local memory (off chip, i.e., backed by global memory). Unfortunately, in the baseline, the thread-local storage resides in local memory, which is much slower than registers (shown in Figure 9). We suspect that this is because the compiler cannot determine the indices into the thread-local storage at compile time (since the number of heads (H) in the kernel is a runtime variable). Accessing local memory as if it were registers hurts the performance of the kernel.

Figure 9 Local memory access during max QK^T computation

Solution:

We realize that we do not need H (number of heads) floats as temporary storage per thread since each thread can compute max QK^T for only one head instead of all the heads. Thus, we only need one float per thread, which can be easily stored in a register. To accumulate the max results among warps, we use shared memory. This optimization eliminates the local memory usage during max QK^T computation.
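
The idea can be sketched on the host as follows. This is an illustrative Python model, not the kernel: the thread-to-head mapping (`tid % num_heads`), the function name, and the `[warp][head]` shape of the shared buffer are all assumptions made for the example. The point it shows is that each simulated thread keeps exactly one running max, so nothing needs to be indexed by a runtime head number, and the partial results are combined through a small shared buffer.

```python
import math
import random

WARP_SIZE = 32


def max_qk_per_head(qk, num_threads):
    """qk[h][t] is the QK^T score of head h for token t; returns the max per head."""
    num_heads = len(qk)
    num_tokens = len(qk[0])
    assert num_threads % WARP_SIZE == 0 and num_threads % num_heads == 0
    num_warps = num_threads // WARP_SIZE
    threads_per_head = num_threads // num_heads

    # Stand-in for shared memory: one slot per (warp, head).
    smem = [[-math.inf] * num_heads for _ in range(num_warps)]

    for tid in range(num_threads):
        head = tid % num_heads    # hypothetical mapping: one head per thread
        slot = tid // num_heads   # which slice of the tokens this thread scans
        thread_max = -math.inf    # the single per-thread float (fits in a register)
        for t in range(slot, num_tokens, threads_per_head):
            thread_max = max(thread_max, qk[head][t])
        # Stand-in for combining the lanes of a warp that share a head.
        warp = tid // WARP_SIZE
        smem[warp][head] = max(smem[warp][head], thread_max)

    # Final pass: combine the per-warp partial results for each head.
    return [max(smem[w][h] for w in range(num_warps)) for h in range(num_heads)]


random.seed(0)
scores = [[random.uniform(-5.0, 5.0) for _ in range(100)] for _ in range(8)]
print(max_qk_per_head(scores, num_threads=128) == [max(row) for row in scores])
```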

Results:

Table 5 Performance of Optimization 3 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 3	FD	Baseline	Opt 3	vs FD	vs CU baseline
32	137	143	119	262	250	300	1.14	1.20
64	234	257	206	305	278	348	1.14	1.25
128	432	455	368	331	314	389	1.17	1.24
256	815	866	696	351	331	411	1.17	1.24
512	1581	1659	1338	362	345	428	1.18	1.24

Optimization 4: Remove local memory usage for row sum

Problem Analysis:

Similar to Optimization 3, the local memory usage problem is also observed during the row sum computation in the softmax computation. Since local memory is off chip, accessing it as if accessing registers can hurt the performance of the kernel.

Solution:

We apply the same solution as for the max QK^T computation to the row sum computation. That is, each thread computes the row sum of only one head, which requires only one float per thread. This eliminates the need for local memory.

Results:

Table 6 Performance of Optimization 4 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 4	FD	Baseline	Opt 4	vs FD	vs CU baseline
32	137	143	118	262	250	302	1.15	1.21
64	234	257	204	305	278	351	1.15	1.26
128	432	455	364	331	314	393	1.19	1.25
256	815	866	688	351	331	416	1.18	1.26
512	1581	1659	1328	362	345	431	1.19	1.25

Optimization 5: Add prefetch for `V` load

Problem Analysis:

The same issue as with K loading is observed when loading V. That is, the kernel issues the data load and then waits to consume the data immediately, which exposes the global load latency. However, when using the unrolling technique from Optimization 1, the compiler allocates the temporary buffer in local memory instead of registers, causing a large slowdown.

Solution:

We adopt the data prefetching technique for V loading. We load the next iteration's V values immediately after the current iteration's values are consumed. This allows the data loading to be overlapped with the PV computation, resulting in better kernel performance.
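
The loop structure of prefetching can be illustrated with a small host-side sketch. This is Python, so nothing actually overlaps here; the example only records the order in which loads and computations are issued. The function name, the tile representation, and the dot-product stand-in for the PV computation are hypothetical. On the GPU, a load that is issued before the math on the current tile can be in flight while that math runs.

```python
def pv_with_prefetch(p_tiles, v_tiles):
    """Accumulate sum_i dot(p_tiles[i], v_tiles[i]), loading tile i+1 before computing tile i.

    Assumes at least one tile. Returns the accumulated value and the issue order.
    """
    events = []

    def load_v(i):
        # Stand-in for a global memory load of the i-th V tile.
        events.append(f"load V[{i}]")
        return list(v_tiles[i])

    acc = 0.0
    next_v = load_v(0)  # prologue: load the first tile
    for i in range(len(v_tiles)):
        cur_v = next_v  # the current tile is consumed (moved to registers)
        if i + 1 < len(v_tiles):
            next_v = load_v(i + 1)  # issue the next load before the math
        events.append(f"compute PV[{i}]")
        acc += sum(p * v for p, v in zip(p_tiles[i], cur_v))
    return acc, events


total, order = pv_with_prefetch(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
)
print(total)
print(order)
```

In the recorded order, every load after the first one precedes the computation of the previous tile, which is the property that hides the load latency.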

Results:

Table 7 Performance of Optimization 5 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 5	FD	Baseline	Opt 5	vs FD	vs CU baseline
32	137	143	109	262	250	327	1.25	1.31
64	234	257	194	305	278	370	1.21	1.33
128	432	455	345	331	314	414	1.25	1.32
256	815	866	649	351	331	441	1.26	1.33
512	1581	1659	1244	362	345	460	1.27	1.33

Optimization 6: Add Group-Wise INT4 (Groups = 4) with Vector Load

Problem Analysis:

Prior to this optimization, CU only supported row-wise INT4 quantization. That is, every column in each row shares the same scales. The scales of each row are stored in the first 4 bytes of the row, as shown in Figure 10. In the kernel, each thread loads only one row at a time. Since each row contains 68 bytes (4 bytes for scales and 64 bytes for data), there is no guarantee that every row is aligned to the size of any vector type. Thus, vector loads cannot be used for loading the KV cache.

Figure 10 The layout of each row of INT4 KV cache with row-wise quantization

Solution:

We have implemented support for group-wise INT4 quantization with num groups = 4. In this case, columns in each row in the KV cache tensor are divided into 4 equal groups. Columns within the same group share the same scales for quantization/dequantization. The data layout for INT4 KV cache is shown in Figure 11. The scales for all groups are serialized and stored at the beginning of each row. The INT4 data is also serialized and laid out next to the scales.

Because each row now occupies 80 bytes (16 bytes for scales and 64 bytes for data), we can use a vector type, i.e., uint2 in our case, to load data. (We do not use uint4 since each thread loads only 16 INT4s at a time due to the tensor core fragment size.) Vector load is generally better than scalar load since it does not cause extra byte loads.
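
The row-size arithmetic behind this can be checked with a few lines of Python. The sketch assumes 128 INT4 values per row (implied by the 64 data bytes) and 4 bytes of scales per group (implied by the 68-byte and 80-byte totals); the function names are hypothetical. It also assumes the base pointer of the cache is itself aligned, so that row alignment depends only on the row size.

```python
VECTOR_BYTES = {"uint2": 8, "uint4": 16}  # sizes of the CUDA vector types


def row_bytes(num_int4_values: int, num_groups: int, scale_bytes_per_group: int = 4) -> int:
    """Bytes per KV cache row: serialized scales followed by packed INT4 data."""
    data_bytes = num_int4_values // 2  # two INT4 values per byte
    return num_groups * scale_bytes_per_group + data_bytes


def aligned_vector_types(nbytes: int) -> list:
    """Vector types whose size evenly divides the row size."""
    return [name for name, size in VECTOR_BYTES.items() if nbytes % size == 0]


row_wise = row_bytes(128, num_groups=1)
group_wise = row_bytes(128, num_groups=4)
print(row_wise, aligned_vector_types(row_wise))
print(group_wise, aligned_vector_types(group_wise))
```

With one group, consecutive rows start at multiples of 68 bytes, which are not all multiples of 8 or 16. With four groups, rows start at multiples of 80 bytes and the data begins 16 bytes into each row, so 8-byte loads stay aligned.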

Figure 11 The layout of each row of INT4 KV cache with group-wise quantization (num groups = 4)

Results:

Table 8 Performance of Optimization 6 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 6	FD	Baseline	Opt 6	vs FD	vs CU baseline
32	137	143	111	262	250	322	1.23	1.29
64	234	257	192	305	278	372	1.22	1.34
128	432	455	346	331	314	414	1.25	1.32
256	815	866	642	351	331	446	1.27	1.35
512	1581	1659	1244	362	345	460	1.27	1.33

Table 9 Performance of Optimization 6 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size	Time (us)		Bandwidth (GB/s)		Speed up
	FD	CUDA_WMMA	FD	CUDA_WMMA	vs FD
	FD	Opt 6	FD	Opt 6	vs FD
32	129	116	325	364	1.11
64	219	195	385	431	1.12
128	392	347	429	484	1.13
256	719	638	468	527	1.13
512	1375	1225	489	550	1.12

Optimization 7: Compute max `QK^T` From WMMA Fragment Directly (A100/H100 specific)

Problem Analysis:

We observe large stalls due to shared memory accesses during the max QK^T computation (showing up as large short scoreboard stalls), as shown in Figure 12.

Figure 12 Stalls due to shared memory access during max QK^T computation (the number that the arrow points to is stall cycles caused by shared memory wait)

Solution:

We bypass shared memory when computing max QK^T by computing it from the WMMA fragment (i.e., the tensor core fragment) directly. The layout of the WMMA fragment is specific to the GPU architecture, so we enable this optimization only for the NVIDIA A100/H100 GPUs. Other GPUs still use shared memory for the max QK^T computation. By bypassing shared memory, we effectively eliminate the stalls caused by shared memory access. The tensor core layout of the C fragment, which is used for storing the QK^T results, is shown in Figure 13.

Figure 13 C fragment (QK^T storage) tensor core layout on A100/H100

Table 10 Performance of Optimization 7 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 7	FD	Baseline	Opt 7	vs FD	vs CU baseline
32	137	143	107	262	250	333	1.27	1.33
64	234	257	183	305	278	391	1.28	1.40
128	432	455	333	331	314	430	1.30	1.37
256	815	866	620	351	331	461	1.31	1.40
512	1581	1659	1206	362	345	475	1.31	1.38

Table 11 Performance of Optimization 7 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CUDA_WMMA		FD	CUDA_WMMA		vs FD	vs CUDA_WMMA Opt 6
	FD	Opt 6	Opt 7	FD	Opt 6	Opt 7	vs FD	vs CUDA_WMMA Opt 6
32	129	116	111	325	364	380	1.17	1.04
64	219	195	187	385	431	449	1.17	1.04
128	392	347	333	429	484	506	1.18	1.04
256	719	638	615	468	527	547	1.17	1.04
512	1375	1225	1184	489	550	569	1.16	1.03

Optimization 8: Write FP32->BF16 Results to `P` Fragment Directly (A100/H100 specific)

Problem Analysis:

During the FP32->BF16 conversion for the P fragment, the kernel loads the FP32 data from shared memory, does the conversion, and then stores the BF16 data back to shared memory. Moreover, the conversion requires many thread block synchronizations (__syncthreads()).

Solution:

Due to the data partitioning design of the kernel, each warp performs only one pass through the P fragment. Thus, we do not have to write the conversion results back to the shared memory for future usage. To avoid writing the BF16 data to the shared memory and thread block synchronizations, we have each warp load the FP32 data of the P WMMA fragment from the shared memory, do the conversion and then write the BF16 data directly to the P fragment.

Note that this optimization is applied only to the NVIDIA A100 and H100 GPUs because the WMMA fragment layout is architecture dependent. For non-A100/H100 GPUs, the kernel falls back to the original path.

The P fragment tensor core layout is shown in Figure 14. Note that this layout is specific to the NVIDIA A100/H100 GPU.

Figure 14 P fragment tensor core layout on A100/H100

Table 12 Performance of Optimization 8 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 8	FD	Baseline	Opt 8	vs FD	vs CU baseline
32	137	143	101	262	250	353	1.35	1.41
64	234	257	174	305	278	410	1.34	1.47
128	432	455	317	331	314	451	1.36	1.43
256	815	866	590	351	331	485	1.38	1.47
512	1581	1659	1143	362	345	501	1.38	1.45

Table 13 Performance of Optimization 8 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CUDA_WMMA		FD	CUDA_WMMA		vs FD	vs CUDA_WMMA Opt 6
	FD	Opt 6	Opt 8	FD	Opt 6	Opt 8	vs FD	vs CUDA_WMMA Opt 6
32	129	116	106	325	364	396	1.22	1.09
64	219	195	180	385	431	467	1.21	1.08
128	392	347	319	429	484	528	1.23	1.09
256	719	638	596	468	527	565	1.21	1.07
512	1375	1225	1138	489	550	591	1.21	1.08

Optimization 9: Swizzle P Shared Memory Layouts (A100/H100 specific)

Problem Analysis:

We observe heavy shared memory bank conflicts during P loading. The number of bank conflicts depends on the memory access stride. For instance, for split-Ks = 32 and max seq length = 8192, we observed that only 4 out of 32 banks are accessed in parallel (memory access stride = 256). As Figure 14 shows, when all threads access element 0, threads that have the same threadIdx.x % 4 access the same bank.

Figure 15 P fragment in shared memory before swizzling
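
The stride-256 case from the problem analysis can be reproduced with a simple bank model. The sketch below is Python and rests on stated assumptions: shared memory has 32 banks of 4-byte words, the elements are FP32 (one word each), the stride is counted in elements, and lane `t` of a warp addresses row `t // 4`, column `2 * (t % 4)` for its element 0. That lane-to-element mapping is an assumption chosen to match the "same threadIdx.x % 4, same bank" observation; the authoritative layout is the one in the figures.

```python
NUM_BANKS = 32   # 4-byte-wide shared memory banks
WARP_SIZE = 32


def banks_touched(stride_words: int) -> int:
    """Number of distinct banks hit when every lane of a warp accesses its element 0."""
    banks = set()
    for lane in range(WARP_SIZE):
        row = lane // 4          # assumed fragment layout (see lead-in)
        col = 2 * (lane % 4)
        word_index = row * stride_words + col
        banks.add(word_index % NUM_BANKS)
    return len(banks)


print(banks_touched(256))  # stride from the example in the text
print(banks_touched(8))    # a hypothetical small constant stride
```

Because 256 is a multiple of 32, the row term vanishes modulo 32 and only the four column offsets remain, giving 4 banks. In the same model, a hypothetical constant stride of 8 words spreads the rows over 16 banks; this is only an observation about the model, not a description of the swizzled layout itself.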

Solution:

We shuffle the layout of P load/store in the shared memory in a way that avoids bank conflicts. In other words, we store the QK^T results (C fragment) and load them (P fragment) using the swizzled layout. Moreover, instead of using the original memory access stride, which depends on the number of tokens per thread block, we use the fragment’s column size as the stride, which is constant. Thus, the loads and stores of the P fragment are always contiguous.

The new layouts for the C and P fragments are shown in Figure 16. With the new layout, it is guaranteed that 16 banks are being accessed in parallel as shown in Figure 17.

Figure 16 The swizzled layouts of C and P fragments

Figure 17 P fragment in shared memory after swizzling

Table 14 Performance of Optimization 9 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 9	FD	Baseline	Opt 9	vs FD	vs CU baseline
32	137	143	98	262	250	365	1.39	1.46
64	234	257	167	305	278	429	1.41	1.54
128	432	455	299	331	314	479	1.45	1.52
256	815	866	549	351	331	521	1.48	1.58
512	1581	1659	1060	362	345	540	1.49	1.56

Table 15 Performance of Optimization 9 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CUDA_WMMA		FD	CUDA_WMMA		vs FD	vs CUDA_WMMA Opt 6
	FD	Opt 6	Opt 9	FD	Opt 6	Opt 9	vs FD	vs CUDA_WMMA Opt 6
32	129	116	105	325	364	400	1.23	1.10
64	219	195	174	385	431	484	1.26	1.12
128	392	347	302	429	484	558	1.30	1.15
256	719	638	560	468	527	601	1.28	1.14
512	1375	1225	1065	489	550	632	1.29	1.15

Optimization 10: Pad Shared Memory for INT4 Dequantization

Problem Analysis:

Once the kernel reads the INT4 K or V cache from global memory, it performs dequantization and stores the results (BF16) in the shared memory. Then, the BF16 data is loaded to the WMMA fragment from shared memory (via the WMMA interface). We observed a large number of bank conflicts for both K and V accesses. For instance, for K stores, only 4 out of 32 banks are being accessed in parallel. For K loads, 16 banks are being accessed in parallel. The same also occurs for V stores and loads. See the figures in the solution section.

Solution:

We pad the shared memory to reduce bank conflicts. Specifically, we pad each row by 2. That is, the row stride of K becomes F_K + 2 and the row stride of V becomes F_N + 2 (F_K and F_N are the fixed widths of the K and V WMMA fragments, respectively). With this optimization, we are able to reduce bank conflicts by 1.8x, as shown in Figure 18.

Figure 18 Bank conflicts before and after Optimization 10

After Optimization 10, for K stores, 32 banks are being accessed in parallel (shown in Figure 19), while for K loads, 29 banks are accessed in parallel (shown in Figure 20).

Figure 19 K fragment store shared memory layout without and with padding

Figure 20 K fragment load shared memory layout without and with padding
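
Why a padding of 2 helps can be seen in a small bank model. This Python sketch assumes 32 banks of 4-byte words, BF16 elements (two per word), a hypothetical fragment width of 16, and a hypothetical access pattern in which each lane of a warp touches the same column of a different row. The actual store and load patterns are the ones in Figures 19 and 20; the model only illustrates the mechanism.

```python
NUM_BANKS = 32       # 4-byte-wide shared memory banks
BF16_PER_WORD = 2    # two BF16 elements share one bank word
WARP_SIZE = 32


def banks_touched(row_stride_bf16: int, column: int = 0) -> int:
    """Distinct banks hit when lane i accesses element (row i, column)."""
    banks = set()
    for lane in range(WARP_SIZE):
        element_index = lane * row_stride_bf16 + column
        banks.add((element_index // BF16_PER_WORD) % NUM_BANKS)
    return len(banks)


F = 16  # hypothetical fragment width in BF16 elements
print(banks_touched(F))      # unpadded row stride
print(banks_touched(F + 2))  # row stride padded by 2 BF16 elements
```

Two BF16 elements are exactly one bank word, so the padding turns a row stride of 8 words into 9 words. A stride of 8 words cycles through only 4 banks, while 9 is coprime with 32 and therefore visits all 32 banks across the warp in this model.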

Table 16 Performance of Optimization 10 for INT4 GQA (row-wise quantization)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CU		FD	CU		vs FD	vs CU baseline
	FD	Baseline	Opt 10	FD	Baseline	Opt 10	vs FD	vs CU baseline
32	137	143	94	262	250	380	1.45	1.52
64	234	257	151	305	278	475	1.55	1.71
128	432	455	266	331	314	538	1.63	1.71
256	815	866	489	351	331	586	1.67	1.77
512	1581	1659	930	362	345	616	1.70	1.79

Table 17 Performance of Optimization 10 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size	Time (us)			Bandwidth (GB/s)			Speed up
	FD	CUDA_WMMA		FD	CUDA_WMMA		vs FD	vs CUDA_WMMA Opt 6
	FD	Opt 6	Opt 10	FD	Opt 6	Opt 10	vs FD	vs CUDA_WMMA Opt 6
32	129	116	99	325	364	425	1.31	1.17
64	219	195	161	385	431	523	1.36	1.21
128	392	347	282	429	484	598	1.39	1.23
256	719	638	509	468	527	662	1.41	1.25
512	1375	1225	965	489	550	698	1.43	1.27

Performance Evaluation

Microbenchmark results

We also evaluated BF16 GQA performance using our optimized kernel (as shown in Table 19). CU still performs generally worse than FD and FA for BF16. This is expected since our optimizations are INT4 focused.

While INT4 GQA is still not as efficient as BF16 GQA (see the achieved bandwidths), it is important to note that the latency of CU INT4 GQA is lower than that of FD BF16 GQA.

Table 19 Performance of BF16 GQA and INT4 GQA after CU optimizations

On A100

Time (us)	BF16 GQA				INT4 GQA
Batch size	FD	FA	CU before	CU after	FD	FA	CU before	CU after
32	139	133	183	163	137	–	143	94
64	245	229	335	276	234	–	257	151
128	433	555	596	517	432	–	455	266
256	826	977	1127	999	815	–	866	489
512	1607	1670	2194	1879	1581	–	1659	930

Effective Bandwidth (GB/s)	BF16 GQA				INT4 GQA
Batch size	FD	FA	CU before	CU after	FD	FA	CU before	CU after
32	965	1012	736	824	262	–	250	380
64	1097	1175	802	972	305	–	278	475
128	1240	968	901	1039	331	–	314	538
256	1301	1100	954	1075	351	–	331	586
512	1338	1287	980	1144	362	–	345	616

On H100

Time (us)	BF16 GQA				INT4 GQA
Batch size	FD	FA	CU before	CU after	FD	FA	CU before	CU after
32	91	90	114	100	70	–	96	64
64	148	146	200	183	113	–	162	101
128	271	298	361	308	205	–	294	170
256	515	499	658	556	389	–	558	306
512	1000	1011	1260	1066	756	–	1066	575

Effective Bandwidth (GB/s)	BF16 GQA				INT4 GQA
Batch size	FD	FA	CU before	CU after	FD	FA	CU before	CU after
32	1481	1496	1178	1341	511	–	371	560
64	1815	1840	1345	1470	631	–	443	710
128	1982	1802	1487	1743	699	–	487	844
256	2087	2156	1634	1934	736	–	513	935
512	2150	2127	1706	2015	757	–	537	996
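
To make the latency comparison concrete, the ratio of FD BF16 GQA time to CU INT4 GQA time (after optimizations) can be computed directly from the times in Table 19. The snippet below is a small Python sketch of that arithmetic; the dictionary layout and function name are just for illustration, and the numbers are copied from the table.

```python
# batch size -> (FD BF16 time in us, CU INT4 "after" time in us), from Table 19
A100 = {32: (139, 94), 64: (245, 151), 128: (433, 266), 256: (826, 489), 512: (1607, 930)}
H100 = {32: (91, 64), 64: (148, 101), 128: (271, 170), 256: (515, 306), 512: (1000, 575)}


def latency_ratio(table):
    """FD BF16 time divided by CU INT4 time, per batch size (values > 1 favor INT4)."""
    return {batch: round(bf16 / int4, 2) for batch, (bf16, int4) in table.items()}


for name, table in (("A100", A100), ("H100", H100)):
    print(name, latency_ratio(table))
```

On both GPUs the ratio stays above 1 for every batch size listed and grows with the batch size.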

E2E results

We evaluated our optimized INT4 GQA kernel in Llama 2 70B on 8 H100 GPUs. We ran the model end-to-end but report only the decode latency. We used FP8 FFN (feed forward network) to emphasize the attention performance in the decoding phase. We varied the batch size from 1 to 256 and the context length from 2,048 (2K) to 16,384 (16K). The E2E performance results are shown in the figure below.

Figure 21 Meta Llama 2 decode latency (ms) comparison (BF16 GQA runs out of memory in large batch size configurations)

Code

If you are interested, please check out our code here. If you have any questions, please feel free to open an issue on GitHub, and we will be happy to help. Your contributions are welcome!
