Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart

Today, we are excited to announce that the Jina Embeddings v2 model, developed by Jina AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running model inference. This state-of-the-art model supports an impressive 8,192-token context length. You can deploy this model with SageMaker JumpStart, a machine learning (ML) hub with foundation models, built-in algorithms, and pre-built ML solutions that you can deploy with just a few clicks.

Text embedding refers to the process of transforming text into numerical representations that reside in a high-dimensional vector space. Text embeddings have a broad range of applications in enterprise artificial intelligence (AI), including the following:

  • Multimodal search for ecommerce
  • Content personalization
  • Recommender systems
  • Data analytics

Jina Embeddings v2 is a state-of-the-art collection of text embedding models, trained by Berlin-based Jina AI, that boast high performance on several public benchmarks.

In this post, we walk through how to discover and deploy the jina-embeddings-v2 model as part of a Retrieval Augmented Generation (RAG)-based question answering system in SageMaker JumpStart. You can use this tutorial as a starting point for a variety of chatbot-based solutions for customer service, internal support, and question answering systems based on internal and private documents.

What is RAG?

RAG is the process of optimizing the output of a large language model (LLM) so it references an authoritative knowledge base outside of its training data sources before generating a response.

LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

What does Jina Embeddings v2 bring to RAG applications?

A RAG system uses a vector database to serve as a knowledge retriever. It must extract a query from a user’s prompt and send it to a vector database to reliably find as much semantic information as possible. The following diagram illustrates the architecture of a RAG application with Jina AI and Amazon SageMaker.

Jina Embeddings v2 is the preferred choice for experienced ML scientists for the following reasons:

  • State-of-the-art performance – We have shown on various text embedding benchmarks that Jina Embeddings v2 models excel on tasks such as classification, reranking, summarization, and retrieval. Some of the benchmarks demonstrating their performance are MTEB, an independent study of combining embedding models with reranking models, and the LoCo benchmark by a Stanford University group.
  • Long input-context length – Jina Embeddings v2 models support 8,192 input tokens. This makes the models especially powerful at tasks such as clustering for long documents like legal text or product documentation.
  • Support for bilingual text input – Recent research shows that multilingual models without specific language training show strong biases towards English grammatical structures in embeddings. Jina AI’s bilingual embedding models include jina-embeddings-v2-base-de, jina-embeddings-v2-base-zh, jina-embeddings-v2-base-es, and jina-embeddings-v2-base-code. They were trained to encode texts in a combination of English-German, English-Chinese, English-Spanish, and English-Code, respectively, allowing the use of either language as the query or target document in retrieval applications.
  • Cost-effectiveness of operating – Jina Embeddings v2 provides high performance on information retrieval tasks with relatively small models and compact embedding vectors. For example, jina-embeddings-v2-base-de has a size of 322 MB with a performance score of 60.1%. A smaller vector size translates into significant cost savings when storing embeddings in a vector database.

What is SageMaker JumpStart?

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. Developers can deploy foundation models to dedicated SageMaker instances within a network-isolated environment, and customize models using SageMaker for model training and deployment.

You can now discover and deploy a Jina Embeddings v2 model with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines and Amazon SageMaker Debugger. With SageMaker JumpStart, the model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.

Jina Embeddings models are available in AWS Marketplace so you can integrate them directly into your deployments when working in SageMaker.

AWS Marketplace enables you to find third-party software, data, and services that run on AWS and manage them from a centralized location. AWS Marketplace includes thousands of software listings and simplifies software licensing and procurement with flexible pricing options and multiple deployment methods.

Solution overview

We’ve prepared a notebook that constructs and runs a RAG question answering system using Jina Embeddings and the Mixtral 8x7B LLM in SageMaker JumpStart.

In the following sections, we give you an overview of the main steps needed to bring a RAG application to life using generative AI models on SageMaker JumpStart. Although we omit some of the boilerplate code and installation steps in this post for reasons of readability, you can access the full Python notebook to run yourself.

Connecting to a Jina Embeddings v2 endpoint

To start using Jina Embeddings v2 models, complete the following steps:

  1. In SageMaker Studio, choose JumpStart in the navigation pane.
  2. Search for “jina” and you will see the provider page link and models available from Jina AI.
  3. Choose Jina Embeddings v2 Base – en, which is Jina AI’s English language embeddings model.
  4. Choose Deploy.
  5. In the dialog that appears, choose Subscribe, which will redirect you to the model’s AWS Marketplace listing, where you can subscribe to the model after accepting the terms of usage.
  6. After subscribing, return to SageMaker Studio and choose Deploy.
  7. You will be redirected to the endpoint configuration page, where you can select the instance most suitable for your use case and provide a name for the endpoint.
  8. Choose Deploy.

After you create the endpoint, you can connect to it with the following code snippet:

from jina_sagemaker import Client
 
client = Client(region_name=region)
# Make sure the endpoint name below matches the name you gave the JumpStart endpoint in the previous step.
endpoint_name = "my-jina-embeddings-endpoint"
 
client.connect_to_endpoint(endpoint_name=endpoint_name)
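
As a quick check that the endpoint responds, you can embed a short text. The call and response format below match how the endpoint is used later in this post:

# Embed a short text to verify that the endpoint is reachable.
response = client.embed(texts=["Hello from SageMaker JumpStart!"])
print(len(response[0]["embedding"]))  # dimensionality of the returned embedding vector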

Preparing a dataset for indexing

In this post, we use a public dataset from Kaggle (CC0: Public Domain) that contains audio transcripts from the popular YouTube channel Kurzgesagt – In a Nutshell, which has over 20 million subscribers.

Each row in this dataset contains the title of a video, its URL, and the corresponding text transcript.
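
A minimal way to load the downloaded CSV into a pandas dataframe could look like the following. The file name is an assumption (use the name of your Kaggle download), and tqdm is imported here because progress_apply is used later in this post:

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # enables df.progress_apply, used during embedding generation below

# Adjust the file name to match your Kaggle download.
df = pd.read_csv("kurzgesagt_transcripts.csv")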

Enter the following code:

df.head()

Because these video transcripts can be quite long (each video is around 10 minutes), chunk each transcript before indexing it so that retrieval returns only the content relevant to a user’s question rather than unrelated parts of the transcript:

def chunk_text(text, max_words=1024):
    """
    Divide text into chunks where each chunk contains the maximum number of full sentences under `max_words`.
    """
    sentences = text.split('.')
    chunk = []
    word_count = 0

    for sentence in sentences:
        sentence = sentence.strip()  # drop surrounding whitespace left over from the split
        if not sentence:
            continue

        words_in_sentence = len(sentence.split())
        if word_count + words_in_sentence <= max_words:
            chunk.append(sentence)
            word_count += words_in_sentence
        else:
            # Yield the current chunk and start a new one
            if chunk:
                yield '. '.join(chunk).strip() + '.'
            chunk = [sentence]
            word_count = words_in_sentence

    # Yield the last chunk if it's not empty
    if chunk:
        yield '. '.join(chunk).strip() + '.'

The parameter max_words defines the maximum number of full words that can be in a chunk of indexed text. Many chunking strategies exist in academic and non-peer-reviewed literature that are more sophisticated than a simple word limit. However, for the purpose of simplicity, we use this technique in this post.
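
For example, you can inspect how the first transcript in the dataframe is split (the Text column name matches the dataset described earlier):

# Chunk the first transcript and inspect the result.
example_chunks = list(chunk_text(df.iloc[0]["Text"]))
print(f"Transcript split into {len(example_chunks)} chunks")
print(example_chunks[0][:100])  # first 100 characters of the first chunk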

Index text embeddings for vector search

After you chunk the transcript text, you obtain embeddings for each chunk and link each chunk back to the original transcript and video title:

import numpy as np

def generate_embeddings(text_df):
    """
    Generate an embedding for each chunk created in the previous step.
    """
    chunks = list(chunk_text(text_df['Text']))
    embeddings = []

    for chunk in chunks:
        response = client.embed(texts=[chunk])
        chunk_embedding = response[0]['embedding']
        embeddings.append(np.array(chunk_embedding))

    text_df['chunks'] = chunks
    text_df['embeddings'] = embeddings
    return text_df

print("Embedding text chunks ...")
df = df.progress_apply(generate_embeddings, axis=1)

The dataframe df now contains the chunks and embeddings columns, which you can load into the vector database of your choice. Embeddings can then be retrieved from the vector database with a function such as find_most_similar_transcript_segment(query, n), which retrieves the n chunks closest to a given user query.
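
The full notebook defines this retrieval helper against a vector database. As a minimal in-memory sketch, assuming the chunks and embeddings columns created above and cosine similarity as the similarity measure, it could look like the following; it returns (segment, row position) pairs in the format consumed in the next sections:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_most_similar_transcript_segment(query, n=3):
    """Return the n (chunk text, dataframe row position) pairs whose embeddings
    are closest to the query embedding by cosine similarity."""
    query_embedding = np.array(client.embed(texts=[query])[0]["embedding"])
    scored = []
    for position, row in enumerate(df.itertuples()):
        for chunk, chunk_embedding in zip(row.chunks, row.embeddings):
            scored.append((cosine_similarity(query_embedding, chunk_embedding), chunk, position))
    # Sort by similarity and keep the top n results.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk, position) for _, chunk, position in scored[:n]]

Against a production vector database, you would instead run the equivalent nearest-neighbor query on the database side.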

Prompt a generative LLM endpoint

For question answering based on an LLM, you can use the Mistral 7B-Instruct model on SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel
from string import Template

# Define the LLM to be used and deploy through Jumpstart.
jumpstart_model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct", role=role)
model_predictor = jumpstart_model.deploy()

# Define the prompt template to be passed to the LLM
prompt_template = Template("""
  <s>[INST] Answer the question below only using the given context.
  The question from the user is based on transcripts of videos from a YouTube
    channel.
  The context is presented as a ranked list of information in the form of
    (video-title, transcript-segment), that is relevant for answering the
    user's question.
  The answer should only use the presented context. If the question cannot be
    answered based on the context, say so.
 
  Context:
  1. Video-title: $title_1, transcript-segment: $segment_1
  2. Video-title: $title_2, transcript-segment: $segment_2
  3. Video-title: $title_3, transcript-segment: $segment_3
 
  Question: $question
 
  Answer: [/INST]
""")

Query the LLM

Now, for a query sent by a user, you first find the n semantically closest transcript chunks from any Kurzgesagt video (using the vector distance between the chunk embeddings and the embedding of the user’s query), and then provide those chunks as context to the LLM for answering the user’s query:

# Define the query and insert it into the prompt template together with the context to be used to answer the question
question = "Can climate change be reversed by individuals' actions?"
search_results = find_most_similar_transcript_segment(question)
 
prompt_for_llm = prompt_template.substitute(
    question = question,
    title_1 = df.iloc[search_results[0][1]]["Title"].strip(),
    segment_1 = search_results[0][0],
    title_2 = df.iloc[search_results[1][1]]["Title"].strip(),
    segment_2 = search_results[1][0],
    title_3 = df.iloc[search_results[2][1]]["Title"].strip(),
    segment_3 = search_results[2][0]
)

# Generate the answer to the question passed in the prompt
payload = {"inputs": prompt_for_llm}
model_predictor.predict(payload)

Based on the preceding question, the LLM might respond with an answer such as the following:

Based on the provided context, it does not seem that individuals can solve climate change solely through their personal actions. While personal actions such as using renewable energy sources and reducing consumption can contribute to mitigating climate change, the context suggests that larger systemic changes are necessary to address the issue fully.

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:

model_predictor.delete_model()
model_predictor.delete_endpoint()

Conclusion

By taking advantage of the features of Jina Embeddings v2 to develop RAG applications, together with the streamlined access to state-of-the-art models on SageMaker JumpStart, developers and businesses are now empowered to create sophisticated AI solutions with ease.

Jina Embeddings v2’s extended context length, support for bilingual documents, and small model size enable enterprises to quickly build natural language processing use cases based on their internal datasets without relying on external APIs.

Get started with SageMaker JumpStart today, and refer to the GitHub repository for the complete code to run this sample.

Connect with Jina AI

Jina AI remains committed to leadership in bringing affordable and accessible AI embeddings technology to the world. Our state-of-the-art text embedding models support English and Chinese and soon will support German, with other languages to follow.

For more information about Jina AI’s offerings, check out the Jina AI website or join our community on Discord.


About the Authors

Francesco Kruk is a Product Management intern at Jina AI and is completing his Master’s at ETH Zurich in Management, Technology, and Economics. With a strong business background and knowledge of machine learning, Francesco helps customers implement RAG solutions using Jina Embeddings in an impactful way.

Saahil Ognawala is Head of Product at Jina AI based in Munich, Germany. He leads the development of search foundation models and collaborates with clients worldwide to enable quick and efficient deployment of state-of-the-art generative AI products. With an academic background in machine learning, Saahil is now interested in scaled applications of generative AI in the knowledge economy.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.


Here Comes a New Challenger: ‘Street Fighter 6’ Joins GeForce NOW

Capcom’s latest entry in the iconic Street Fighter series, Street Fighter 6, punches its way into the cloud this GFN Thursday. The game, along with Ubisoft’s XDefiant, leads six new games joining the GeForce NOW library.

A new reward makes its way to the cloud gaming service’s Ultimate and Priority members. For a limited time, GeForce NOW members who are new to Xbox PC Game Pass can get three months of Microsoft’s subscription service free, just by opting into the GeForce NOW Rewards program.

Plus, make sure to follow @NVIDIAGFN on X to see picturesque in-game locations from where members are sending their #GreetingsfromGFN.

Get Ready to Rumble

Street Fighter 6 on GeForce NOW
Are Ryu ready?

Unleash the ultimate Hadoken with Street Fighter 6 on GeForce NOW. The renowned 2D fighting game returns with intense battles, special moves, combos and Super Art attacks to defeat opponents. With a roster of 22 iconic fighters, including classic World Warriors like Ryu, Chun-Li, Guile and Akuma, plus all-new characters like Kimberly, Jamie, Marisa and Manon, there’s no better time to hit the streets.

The newest installment introduces innovative features and enhanced visuals across three distinct game modes — Fighting Ground, World Tour and Battle Hub — for gamers to level up and put their skills to the test. The game’s blend of classic mechanics and fresh enhancements is captivating longtime fans and newcomers alike.

Become a World Warrior in the cloud with a GeForce NOW Ultimate membership and stream all the fighting glory at up to stunning 4K resolution. Witness every punch, kick and Hadoken with others by hopping online for some head-to-head competition.

Defying Gravity

XDefiant on GeForce NOW
Discover which faction will reign supreme in “XDefiant.”

XDefiant, a free-to-play first-person shooter, combines intense gunplay with strategic team dynamics. Set in a world where factions inspired by iconic Ubisoft franchises clash, the game enables players to customize their loadouts and engage in fast-paced battles. Choose stealthy tactics or all-out aggression for a diverse and thrilling multiplayer experience.

Prepare for adrenaline-fueled firefights and tactical showdowns at up to 240 frames per second with an Ultimate membership. Every frame counts in the fight against other factions.

Get in the Pass Lane

PC Game Pass member reward on GeForce NOW
It’s rewarding to be a GeForce NOW member.

Get ready for a summer of gaming. GeForce NOW Ultimate and Priority members new to PC Game Pass and part of the GeForce NOW Rewards program can now receive three free months of Microsoft’s service.

With PC Game Pass and GeForce NOW, members can play high-quality Xbox PC titles with the power of an NVIDIA GeForce RTX server in the cloud. Jump into the action in iconic franchises like Starfield, Forza Motorsport and Remnant II with support for more titles added every GFN Thursday.

This special offer is available for a limited time, and only for GeForce NOW members new to PC Game Pass.

Mischief Managed

Sneak out on GeForce NOW
Hide and seek on an epic scale.

Get into all kinds of mischief and fun in Sneak Out from Kinguin Studios. Enter the Haunted Castle and prepare to hunt, hide or prank, causing all kinds of hilarious mayhem while trying to win a deadly game of hide and seek.

Check out the list of new games this week:

  • Killer Klowns from Outer Space: The Game (New release on Steam, June 4)
  • Autopsy Simulator (New release on Steam, June 6)
  • Chornobyl Liquidators (New release on Steam, June 6)
  • Sneak Out (New release on Steam, June 6)
  • Farm Together 2 (Steam)
  • Street Fighter 6 (Steam)
  • XDefiant (Ubisoft)

What are you planning to play this weekend? Let us know on X or in the comments below.


INT4 Decoding GQA CUDA Optimizations for LLM Inference

An efficient decoding Grouped-Query Attention with low-precision KV cache

Introduction

Generative AI has taken the world by storm with its ability to generate content like humans. Many of these generative AI tools are powered by large language models (LLMs), like Meta Llama models and OpenAI’s ChatGPT. One of the main challenges of LLMs is supporting large “context lengths” (also known as “sequence lengths”). The context length refers to the number of tokens that the model uses to understand the input context and generate responses. Longer context lengths generally translate into higher precision and quality in the responses. However, long context lengths are compute and memory intensive. This is mainly due to the following reasons:

  • The computational complexity of attention layers increases proportionally with the context length (the growth rate depends on the attention algorithm). As a result, when using long context lengths, the attention layers can become a bottleneck, particularly during the prefill phase where attentions are compute bound.
  • The KV cache size grows linearly with the context length, thus, putting higher pressure on the memory requirement and consequently slowing down the already memory-bound attention decoding. Moreover, since the memory capacity is limited, the batch size reduces when the KV cache gets bigger, which generally results in a drop in throughput.

The computational complexity growth is difficult to solve compared to the other problem mentioned above. One way to address the KV cache size growth problem is to use a low-precision KV cache. From our experiments, group-wise INT4 quantization provides accuracy comparable to a BF16 KV cache during the decode phase in Meta Llama 2 inference. However, we did not observe any latency improvement, despite reading 4x less data in the attention decoding layers. This means that the INT4 attention is 4x less efficient at utilizing precious HBM bandwidth than BF16 attention.

In this note, we discuss the CUDA optimizations that we applied to INT4 GQA (grouped-query attention – the attention layer that we use in the LLM inference phase) to improve its performance by up to 1.8x on the NVIDIA A100 GPU and 1.9x on the NVIDIA H100 GPU.

  • The optimized CUDA INT4 GQA outperformed INT4 Flash-Decoding GQA (the best performing INT4 GQA that we used in the experiment mentioned above) by 1.4x-1.7x on A100 and 1.09x-1.3x on H100.
  • The optimized CUDA INT4 GQA performs better than BF16 Flash-Decoding GQA by 1.5x-1.7x on A100 and 1.4x-1.7x on H100.

Background

GQA for LLM Inference

Grouped-Query Attention (GQA) is a variant of multi-head attention (MHA) where each KV cache head is shared across a group of query heads. Our LLM inference adopts GQA as an attention layer in both the prefill and decode phases in order to reduce the capacity requirement for the KV cache. We use multiple GPUs in inference where the KV cache and query heads are distributed across GPUs. Each GPU runs an attention layer with a single KV head and a group of Q heads. Therefore, when viewed from a single GPU perspective, the GQA component can also be described as MQA (Multi-Query Attention).

The simplified workflow of decoding GQA is illustrated in Figure 1. GQA takes three main inputs: input query (denoted Q), K cache (denoted K), and V cache (denoted V). Our current GQA inference uses BF16 for Q, K, and V.

  • Q is a 4D BF16 tensor of shape (B, 1, HQ, D)
  • K is a 4D BF16 tensor of shape (B, Tmax, HKV, D)
  • V is a 4D BF16 tensor of shape (B, Tmax, HKV, D)

where

  • B is the batch size (the number of input prompts)
  • HQ is the number of query heads
  • HKV is the number of KV heads (HQ must be divisible by HKV)
  • Tmax is the maximum context length
  • D is the head dimension (fixed to 128)

GQA is simply bmm(softmax(bmm(Q, KT) / sqrt(D)), V). This yields a single output tensor (denoted as O) which is a 4D BF16 tensor that has the same shape as Q. Note that matrix multiplications are performed using BF16, however, accumulation and softmax are carried out in FP32. We call this “BF16 GQA” as the KV cache is BF16.

Figure 1: The simplified workflow of BF16 GQA for LLM inference
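
As an illustration only (not the optimized kernel discussed in this post), a minimal PyTorch sketch of this decode-step computation with the shapes listed above might look like the following:

import torch

def decode_gqa_reference(Q, K, V):
    """Unoptimized reference for decoding GQA: O = softmax(Q KT / sqrt(D)) V.
    Q: (B, 1, HQ, D); K, V: (B, Tmax, HKV, D); HQ must be divisible by HKV.
    """
    B, _, HQ, D = Q.shape
    _, T, HKV, _ = K.shape
    group = HQ // HKV
    # Repeat each KV head so that every query head has a matching KV head.
    k = K.repeat_interleave(group, dim=2).permute(0, 2, 1, 3)   # (B, HQ, T, D)
    v = V.repeat_interleave(group, dim=2).permute(0, 2, 1, 3)   # (B, HQ, T, D)
    q = Q.permute(0, 2, 1, 3)                                   # (B, HQ, 1, D)
    scores = torch.matmul(q, k.transpose(-1, -2)) / D ** 0.5    # (B, HQ, 1, T)
    probs = torch.softmax(scores.float(), dim=-1).to(q.dtype)   # softmax in FP32
    out = torch.matmul(probs, v)                                # (B, HQ, 1, D)
    return out.permute(0, 2, 1, 3)                              # same shape as Q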

INT4 GQA

To further reduce the size of the KV cache, we explore the possibility of using INT4 for KV cache instead of BF16. We estimate the potential performance improvement by calculating the computational intensity (CI) of INT4 GQA and comparing it to that of BF16 GQA, as CI represents FLOPS per byte. We compute the CI for QKT and PV (as shown in Equation 1) as they take KV cache as an operand. Note that we disregard the Q load as it is negligible compared to the KV cache. We also ignore any intermediate data loads/stores that are not on global memory. Thus, the CI only takes into account the computation FLOPS and KV cache loads.

Equation (1)

Assuming that HQ = 8 and HKV = 1, CI for BF16 KV cache is 8 while CI for INT4 KV cache is 32. The CIs indicate that both BF16 and INT4 GQAs are memory bound (the peak CIs for BF16 tensor cores for A100 and H100 are 312 TF / 2 TB/s = 141 and 990 TF / 3.35 TB/s = 269; note that these TF numbers are without sparsity). Moreover, with INT4 KV cache, we should expect up to 4x performance improvement compared to BF16 GQA.
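
Spelling out that calculation (our reading of Equation (1), consistent with the CI values above): QKT and PV together perform roughly 4 · B · HQ · T · D FLOPS, while the K and V caches load 2 · B · T · HKV · D · sizeof(element) bytes, so CI ≈ 2 · HQ / (HKV · sizeof(element)). A quick check in Python:

def gqa_ci(hq, hkv, bytes_per_kv_element):
    # FLOPS: 2*B*HQ*T*D for QKT plus 2*B*HQ*T*D for PV; bytes: K cache + V cache.
    # B, T, and D cancel out of the ratio.
    return (4 * hq) / (2 * hkv * bytes_per_kv_element)

print(gqa_ci(hq=8, hkv=1, bytes_per_kv_element=2))    # BF16 KV cache -> 8.0
print(gqa_ci(hq=8, hkv=1, bytes_per_kv_element=0.5))  # INT4 KV cache -> 32.0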

To enable INT4 KV cache support in GQA, we can dequantize the KV cache from INT4 to BF16 before passing it to the BF16 GQA operator. However, since KV cache is typically large, copying it from/to global memory can be costly. Moreover, decoding GQA is a memory bound operation (the memory unit is utilized much more heavily than the compute unit). Figure 2 shows the NCU profile of the FMHA CUTLASS BF16 GQA kernel in xFormers, which is one of the state of the art implementations of GQA. From the figure, it is obvious that memory is a bottleneck.

Figure 2: The NCU profile of the FMHA CUTLASS BF16 kernel in xFormers

A more efficient alternative is to fuse INT4 dequantization with the GQA operation (shown in Figure 3). In other words, GQA reads the INT4 KV cache directly and performs the INT4-to-BF16 conversion within the kernel. This change can potentially reduce the amount of global memory reads required for the KV cache, which could lead to a decrease in latency. We call this “INT4 GQA.”

Figure 3: The workflow of fused INT4 GQA

Table 1 lists the state-of-the-art implementations of GQA along with their features.

Table 1 State-of-the-art GQA implementations

Implementation | Denoted as | BF16 GQA | Fused INT4 GQA
Flash-Decoding (Triton implementation) | FD | Yes | Yes
Flash Attention (v2.3.3) | FA | Yes | No
CUDA baseline | CU | Yes | Yes

All implementations, except for CU, support both split-K and non split-K. CU only has the split-K implementation. Only FA has a heuristic in the backend to determine whether to run the split-K or non split-K kernel. For other implementations, users must explicitly choose which version to run. In this note, we focus on long context lengths (in our experiments, we use a context length of 8192) and therefore opt for the split-K version wherever possible.

As the baseline, we measured the performance of the state of the art GQA implementations on NVIDIA A100 and H100 GPUs. The latency (time in microseconds) and achieved bandwidth (GB/s) are reported in Table 2. Note that we ran a range of split-Ks (from 2 to 128 splits) and reported the best performance for each implementation. For all experiments, we use a context length of 8192. For INT4 GQA, we used row-wise quantization (i.e., num quantized groups = 1).

Table 2 Baseline GQA performance

On A100

Time (us)

Batch size | FD (BF16) | FA (BF16) | CU (BF16) | FD (INT4) | CU (INT4)
32 | 139 | 133 | 183 | 137 | 143
64 | 245 | 229 | 335 | 234 | 257
128 | 433 | 555 | 596 | 432 | 455
256 | 826 | 977 | 1127 | 815 | 866
512 | 1607 | 1670 | 2194 | 1581 | 1659

Effective Bandwidth (GB/s)

Batch size | FD (BF16) | FA (BF16) | CU (BF16) | FD (INT4) | CU (INT4)
32 | 965 | 1012 | 736 | 262 | 250
64 | 1097 | 1175 | 802 | 305 | 278
128 | 1240 | 968 | 901 | 331 | 314
256 | 1301 | 1100 | 954 | 351 | 331
512 | 1338 | 1287 | 980 | 362 | 345

(FA is not listed under INT4 because it does not support the INT4 KV cache.)

On H100

Time (us)

Batch size | FD (BF16) | FA (BF16) | CU (BF16) | FD (INT4) | CU (INT4)
32 | 91 | 90 | 114 | 70 | 96
64 | 148 | 146 | 200 | 113 | 162
128 | 271 | 298 | 361 | 205 | 294
256 | 515 | 499 | 658 | 389 | 558
512 | 1000 | 1011 | 1260 | 756 | 1066

Effective Bandwidth (GB/s)

Batch size | FD (BF16) | FA (BF16) | CU (BF16) | FD (INT4) | CU (INT4)
32 | 1481 | 1496 | 1178 | 511 | 371
64 | 1815 | 1840 | 1345 | 631 | 443
128 | 1982 | 1802 | 1487 | 699 | 487
256 | 2087 | 2156 | 1634 | 736 | 513
512 | 2150 | 2127 | 1706 | 757 | 537

First, let’s discuss the BF16 GQA performance: CU ranks last in terms of performance among all implementations. FD and FA have comparable performance. When the batch size is less than or equal to 64, FA utilizes the split-K kernel and performs slightly better than FD. However, when the batch size is greater than 64, FD performs better.

The same trend holds true for INT4 GQAs. However, we did not measure the performance of FA as it does not support INT4 KV cache. FD outperforms CU for all cases.

When comparing the latencies of FD between BF16 and INT4 GQAs, we find that they are almost identical. This suggests that INT4 GQA is highly inefficient, which can be further confirmed by the significantly lower achievable bandwidth for INT4 GQA compared to BF16 GQA. The same trend is also true when looking at the performance of CU.

CUDA with Tensor Cores INT4 GQA Implementation

In this section, we briefly describe our baseline implementation, which is CUDA with tensor cores INT4 GQA (CU). Each thread block processes only one KV head and a group of query heads from one input prompt. Therefore, each thread block performs mm(softmax(mm(Q, KT) / sqrt(D)), V); notice that mm is performed, not bmm. Moreover, since this is a split-K implementation, tokens in the KV cache are split among different thread blocks. Note that each thread block contains 4 warps (each warp contains 32 threads for NVIDIA A100 and H100 GPUs). Work in each thread block is split among warps. Within each warp, we use the WMMA API to compute matrix multiplication on tensor cores. Figure 4 demonstrates the work partitioning in CU.

Figure 4: CU work partitioning

Optimizing CUDA with Tensor Cores Kernel of INT4 GQA

In this note, we discuss the optimizations that we have applied to the CUDA with tensor cores implementation of INT4 GQA (CU). The ideal goal is to improve the INT4 GQA performance by 4 times based on the CI analysis in the previous section. Note that the query size is negligible compared to the KV cache size when the context length is long.

In our analysis, we used the NVIDIA Nsight Compute (NCU) as the main profiler. Our general bottleneck elimination approach is to minimize the stall cycles. We applied 10 optimizations to INT4 GQA, three of which are specific for NVIDIA A100/H100 GPUs. These optimizations are well known CUDA optimization techniques which can be generalized to many applications.

It is worth noting that we chose to optimize the CUDA implementation rather than the Flash-Decoding implementation (FD, which is Triton based) because CUDA gives us better control over how the low-level instructions are generated. Many of the optimization techniques that we apply, such as operating on tensor core fragments directly (Optimizations 7-9), cannot be done through Triton since it does not expose low-level details to developers. However, these optimizations can be integrated into the compiler-based solution to make them available to broader operators, which is indeed a part of our future plan.

Optimization 1: Unroll K Loads

Problem Analysis:

The NCU profile shows that during K loading, there are only 2 global loads followed by memory stalls at dequantize_permuted_int4. The memory stalls are long scoreboard stalls, which indicate waits for global memory access. This suggests that the kernel does not issue sufficient memory loads to hide the global load latency: the kernel issues data loading and then waits to consume the data immediately, causing the global load latency to be exposed. The stalls are shown in Figure 5.

Figure 5: K loading before unrolling (the numbers that the arrows point to are stall cycles caused by global memory wait)

Solution:

In the baseline implementation, we use uint32_t to load 8 INT4 K values in a single load and we perform 2 uint32_t loads in each iteration, which is 16 INT4 K values. To allow for a better global load latency hiding, we issue 8 uint32_t loads instead of two before consuming the K values in dequantize_permuted_int4. This allows the compiler to unroll the loads as well as reorder the instructions to hide the global load latency better. Figure 6 shows the NCU profile of K loading after unrolling. Comparing Figure 5 and Figure 6, we effectively reduce the stall cycles by unrolling the K loads.

Figure 6: K loading after unrolling (the numbers that the arrows point to are stall cycles caused by global memory wait)

Results:

Table 3 Performance of Optimization 1 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 1 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 1 BW (GB/s) | Speedup (Opt 1 vs FD) | Speedup (Opt 1 vs CU baseline)
32 | 137 | 143 | 134 | 262 | 250 | 267 | 1.02 | 1.07
64 | 234 | 257 | 237 | 305 | 278 | 302 | 0.99 | 1.09
128 | 432 | 455 | 422 | 331 | 314 | 339 | 1.02 | 1.08
256 | 815 | 866 | 806 | 351 | 331 | 355 | 1.01 | 1.07
512 | 1581 | 1659 | 1550 | 362 | 345 | 369 | 1.02 | 1.07

Optimization 2: Improve P Type Casting (FP32->BF16)

Problem Analysis:

Since the product of softmax(bmm(Q, KT) / sqrt(D)) is FP32 (denoted as P in Figure 3), the kernel has to convert P from FP32 to BF16 before feeding it to the next bmm computation. The kernel performs the FP32 to BF16 conversion of P by copying the FP32 data from one location in shared memory to another location in shared memory. This causes stalls during the shared memory access (shown in Figure 7), which might be caused by (1) the shared memory indirection; and (2) the shared memory bank conflict, since each thread accesses a 16-bit element (because of this, two threads can access the same memory bank simultaneously).

Figure 7: P type casting before Optimization 2 (the number that the arrow points to is stall cycles caused by shared memory wait)

Solution:

We use all threads in the thread block to do in-place type conversion. Each thread operates on two consecutive elements in order to avoid the shared memory bank conflict when storing BF16. All threads work on the same head (h) at the same time to guarantee correctness of the conversion. The in-place conversion steps are as follows:

  1. Each thread loads 2 FP32 token elements from the same head from the shared memory into registers
  2. Call __syncthreads() to make sure that every thread finishes reading the data
  3. Each thread converts its data to 2 BF16 token elements and then stores the results to the same shared memory

Some optimizations that we apply to the implementation:

  • Use vector types (especially nv_bfloat2)
  • Unroll data loading/storing, i.e., performing multiple loads before calling __syncthreads() and performing multiple stores after __syncthreads()

After this optimization, long stalls are not observed during P type casting as shown in Figure 8.

Figure 8: P type casting after Optimization 2 (the numbers that the arrows point to are stall cycles caused by shared memory wait)

Culprits:

Since we unroll data loading/storing by using registers as intermediate storage, the number of registers per thread increases, resulting in reduced occupancy.

Results:

Table 4 Performance of Optimization 2 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 2 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 2 BW (GB/s) | Speedup (Opt 2 vs FD) | Speedup (Opt 2 vs CU baseline)
32 | 137 | 143 | 126 | 262 | 250 | 285 | 1.09 | 1.14
64 | 234 | 257 | 221 | 305 | 278 | 324 | 1.06 | 1.16
128 | 432 | 455 | 395 | 331 | 314 | 362 | 1.09 | 1.15
256 | 815 | 866 | 749 | 351 | 331 | 382 | 1.09 | 1.16
512 | 1581 | 1659 | 1435 | 362 | 345 | 399 | 1.10 | 1.16

Optimization 3: Remove Local Memory Usage for max QKT computation

Problem Analysis:

During the softmax computation, the kernel has to compute max QKT for each head. It uses a temporary “thread-local” storage for storing per-thread max QKT results (one float value for each head). Depending on the compiler, the thread-local storage can be allocated on registers (on chip) or the local memory (off chip == global memory). Unfortunately, in the baseline, the thread-local storage resides in the local memory which is much slower than the registers (shown in Figure 9). We suspect that this is because the compiler cannot determine the indices of thread-local storage at compile time (since the number of heads (H) in the kernel is a runtime variable). Accessing local memory as if accessing registers can hurt the performance of the kernel.

Figure 9: Local memory access during max QKT computation

Solution:

We realize that we do not need H (number of heads) floats as temporary storage per thread since each thread can compute max QKT for only one head instead of all the heads. Thus, we only need one float per thread, which can be easily stored in a register. To accumulate the max results among warps, we use shared memory. This optimization eliminates the local memory usage during max QKT computation.

Results:

Table 5 Performance of Optimization 3 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 3 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 3 BW (GB/s) | Speedup (Opt 3 vs FD) | Speedup (Opt 3 vs CU baseline)
32 | 137 | 143 | 119 | 262 | 250 | 300 | 1.14 | 1.20
64 | 234 | 257 | 206 | 305 | 278 | 348 | 1.14 | 1.25
128 | 432 | 455 | 368 | 331 | 314 | 389 | 1.17 | 1.24
256 | 815 | 866 | 696 | 351 | 331 | 411 | 1.17 | 1.24
512 | 1581 | 1659 | 1338 | 362 | 345 | 428 | 1.18 | 1.24

Optimization 4: Remove local memory usage for row sum

Problem Analysis:

Similar to Optimization 3, the local memory usage problem is also observed during the row sum computation in the softmax computation. Since local memory is off chip, accessing it as if accessing registers can hurt the performance of the kernel.

Solution:

We apply the same solution as the max QKT computation for the row sum computation. That is to have each thread compute a row sum of only one head, which requires only one float per thread. This eliminates the need for local memory.

Results:

Table 6 Performance of Optimization 4 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 4 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 4 BW (GB/s) | Speedup (Opt 4 vs FD) | Speedup (Opt 4 vs CU baseline)
32 | 137 | 143 | 118 | 262 | 250 | 302 | 1.15 | 1.21
64 | 234 | 257 | 204 | 305 | 278 | 351 | 1.15 | 1.26
128 | 432 | 455 | 364 | 331 | 314 | 393 | 1.19 | 1.25
256 | 815 | 866 | 688 | 351 | 331 | 416 | 1.18 | 1.26
512 | 1581 | 1659 | 1328 | 362 | 345 | 431 | 1.19 | 1.25

Optimization 5: Add prefetch for V load

Problem Analysis:

The same issue as K loading is observed when loading V: the kernel issues data loading and then waits to consume the data immediately, causing the global load latency to be exposed. However, when using the unrolling technique mentioned above, the compiler allocates the temporary buffer in local memory instead of registers, causing a large slowdown.

Solution:

We adopt the data prefetching technique for V loading. We load the next iteration V values immediately after the current iteration values are consumed. This allows the data loading to be overlapped with the PK computation resulting in better kernel performance.

Results:

Table 7 Performance of Optimization 5 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 5 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 5 BW (GB/s) | Speedup (Opt 5 vs FD) | Speedup (Opt 5 vs CU baseline)
32 | 137 | 143 | 109 | 262 | 250 | 327 | 1.25 | 1.31
64 | 234 | 257 | 194 | 305 | 278 | 370 | 1.21 | 1.33
128 | 432 | 455 | 345 | 331 | 314 | 414 | 1.25 | 1.32
256 | 815 | 866 | 649 | 351 | 331 | 441 | 1.26 | 1.33
512 | 1581 | 1659 | 1244 | 362 | 345 | 460 | 1.27 | 1.33

Optimization 6: Add Group-Wise INT4 (Groups = 4) with Vector Load

Problem Analysis:

Prior to this optimization, CU only supported row-wise INT4 quantization. That is, every column in each row shares the same scales. The scales of each row are stored in the first 4 bytes of each row, as shown in Figure 10. In the kernel, each thread loads only one row at a time. Since each row contains 68 bytes (4 bytes for scales and 64 bytes for data), it cannot be guaranteed that every row aligns with the size of any vector type. Thus, vector loads cannot be used for loading the KV cache.

Figure 10: The layout of each row of INT4 KV cache with row-wise quantization

Solution:

We have implemented support for group-wise INT4 quantization with num groups = 4. In this case, columns in each row in the KV cache tensor are divided into 4 equal groups. Columns within the same group share the same scales for quantization/dequantization. The data layout for INT4 KV cache is shown in Figure 11. The scales for all groups are serialized and stored at the beginning of each row. The INT4 data is also serialized and laid out next to the scales.

Because the number of bytes in each row now becomes 80 bytes, we can use a vector type, i.e., uint2 in our case, to load data. (We do not use uint4 since each thread loads only 16 INT4s at a time due to the tensor core fragment size.) Vector load is generally better than scalar load since it does not cause extra byte loads.

Figure 11: The layout of each row of INT4 KV cache with group-wise quantization (num groups = 4)
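
To make the scheme concrete, the following is a hypothetical NumPy sketch of group-wise INT4 quantization and dequantization of a single KV cache row. It uses a simplified symmetric scheme; the actual kernel packs two INT4 values per byte and stores the group scales at the start of each row, as described above:

import numpy as np

def quantize_row_groupwise_int4(row, num_groups=4):
    # Split the row (e.g., D = 128 values) into num_groups equal groups that share a scale.
    groups = np.split(np.asarray(row, dtype=np.float32), num_groups)
    scales, quantized = [], []
    for g in groups:
        scale = max(float(np.abs(g).max()), 1e-8) / 7.0   # map the group onto [-7, 7]
        scales.append(scale)
        quantized.append(np.clip(np.round(g / scale), -8, 7).astype(np.int8))
    return np.array(scales, dtype=np.float32), np.concatenate(quantized)

def dequantize_row_groupwise_int4(scales, q):
    groups = np.split(q, len(scales))
    return np.concatenate([g.astype(np.float32) * s for g, s in zip(groups, scales)])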

Results:

Table 8 Performance of Optimization 6 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 6 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 6 BW (GB/s) | Speedup (Opt 6 vs FD) | Speedup (Opt 6 vs CU baseline)
32 | 137 | 143 | 111 | 262 | 250 | 322 | 1.23 | 1.29
64 | 234 | 257 | 192 | 305 | 278 | 372 | 1.22 | 1.34
128 | 432 | 455 | 346 | 331 | 314 | 414 | 1.25 | 1.32
256 | 815 | 866 | 642 | 351 | 331 | 446 | 1.27 | 1.35
512 | 1581 | 1659 | 1244 | 362 | 345 | 460 | 1.27 | 1.33

Table 9 Performance of Optimization 6 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size | FD time (us) | CUDA_WMMA Opt 6 time (us) | FD BW (GB/s) | CUDA_WMMA Opt 6 BW (GB/s) | Speedup (Opt 6 vs FD)
32 | 129 | 116 | 325 | 364 | 1.31
64 | 219 | 195 | 385 | 431 | 1.36
128 | 392 | 347 | 429 | 484 | 1.39
256 | 719 | 638 | 468 | 527 | 1.41
512 | 1375 | 1225 | 489 | 550 | 1.43

Optimization 7: Compute max QKT From WMMA Fragment Directly (A100/H100 specific)

Problem Analysis:

We observe large stalls due to shared memory access during the max QKT computation (showing up as large short scoreboard stalls), as shown in Figure 12.

Figure 12: Stalls due to shared memory access during max QKT computation (the number that the arrow points to is stall cycles caused by shared memory wait)

Solution:

We bypass shared memory when computing max QKT by computing it from the WMMA fragment (i.e., the tensor core fragment) directly. The layout of the WMMA fragment is specific to the GPU architecture, so we enabled this optimization only for NVIDIA A100/H100 GPUs; other GPUs still use shared memory for the max QKT computation. By bypassing shared memory, we effectively eliminate the stalls caused by shared memory access. The tensor core layout of the C fragment, which is used for storing the QKT results, is shown in Figure 13.

Figure 13: C fragment (QKT storage) tensor core layout on A100/H100

Table 10 Performance of Optimization 7 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 7 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 7 BW (GB/s) | Speedup (Opt 7 vs FD) | Speedup (Opt 7 vs CU baseline)
32 | 137 | 143 | 107 | 262 | 250 | 333 | 1.27 | 1.33
64 | 234 | 257 | 183 | 305 | 278 | 391 | 1.28 | 1.40
128 | 432 | 455 | 333 | 331 | 314 | 430 | 1.30 | 1.37
256 | 815 | 866 | 620 | 351 | 331 | 461 | 1.31 | 1.40
512 | 1581 | 1659 | 1206 | 362 | 345 | 475 | 1.31 | 1.38

Table 11 Performance of Optimization 7 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size | FD time (us) | CUDA_WMMA Opt 6 time (us) | CUDA_WMMA Opt 7 time (us) | FD BW (GB/s) | CUDA_WMMA Opt 6 BW (GB/s) | CUDA_WMMA Opt 7 BW (GB/s) | Speedup (Opt 7 vs FD) | Speedup (Opt 7 vs Opt 6)
32 | 129 | 116 | 111 | 325 | 364 | 380 | 1.17 | 1.04
64 | 219 | 195 | 187 | 385 | 431 | 449 | 1.17 | 1.04
128 | 392 | 347 | 333 | 429 | 484 | 506 | 1.18 | 1.04
256 | 719 | 638 | 615 | 468 | 527 | 547 | 1.17 | 1.04
512 | 1375 | 1225 | 1184 | 489 | 550 | 569 | 1.16 | 1.03

Optimization 8: Write FP32->BF16 Results to P Fragment Directly (A100/H100 specific)

Problem Analysis:

During the FP32-BF16 conversion for the P fragment, the kernel loads the FP32 data from shared memory, does the conversion and then stores the BF16 data back to shared memory. Moreover, the conversion requires many thread block synchronizations (__syncthreads()).

Solution:

Due to the data partitioning design of the kernel, each warp performs only one pass through the P fragment. Thus, we do not have to write the conversion results back to the shared memory for future usage. To avoid writing the BF16 data to the shared memory and thread block synchronizations, we have each warp load the FP32 data of the P WMMA fragment from the shared memory, do the conversion and then write the BF16 data directly to the P fragment.

Note that this optimization is applied only to the NVIDIA A100 and H100 GPUs because the WMMA fragment layout is architecture dependent. For non-A100/H100 GPUs, the kernel will fall back to the original path.

The P fragment tensor core layout is shown in Figure 14. Note that this layout is specific to the NVIDIA A100/H100 GPU.

Figure 14: P fragment tensor core layout on A100/H100

Table 12 Performance of Optimization 8 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 8 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 8 BW (GB/s) | Speedup (Opt 8 vs FD) | Speedup (Opt 8 vs CU baseline)
32 | 137 | 143 | 101 | 262 | 250 | 353 | 1.35 | 1.41
64 | 234 | 257 | 174 | 305 | 278 | 410 | 1.34 | 1.47
128 | 432 | 455 | 317 | 331 | 314 | 451 | 1.36 | 1.43
256 | 815 | 866 | 590 | 351 | 331 | 485 | 1.38 | 1.47
512 | 1581 | 1659 | 1143 | 362 | 345 | 501 | 1.38 | 1.45

Table 13 Performance of Optimization 8 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size | FD time (us) | CUDA_WMMA Opt 6 time (us) | CUDA_WMMA Opt 8 time (us) | FD BW (GB/s) | CUDA_WMMA Opt 6 BW (GB/s) | CUDA_WMMA Opt 8 BW (GB/s) | Speedup (Opt 8 vs FD) | Speedup (Opt 8 vs Opt 6)
32 | 129 | 116 | 106 | 325 | 364 | 396 | 1.22 | 1.09
64 | 219 | 195 | 180 | 385 | 431 | 467 | 1.21 | 1.08
128 | 392 | 347 | 319 | 429 | 484 | 528 | 1.23 | 1.09
256 | 719 | 638 | 596 | 468 | 527 | 565 | 1.21 | 1.07
512 | 1375 | 1225 | 1138 | 489 | 550 | 591 | 1.21 | 1.08

Optimization 9: Swizzle P Shared Memory Layouts (A100/H100 specific)

Problem Analysis:

We observe large shared memory bank conflicts during P loading. The amount of bank conflict depends on the memory access stride. For instance, for split-Ks = 32 and max seq length = 8192, we observed that only 4 out of 32 banks are being accessed in parallel (memory access stride = 256). From Figure 14, when all threads access element 0, threads that have the same threadIdx.x % 4 access the same bank.

Figure 15: P fragment in shared memory before swizzling

Solution:

We shuffle the layout of P load/store in the shared memory in such a way that avoids bank conflicts. In other words, we store the QKT results (C fragment) and load them (P fragment) using the swizzled layout. Moreover, instead of using the original memory access stride which is dependent on the number of tokens per thread block, we use the fragment’s column size as the stride which is constant. Thus, the load and store of the P fragment is always contiguous.

The new layouts for the C and P fragments are shown in Figure 16. With the new layout, it is guaranteed that 16 banks are being accessed in parallel as shown in Figure 17.

Figure 16: The swizzled layouts of C and P fragments

Figure 17: P fragment in shared memory after swizzling

Table 14 Performance of Optimization 9 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 9 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 9 BW (GB/s) | Speedup (Opt 9 vs FD) | Speedup (Opt 9 vs CU baseline)
32 | 137 | 143 | 98 | 262 | 250 | 365 | 1.39 | 1.46
64 | 234 | 257 | 167 | 305 | 278 | 429 | 1.41 | 1.54
128 | 432 | 455 | 299 | 331 | 314 | 479 | 1.45 | 1.52
256 | 815 | 866 | 549 | 351 | 331 | 521 | 1.48 | 1.58
512 | 1581 | 1659 | 1060 | 362 | 345 | 540 | 1.49 | 1.56

Table 15 Performance of Optimization 9 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size | FD time (us) | CUDA_WMMA Opt 6 time (us) | CUDA_WMMA Opt 9 time (us) | FD BW (GB/s) | CUDA_WMMA Opt 6 BW (GB/s) | CUDA_WMMA Opt 9 BW (GB/s) | Speedup (Opt 9 vs FD) | Speedup (Opt 9 vs Opt 6)
32 | 129 | 116 | 105 | 325 | 364 | 400 | 1.23 | 1.10
64 | 219 | 195 | 174 | 385 | 431 | 484 | 1.26 | 1.12
128 | 392 | 347 | 302 | 429 | 484 | 558 | 1.30 | 1.15
256 | 719 | 638 | 560 | 468 | 527 | 601 | 1.28 | 1.14
512 | 1375 | 1225 | 1065 | 489 | 550 | 632 | 1.29 | 1.15

Optimization 10: Pad Shared Memory for INT4 Dequantization

Problem Analysis:

Once the kernel reads the INT4 K or V cache from global memory, it performs dequantization and stores the results (BF16) in the shared memory. Then, the BF16 data is loaded to the WMMA fragment from shared memory (via the WMMA interface). We observed a large number of bank conflicts for both K and V accesses. For instance, for K stores, only 4 out of 32 banks are being accessed in parallel. For K loads, 16 banks are being accessed in parallel. The same also occurs for V stores and loads. See the figures in the solution section.

Solution:

We pad the shared memory to reduce the bank conflict. Specifically, we pad each row by 2. That is, the row stride of K becomes F_K + 2 and the row stride of V becomes F_N + 2 (F_K and F_N are the fixed widths of the K and V WMMA fragments, respectively). With this optimization, we are able to reduce the bank conflict by 1.8x as shown in Figure 18.

Figure 18: Bank conflicts before and after Optimization 10

After Optimization 10, for K stores, 32 banks are being accessed in parallel (shown in Figure 19), while for K loads, 29 banks are accessed in parallel (shown in Figure 20).

Figure 19: K fragment store shared memory layout without and with padding

Figure 20: K fragment load shared memory layout without and with padding

Table 16 Performance of Optimization 10 for INT4 GQA (row-wise quantization)

Batch size | FD time (us) | CU baseline time (us) | CU Opt 10 time (us) | FD BW (GB/s) | CU baseline BW (GB/s) | CU Opt 10 BW (GB/s) | Speedup (Opt 10 vs FD) | Speedup (Opt 10 vs CU baseline)
32 | 137 | 143 | 94 | 262 | 250 | 380 | 1.45 | 1.52
64 | 234 | 257 | 151 | 305 | 278 | 475 | 1.55 | 1.71
128 | 432 | 455 | 266 | 331 | 314 | 538 | 1.63 | 1.71
256 | 815 | 866 | 489 | 351 | 331 | 586 | 1.67 | 1.77
512 | 1581 | 1659 | 930 | 362 | 345 | 616 | 1.70 | 1.79

Table 17 Performance of Optimization 10 for INT4 GQA (group-wise quantization with num groups = 4)

Batch size | FD time (us) | CUDA_WMMA Opt 6 time (us) | CUDA_WMMA Opt 10 time (us) | FD BW (GB/s) | CUDA_WMMA Opt 6 BW (GB/s) | CUDA_WMMA Opt 10 BW (GB/s) | Speedup (Opt 10 vs FD) | Speedup (Opt 10 vs Opt 6)
32 | 129 | 116 | 99 | 325 | 364 | 425 | 1.31 | 1.17
64 | 219 | 195 | 161 | 385 | 431 | 523 | 1.36 | 1.21
128 | 392 | 347 | 282 | 429 | 484 | 598 | 1.39 | 1.23
256 | 719 | 638 | 509 | 468 | 527 | 662 | 1.41 | 1.25
512 | 1375 | 1225 | 965 | 489 | 550 | 698 | 1.43 | 1.27

Performance Evaluation

Microbenchmark results

We also evaluated BF16 GQA performance using our optimized kernel (as shown in Table 19). CU still performs generally worse than FD and FA for BF16. This is expected since our optimizations are INT4 focused.

While INT4 GQA is still not as efficient as BF16 GQA (see the achieved bandwidths), it is important to note that when comparing FD BF16 GQA performance against CU INT4 GQA performance, we can see that the latency of INT4 is smaller than that of BF16.

Table 19 Performance of BF16 GQA and INT4 GQA after CU optimizations

On A100

Time (us)

Batch size | FD (BF16) | FA (BF16) | CU before (BF16) | CU after (BF16) | FD (INT4) | CU before (INT4) | CU after (INT4)
32 | 139 | 133 | 183 | 163 | 137 | 143 | 94
64 | 245 | 229 | 335 | 276 | 234 | 257 | 151
128 | 433 | 555 | 596 | 517 | 432 | 455 | 266
256 | 826 | 977 | 1127 | 999 | 815 | 866 | 489
512 | 1607 | 1670 | 2194 | 1879 | 1581 | 1659 | 930

Effective Bandwidth (GB/s)

Batch size | FD (BF16) | FA (BF16) | CU before (BF16) | CU after (BF16) | FD (INT4) | CU before (INT4) | CU after (INT4)
32 | 965 | 1012 | 736 | 824 | 262 | 250 | 380
64 | 1097 | 1175 | 802 | 972 | 305 | 278 | 475
128 | 1240 | 968 | 901 | 1039 | 331 | 314 | 538
256 | 1301 | 1100 | 954 | 1075 | 351 | 331 | 586
512 | 1338 | 1287 | 980 | 1144 | 362 | 345 | 616

On H100

Time (us)

Batch size | FD (BF16) | FA (BF16) | CU before (BF16) | CU after (BF16) | FD (INT4) | CU before (INT4) | CU after (INT4)
32 | 91 | 90 | 114 | 100 | 70 | 96 | 64
64 | 148 | 146 | 200 | 183 | 113 | 162 | 101
128 | 271 | 298 | 361 | 308 | 205 | 294 | 170
256 | 515 | 499 | 658 | 556 | 389 | 558 | 306
512 | 1000 | 1011 | 1260 | 1066 | 756 | 1066 | 575

Effective Bandwidth (GB/s)

Batch size | FD (BF16) | FA (BF16) | CU before (BF16) | CU after (BF16) | FD (INT4) | CU before (INT4) | CU after (INT4)
32 | 1481 | 1496 | 1178 | 1341 | 511 | 371 | 560
64 | 1815 | 1840 | 1345 | 1470 | 631 | 443 | 710
128 | 1982 | 1802 | 1487 | 1743 | 699 | 487 | 844
256 | 2087 | 2156 | 1634 | 1934 | 736 | 513 | 935
512 | 2150 | 2127 | 1706 | 2015 | 757 | 537 | 996

E2E results

We evaluated our optimized INT4 GQA kernel in Llama 2 70B on 8 H100 GPUs. We ran the model end-to-end, but only reported the decode latency. We use FP8 FFN (feed forward network) to emphasize the attention performance in the decoding phase. We vary the batch size from 1 to 256 and the context length from 2,048 (2K) to 16,384 (16K). The E2E performance results are shown in the figure below.

Figure 21: Meta Llama 2 decode latency (ms) comparison (BF16 GQA runs out of memory in large batch size configurations)

Code

If you are interested, please check out our code here. If you have any questions, please feel free to open an issue on GitHub, and we will be happy to help. Your contributions are welcome!


Detect email phishing attempts using Amazon Comprehend

Phishing is the process of attempting to acquire sensitive information such as usernames, passwords, and credit card details by masquerading as a trustworthy entity using email, telephone, or text messages. There are many types of phishing based on the mode of communication and the targeted victims. In an email phishing attempt, an email is sent to a group of people as the mode of communication. There are traditional rule-based approaches to detect email phishing. However, new trends are emerging that are hard to handle with a rule-based approach, so there is a need to use machine learning (ML) techniques to augment rule-based approaches for email phishing detection.

In this post, we show how to use Amazon Comprehend Custom to train and host an ML model that classifies whether an input email is a phishing attempt or not. Amazon Comprehend is a natural-language processing (NLP) service that uses ML to uncover valuable insights and connections in text. You can use Amazon Comprehend to identify the language of the text; extract key phrases, places, people, brands, or events; understand sentiment about products or services; and identify the main topics from a library of documents. You can customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions. Comprehend Custom builds customized NLP models on your behalf, using training data that you provide. Comprehend Custom supports custom classification and custom entity recognition.

Solution overview

This post explains how you can use Amazon Comprehend to easily train and host an ML-based model to detect phishing attempts. The following diagram shows how the phishing detection works.

Solution Overview

You can use this solution with your email servers in which emails are passed through this phishing detector. When an email is flagged as a phishing attempt, the email recipient still gets the email in their mailbox, but they can be shown an additional banner highlighting a warning to the user.

You can use this solution for experimentation with the use case, but AWS recommends building a training pipeline for your environments. For details on how to build a classification pipeline with Amazon Comprehend, see Build a classification pipeline with Amazon Comprehend custom classification.

We walk through the following steps to build the phishing detection model:

  1. Collect and prepare the dataset.
  2. Load the data in an Amazon Simple Storage Service (Amazon S3) bucket.
  3. Create the Amazon Comprehend custom classification model.
  4. Create the Amazon Comprehend custom classification model endpoint.
  5. Test the model.

Prerequisites

Before diving into this use case, complete the following prerequisites:

  1. Set up an AWS account.
  2. Create an S3 bucket. For instructions, see Create your first S3 bucket.
  3. Download the email-trainingdata.csv and upload the file to the S3 bucket.

Collect and prepare the dataset

Your training data should have both phishing and non-phishing emails. Email users within the organization are asked to report phishing through their email clients. Gather all these phishing reports and examples of non-phishing emails to prepare the training data. You should have a minimum of 10 examples per class. Label phishing emails as phishing and non-phishing emails as nonphishing. For minimum training requirements, see General quotas for document classification. Although the minimum number of labels per class is a starting point, it’s recommended to provide hundreds of labels per class for better performance on classifying new inputs.

For custom classification, you train the model in either single-label mode or multi-label mode. Single-label mode associates a single class with each document. Multi-label mode associates one or more classes with each document. For this case, we use single-label mode: phishing or nonphishing. The individual classes are mutually exclusive. For example, you can classify an email as phishing or nonphishing, but not both.

Custom classification supports models that you train with plain-text documents and models that you train with native documents (such as PDF, Word, or images). For more information about classifier models and their supported document types, see Training classification models. For a plain-text model, you can provide classifier training data as a CSV file or as an augmented manifest file that you create using Amazon SageMaker Ground Truth. The CSV file or augmented manifest file includes the text for each training document and its associated labels. For a native document model, you provide classifier training data as a CSV file. The CSV file includes the file name for each training document and its associated labels. You include the training documents in the S3 input folder for the training job.

For this use case, we train a plain-text model using the CSV file format. In each row, the first column contains the class label and the second column contains an example text document for that class. Each row must end with \n or \r\n characters.

The following example shows a CSV file containing two documents.

CLASS,Text of document 1

CLASS,Text of document 2

The following example shows two rows of a CSV file that trains a custom classifier to detect whether an email message is phishing:

phishing,"Hi, we need account details and SSN information to complete the payment. Please furnish your credit card details in the attached form."

nonphishing,"Dear Sir / Madam, your latest statement was mailed to your communication address. After your payment is received, you will receive a confirmation text message at your mobile number. Thanks, customer support"

For information about preparing your training documents, see Preparing classifier training data.

Load the data in the S3 bucket

Load the training data in CSV format to the S3 bucket you created in the prerequisite steps. For instructions, refer to Uploading objects.

Load Data to S3
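If you prefer to script this step, the following minimal sketch uploads the training file with boto3. The bucket name and file name are placeholder values for your own resources.

import boto3

# Placeholder values - replace with your own bucket and file names
bucket_name = "phishing-detection-training-data"
local_file = "email-trainingdata.csv"

s3 = boto3.client("s3")

# Upload the CSV training file to the S3 bucket created in the prerequisites
s3.upload_file(Filename=local_file, Bucket=bucket_name, Key=local_file)
print(f"Uploaded {local_file} to s3://{bucket_name}/{local_file}")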

Create the Amazon Comprehend custom classification model

Custom classification supports two types of classifier models: plain-text models and native document models. A plain-text model classifies documents based on their text content. You can train the plain-text model using documents in one of the following languages: English, Spanish, German, Italian, French, or Portuguese. The training documents for a given classifier must all use the same language. A native document model processes scanned or digital semi-structured documents, such as PDFs, Microsoft Word documents, and images, in their native format. A native document model also classifies documents based on text content, but it can use additional signals, such as the layout of the document. You train a native document model with native documents so the model learns the layout information. Supported training document types include digital and scanned PDF documents, Word documents, images such as JPG files, PNG files, and single-page TIFF files, and Amazon Textract API output JSON files. AWS recommends using a plain-text model to classify plain-text documents and a native document model to classify semi-structured documents.

Data specification for the custom classification model can be represented as follows.

Data Specification

You can train a custom classifier using either the Amazon Comprehend console or API. Allow several minutes to a few hours for the classification model creation to complete. The length of time varies based on the size of your input documents.
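As a rough sketch of the API path, the following code starts a training job with boto3 in single-label (multi-class) mode to match this use case. The classifier name, IAM role ARN, and S3 URIs are placeholder values.

import boto3

comprehend = boto3.client("comprehend")

# Placeholder values - replace with your own resources
response = comprehend.create_document_classifier(
    DocumentClassifierName="phishing-email-classifier",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
    InputDataConfig={"S3Uri": "s3://phishing-detection-training-data/email-trainingdata.csv"},
    OutputDataConfig={"S3Uri": "s3://phishing-detection-training-data/output/"},
    LanguageCode="en",
    Mode="MULTI_CLASS",  # single-label mode: each email is phishing or nonphishing
)

classifier_arn = response["DocumentClassifierArn"]
print(f"Training started: {classifier_arn}")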

To train a custom classifier on the Amazon Comprehend console, set the following data specification options.

Train Model Data Input

Training Data Output

On the Classifiers page of the Amazon Comprehend console, the new classifier appears in the table, showing Submitted as its status. When the classifier starts processing the training documents, the status changes to Training. When a classifier is ready to use, the status changes to Trained or Trained with warnings. If the status is Trained with warnings, review the skipped files folder in the classifier training output.

Model Version

If Amazon Comprehend encountered errors during creation or training, the status changes to In error. You can choose a classifier job in the table to get more information about the classifier, including any error messages.

After training the model, Amazon Comprehend tests the custom classifier model. If you don’t provide a test dataset, Amazon Comprehend trains the model with 90% of the training data. It reserves 10% of the training data to use for testing. If you do provide a test dataset, the test data must include at least one example for each unique label in the training dataset.

After Amazon Comprehend completes the custom classifier model training, it creates output files in the Amazon S3 output location that you specified in the CreateDocumentClassifier API request or the equivalent Amazon Comprehend console request. These output files are a confusion matrix and additional outputs for native document models. The format of the confusion matrix varies, depending on whether you trained your classifier using multi-class mode or multi-label mode.

After Amazon Comprehend creates the classifier model, the confusion matrix is available in the confusion_matrix.json file in the Amazon S3 output location. The confusion matrix provides metrics on how well the model performed in training by comparing the labels the model predicted against the actual document labels. Amazon Comprehend uses a portion of the training data to create the confusion matrix. The following example shows the contents of confusion_matrix.json.

Confusion Matrix

Amazon Comprehend provides metrics to help you estimate how well a custom classifier performs. Amazon Comprehend calculates the metrics using the test data from the classifier training job. The metrics accurately represent the performance of the model during training, so they approximate the model performance for classification of similar data.

Use the Amazon Comprehend console or API operations such as DescribeDocumentClassifier to retrieve the metrics for a custom classifier.

Model Version Performance
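If you prefer to retrieve the metrics programmatically, a minimal sketch with boto3 might look like the following; the classifier ARN is a placeholder.

import boto3

comprehend = boto3.client("comprehend")

# Placeholder ARN - use the ARN returned when you created the classifier
classifier_arn = "arn:aws:comprehend:us-east-1:111122223333:document-classifier/phishing-email-classifier"

props = comprehend.describe_document_classifier(
    DocumentClassifierArn=classifier_arn
)["DocumentClassifierProperties"]

print("Status:", props["Status"])
if props["Status"] == "TRAINED":
    metrics = props["ClassifierMetadata"]["EvaluationMetrics"]
    print("Accuracy:", metrics["Accuracy"])
    print("Precision:", metrics["Precision"])
    print("Recall:", metrics["Recall"])
    print("F1 score:", metrics["F1Score"])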

The actual output of many binary classification algorithms is a prediction score. The score indicates the system’s certainty that the given observation belongs to the positive class. To make the decision about whether the observation should be classified as positive or negative, as a consumer of this score, you interpret the score by picking a classification threshold and comparing the score against it. Any observations with scores higher than the threshold are predicted as the positive class, and scores lower than the threshold are predicted as the negative class.

Prediction Score

Create the Amazon Comprehend custom classification model endpoint

After you train a custom classifier, you can classify documents using Real-time analysis or an analysis job. Real-time analysis takes a single document as input and returns the results synchronously. An analysis job is an asynchronous job to analyze large documents or multiple documents in one batch. The following are the different options for using the custom classifier model.

Custom Classification Inference Types

Create an endpoint for the trained model. For instructions, refer to Real-time analysis for custom classification (console). Amazon Comprehend assigns throughput to an endpoint using inference units (IUs). An IU represents data throughput of 100 characters per second. You can provision the endpoint with up to 10 IUs, and you can scale the endpoint throughput up or down by updating the endpoint. Endpoints are billed in 1-second increments, with a minimum of 60 seconds. Charges continue to accrue from the time you start the endpoint until it is deleted, even if no documents are analyzed.

Create Model Endpoint
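The same step can be scripted. The following sketch creates an endpoint with 1 IU using boto3; the endpoint name and model ARN are placeholder values.

import boto3

comprehend = boto3.client("comprehend")

# Placeholder values - replace with your own endpoint name and classifier ARN
response = comprehend.create_endpoint(
    EndpointName="phishing-email-classifier-endpoint",
    ModelArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/phishing-email-classifier",
    DesiredInferenceUnits=1,  # 1 IU = 100 characters per second of throughput
)

endpoint_arn = response["EndpointArn"]
print(f"Endpoint creation started: {endpoint_arn}")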

Test the model

After the endpoint is ready, you can run the real-time analysis from the Amazon Comprehend console.

Real Time Endpoint

The sample input represents the email text used for real-time analysis to detect whether it is a phishing attempt.

Model Inference Input

Amazon Comprehend analyzes the input data using the custom model and displays the discovered classes, along with a confidence score for each class. The insights section shows the inference results with confidence levels for the nonphishing and phishing classes. You decide the threshold that determines the class of the inference. In this case, nonphishing is the result because its confidence score is higher than that of the phishing class, so the model detects the input email text as a non-phishing email.

Model Inference Output
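For a programmatic test, the following sketch calls the endpoint with boto3 and applies a threshold of your choosing to the phishing score; the endpoint ARN, sample text, and threshold value are placeholders.

import boto3

comprehend = boto3.client("comprehend")

# Placeholder values
endpoint_arn = "arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/phishing-email-classifier-endpoint"
email_text = "Hi, we need account details and SSN information to complete the payment."

response = comprehend.classify_document(Text=email_text, EndpointArn=endpoint_arn)

# Each class comes back with a confidence score between 0 and 1
for cls in response["Classes"]:
    print(cls["Name"], cls["Score"])

# Apply a threshold instead of simply taking the top class
phishing_score = next((c["Score"] for c in response["Classes"] if c["Name"] == "phishing"), 0.0)
threshold = 0.5  # example threshold; tune it for your precision/recall needs
print("phishing" if phishing_score >= threshold else "nonphishing")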

To integrate this phishing detection capability into your real-world applications, you can use an Amazon API Gateway REST API with an AWS Lambda integration. Refer to the serverless pattern Amazon API Gateway to AWS Lambda to Amazon Comprehend to learn more. A sketch of that pattern follows.
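As a hedged illustration of that pattern, a Lambda function behind API Gateway could forward the email text to the Comprehend endpoint along the following lines. The environment variable name and request/response shapes are assumptions for this sketch, not part of the published pattern.

import json
import os

import boto3

comprehend = boto3.client("comprehend")

# Hypothetical environment variable holding the custom classifier endpoint ARN
ENDPOINT_ARN = os.environ["COMPREHEND_ENDPOINT_ARN"]


def lambda_handler(event, context):
    # API Gateway proxy integration delivers the request body as a JSON string
    body = json.loads(event.get("body") or "{}")
    email_text = body.get("text", "")

    result = comprehend.classify_document(Text=email_text, EndpointArn=ENDPOINT_ARN)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"classes": result["Classes"]}),
    }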

Clean up

When you no longer need your endpoint, you should delete it so that you stop incurring costs from it. Also, delete the training data file from the S3 bucket. For more information on costs, see Amazon Comprehend Pricing.

Model endpoint cleanup
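A minimal cleanup sketch with boto3 follows; the ARNs, bucket, and key are placeholder values.

import boto3

comprehend = boto3.client("comprehend")
s3 = boto3.client("s3")

# Delete the real-time endpoint to stop incurring charges
comprehend.delete_endpoint(
    EndpointArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/phishing-email-classifier-endpoint"
)

# Optionally delete the classifier model as well once all of its endpoints are gone
# comprehend.delete_document_classifier(
#     DocumentClassifierArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/phishing-email-classifier"
# )

# Remove the training data file from the S3 bucket
s3.delete_object(Bucket="phishing-detection-training-data", Key="email-trainingdata.csv")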

Conclusion

In this post, we walked you through the steps to create a phishing attempt detector using Amazon Comprehend custom classification. You can customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions.

You can also visit the Amazon Comprehend Developer Guide, GitHub repository and Amazon Comprehend developer resources for videos, tutorials, blogs, and more.


About the author

Ajeet Tewari is a Solutions Architect for Amazon Web Services. He works with enterprise customers to help them navigate their journey to AWS. His specialties include architecting and implementing highly scalable OLTP systems and leading strategic AWS initiatives.

Read More

Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more

Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more

In the latest episode of Microsoft Research Forum, researchers explored the importance of globally inclusive and equitable AI, shared updates on AutoGen and MatterGen, presented novel use cases for AI, including industrial applications and the potential of multimodal models to improve assistive technologies. 

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon. 

Keynote: Building Globally Equitable AI

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi at Microsoft Research Forum Episode 3

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi 

Jacki O’Neill discussed the importance of creating globally equitable generative AI. She addressed the technical and sociotechnical challenges that must be tackled to positively transform the future of work worldwide.

“We’re at the very early stage of generative AI and the impacts it will have on work. This is a fast-moving field, and there’s an immense opportunity to take control of the agenda and build truly globally equitable AI systems. This requires ensuring that diverse contexts and applications, with their diverse datasets, drive the development of generative AI.”

Panel discussion: Generative AI for Global Impact: Challenges and Opportunities

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi (host)
Sunayana Sitaram, Principal Researcher, Microsoft Research India
Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge
Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

Microsoft researchers discussed the challenges and opportunities of making AI more inclusive and impactful for everyone—from data that represents a broader range of communities and cultures to novel use cases for AI that are globally relevant.

“How can we take this power of generative AI and empower every individual, every individual across the globe—the people who are coming from different nationalities, different ethnicities, cultures, as well as with varied technology access and financial affordability?”

—Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

“One of the solutions that we’ve been using is to actually design with ‘human in the loop’ in mind because we know that these technologies are not perfect. And so, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome.”

—Sunayana Sitaram, Principal Researcher, Microsoft Research India

“We really need multidisciplinary research that goes beyond anything that we’ve done before, involving researchers and practitioners and community members. And it’s important to remember that machine learning engineers and researchers on their own can’t solve the problem of building globally equitable generative AI. This is something that we really need to do in a large scale.”

—Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi 

“An estimated 1.3 billion people—around 16 percent of the global population—live with some level of disability today. So, I think it’s really exciting to see these generative AI applications coming online for these communities.” 

“As we look to this next decade of generative AI solutions, I really hope to see that we’re going to see more personalized AI models and solutions come through much more strongly, solutions where you as the user have much more control, much more agency, around how your model works for you.” 

—Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Lightning talk: Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP

Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge at Research Forum Episode 3

Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Daniela Massiceti explored the transformative potential of multimodal models such as CLIP for assistive technologies. Specifically focusing on the blind/low-vision community, the talk explored the current distance from realizing this potential and the advancements needed to bridge this gap.

“Today’s AI models hold incredible potential for assisting the Blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more.”

Lightning talk: Driving Industry Evolution: Exploring the Impact of Generative AI on Sector Transformation

Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia, at Research Forum Episode 3

Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia

Jiang Bian discussed how generative AI transforms industries by bridging gaps between AI capabilities and industrial needs.

“In our dialogues with strategic partners, we have identified crucial gaps in current generative AI capabilities versus the specific needs of industry applications. These include a too-narrow focus on human-like AI but not critical industry applications, limitations in processing complex and noisy data, and concerns about reliability in complex decision-making scenarios. Our research is crucial in addressing these limitations and amplifying the underappreciated potential of generative AI in high-value sectors.” 

Lightning talk: MatterGen: A Generative Model for Materials Design

Tian Xie, Principal Research Manager, Microsoft Research, at Research Forum Episode 3

Tian Xie, Principal Research Manager, Microsoft Research

Tian Xie described MatterGen, a generative model that enables the design of new inorganic materials based on a broad range of property conditions required by the application, aiming to shift the traditional paradigm of materials design with generative AI.

“Traditionally, materials design is conducted by search-based methods. We search through a list of candidates and gradually filter them using a list of design criteria for the application. Like for batteries, we need the materials to contain lithium, to be stable, to have a high lithium-ion conductivity, and each filtering step can be conducted using simulation-based methods or AI emulators. At the end, we get five to 10 candidates that we’re sending to the lab for experimental synthesis.” 

“In MatterGen, we hope to rethink this process with generative AI. We’re aiming to directly generate materials given the design requirements for the target application, bypassing the process of searching through candidates. You can think of it as using text-to-image generative models like DALL-E to generate the images given a prompt rather than needing to search through the entire internet for images via a search engine.” 

Lightning talk: AutoGen Update: Complex Tasks and Agents

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers, at Research Forum Episode 3

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers 

Adam Fourney discussed the effectiveness of using multiple agents, working together, to complete complex multi-step tasks. He showcased their capability to outperform previous single-agent solutions on benchmarks like GAIA, utilizing customizable arrangements of agents that collaborate, reason, and utilize tools to achieve complex outcomes.

“We’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.”

The post Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more appeared first on Microsoft Research.

Read More

Microsoft at FAccT 2024: Advancing responsible AI research and practice

Microsoft at FAccT 2024: Advancing responsible AI research and practice

Microsoft at ACM FAccT 2024

The integration of AI and other computational technologies is becoming increasingly common in high-stakes sectors such as finance, healthcare, and government, where their capacity to influence critical decisions is growing. While these systems offer numerous benefits, they also introduce risks, such as entrenching systemic biases and reducing accountability. The ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024) tackles these issues, bringing together experts from a wide range of disciplines who are committed to the responsible development of computational systems.

Microsoft is proud to return as a sponsor of ACM FAccT 2024, underscoring our commitment to supporting research on responsible AI. We’re pleased to share that members of our team have taken on key roles in organizing the event, contributing to the program committee and serving as a program co-chair. Additionally, seven papers by Microsoft researchers and their collaborators have been accepted to the program, with “Akal badi ya bias: An exploratory study of gender bias in Hindi language technology,” receiving an award for Best Paper. 

Collectively, these research projects emphasize the need for AI technologies that reflect the Microsoft Responsible AI principles of accountability, inclusiveness, reliability and safety, fairness, transparency, and privacy and security. They underscore the importance of addressing potential risks and harms associated with deployment and usage. This post highlights these advances.

Paper highlights

A framework for exploring the consequences of AI-mediated enterprise knowledge access and identifying risks to workers

Anna Gausen, Bhaskar Mitra, Siân Lindley

Recent AI developments, especially LLMs, are significantly impacting organizational knowledge access and reshaping workplaces. These AI systems pose risks due to their interaction with organizational power dynamics. This paper introduces the Consequence-Mechanism-Risk framework to help identify worker risks, categorizing them into issues related to value, power, and wellbeing. The framework aims to help practitioners mitigate these risks and apply it to other technologies, enabling better protection for workers.

A structured regression approach for evaluating model performance across intersectional subgroups

Christine Herlihy, Kimberly Truong, Alex Chouldechova, Miro Dudík

Disaggregated evaluation is a process used in AI fairness assessment that measures AI system performance across different subgroups. These subgroups are defined by a mix of demographic or other sensitive attributes. However, the sample size for intersectional subgroups is often very small, leading to their exclusion from analysis. This work introduces a structured regression approach for more reliable system performance estimates in these subgroups. Tested on two publicly available datasets and several variants of semi-synthetic data, this method not only yielded more accurate results but also helped to identify key factors driving performance differences. 

Akal badi ya bias: An exploratory study of gender bias in Hindi language technology

Best Paper Award

Rishav Hada, Safiya Husain, Varun Gumma, Harshita Diddee, Aditya Yadavalli, Agrima Seth, Nidhi Kulkarni, Ujwal Gadiraju, Aditya Vashistha, Vivek Seshadri, Kalika Bali

Existing research on gender bias in language technologies primarily focuses on English, often overlooking non-English languages. This paper introduces the first comprehensive study on gender bias in Hindi, the third most spoken language globally. Employing diverse techniques and field studies, the authors expose the limitations in current methodologies and emphasize the need for more context-specific and community-centered research. The findings deepen the understanding of gender bias in language technologies in Hindi and lay the groundwork for expanded research into other Indic languages.

“I’m not sure, but…”: Examining the impact of large language models’ uncertainty expression on user reliance and trust

Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, Jennifer Wortman Vaughan

LLMs can produce convincing yet incorrect responses, potentially misleading users who rely on them for accuracy. To mitigate this issue, there have been recommendations for LLMs to communicate uncertainty in their responses. In a large-scale study on how users perceive and act on LLMs’ expressions of uncertainty, participants were asked medical questions. The authors found that first-person uncertainty expressions (e.g., “I’m not sure, but…”) decreased participants’ confidence in the system and their tendency to agree with the system’s answers, while increasing the accuracy of their own answers. In contrast, more general uncertainty expressions (e.g., “It’s unclear, but…”) were less effective. The findings stress the importance of more thorough user testing before deploying LLMs.

Investigating and designing for trust in AI-powered code generation tools

Ruotong Wang, Ruijia Cheng, Denae Ford, Tom Zimmermann

As tools like GitHub Copilot gain popularity, understanding the trust software developers place in these applications becomes crucial for their adoption and responsible use. In a two-stage qualitative study, the authors interviewed 17 developers to understand the challenges they face in building trust in AI code-generation tools. Challenges identified include setting expectations, configuring tools, and validating suggestions. The authors also explore several design concepts to help developers establish appropriate trust and provide design recommendations for AI-powered code-generation tools.

Less discriminatory algorithms

Emily Black, Logan Koepke, Pauline Kim, Solon Barocas, Mingwei Hsu

In fields such as housing, employment, and credit, organizations using algorithmic systems should seek to use less discriminatory alternatives. Research in computer science has shown that for any prediction problem, multiple algorithms can deliver the same level of accuracy but differ in their impacts across demographic groups. This phenomenon, known as model multiplicity, suggests that developers might be able to find an equally performant yet potentially less discriminatory alternative.

Participation in the age of foundation models

Harini Suresh, Emily Tseng, Meg Young, Mary Gray, Emma Pierson, Karen Levy

The rise of foundation models in public services brings both potential benefits and risks, including reinforcing power imbalances and harming marginalized groups. This paper explores how participatory AI/ML methods, typically context-specific, can be adapted to these context-agnostic models to empower those most affected.

Conference organizers from Microsoft

Program Co-Chair

Alexandra Olteanu 

Program Committee

Steph Ballard 
Solon Barocas 
Su Lin Blodgett*
Kate Crawford 
Shipi Dhanorkar 
Amy Heger
Jake Hofman*
Emre Kiciman*
Vera Liao*
Daniela Massiceti 
Bhaskar Mitra 
Besmira Nushi*
Alexandra Olteanu 
Amifa Raj
Emily Sheng 
Jennifer Wortman Vaughan*
Mihaela Vorvoreanu*
Daricia Wilkinson

*Area Chairs

Career opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Research, and other departments. We are always pushing the boundaries of computer systems to improve the scale, efficiency, and security of all our offerings. You can review our open research-related positions here.

The post Microsoft at FAccT 2024: Advancing responsible AI research and practice appeared first on Microsoft Research.

Read More

How Skyflow creates technical content in days using Amazon Bedrock

How Skyflow creates technical content in days using Amazon Bedrock

This guest post is co-written with Manny Silva, Head of Documentation at Skyflow, Inc.

Startups move quickly, and engineering is often prioritized over documentation. Unfortunately, this prioritization leads to release cycles that don’t match, where features release but documentation lags behind. This leads to increased support calls and unhappy customers.

Skyflow is a data privacy vault provider that makes it effortless to secure sensitive data and enforce privacy policies. Skyflow experienced this growth and documentation challenge in early 2023 as it expanded globally from 8 to 22 AWS Regions, including China and other areas of the world such as Saudi Arabia, Uzbekistan, and Kazakhstan. The documentation team, consisting of only two people, found itself overwhelmed as the engineering team, with over 60 people, updated the product to support the scale and rapid feature release cycles.

Given the critical nature of Skyflow’s role as a data privacy company, the stakes were particularly high. Customers entrust Skyflow with their data and expect Skyflow to manage it both securely and accurately. The accuracy of Skyflow’s technical content is paramount to earning and keeping customer trust. Although new features were released every other week, documentation for the features took an average of 3 weeks to complete, including drafting, review, and publication. The following diagram illustrates their content creation workflow.

Looking at our documentation workflows, we at Skyflow discovered areas where generative artificial intelligence (AI) could improve our efficiency. Specifically, creating the first draft—often referred to as overcoming the “blank page problem”—is typically the most time-consuming step. The review process could also be long depending on the number of inaccuracies found, leading to additional revisions, additional reviews, and additional delays. Both drafting and reviewing needed to be shorter to make doc target timelines match those of engineering.

To do this, Skyflow built VerbaGPT, a generative AI tool based on Amazon Bedrock. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using the AWS tools without having to manage any infrastructure. With Amazon Bedrock, VerbaGPT is able to prompt large language models (LLMs), regardless of model provider, and uses Retrieval Augmented Generation (RAG) to provide accurate first drafts that make for quick reviews.

In this post, we share how Skyflow improved their workflow to create documentation in days instead of weeks using Amazon Bedrock.

Solution overview

VerbaGPT uses Contextual Composition (CC), a technique that incorporates a base instruction, a template, relevant context to inform the execution of the instruction, and a working draft, as shown in the following figure. For the instruction, VerbaGPT tells the LLM to create content based on the specified template, evaluate the context to see if it’s applicable, and revise the draft accordingly. The template includes the structure of the desired output, expectations for what sort of information should exist in a section, and one or more examples of content for each section to guide the LLM on how to process context and draft content appropriately. With the instruction and template in place, VerbaGPT includes as much available context from RAG results as it can, then sends that off for inference. The LLM returns the revised working draft, which VerbaGPT then passes back into a new prompt that includes the same instruction, the same template, and as much context as it can fit, starting from where the previous iteration left off. This repeats until all context is considered and the LLM outputs a draft matching the included template.

The following figure illustrates how Skyflow deployed VerbaGPT on AWS. The application is used by the documentation team and internal users. The solution involves deploying containers on Amazon Elastic Kubernetes Service (Amazon EKS) that host a Streamlit user interface and a backend LLM gateway that is able to invoke Amazon Bedrock or local LLMs, as needed. Users upload documents and prompt VerbaGPT to generate new content. In the LLM gateway, prompts are processed in Python using LangChain and Amazon Bedrock.

When building this solution on AWS, Skyflow followed these steps:

  1. Choose an inference toolkit and LLMs.
  2. Build the RAG pipeline.
  3. Create a reusable, extensible prompt template.
  4. Create content templates for each content type.
  5. Build an LLM gateway abstraction layer.
  6. Build a frontend.

Let’s dive into each step, including the goals and requirements and how they were addressed.

Choose an inference toolkit and LLMs

The inference toolkit you choose, if any, dictates your interface with your LLMs and what other tooling is available to you. VerbaGPT uses LangChain instead of directly invoking LLMs. LangChain has broad adoption in the LLM community, so there was a present and likely future ability to take advantage of the latest advancements and community support.

When building a generative AI application, there are many factors to consider. For instance, Skyflow wanted the flexibility to interact with different LLMs depending on the use case. We also needed to keep context and prompt inputs private and secure, which meant not using LLM providers who would log that information or fine-tune their models on our data. We needed to have a variety of models with unique strengths at our disposal (such as long context windows or text labeling) and to have inference redundancy and fallback options in case of outages.

Skyflow chose Amazon Bedrock for its robust support of multiple FMs and its focus on privacy and security. With Amazon Bedrock, all traffic remains inside AWS. VerbaGPT’s primary foundation model is Anthropic Claude 3 Sonnet on Amazon Bedrock, chosen for its substantial context length, though it also uses Anthropic Claude Instant on Amazon Bedrock for chat-based interactions.

Build the RAG pipeline

To deliver accurate and grounded responses from LLMs without the need for fine-tuning, VerbaGPT uses RAG to fetch data related to the user’s prompt. By using RAG, VerbaGPT became familiar with the nuances of Skyflow’s features and procedures, enabling it to generate informed and complementary content.

To build your own content creation solution, you collect your corpus into a knowledge base, vectorize it, and store it in a vector database. VerbaGPT includes all of Skyflow’s documentation, blog posts, and whitepapers in a vector database that it can query during inference. Skyflow uses a pipeline to embed content and store the embedding in a vector database. This embedding pipeline is a multi-step process, and everyone’s pipeline is going to look a little different. Skyflow’s pipeline starts by moving artifacts to a common data store, where they are de-identified. If your documents have personally identifiable information (PII), payment card information (PCI), personal health information (PHI), or other sensitive data, you might use a solution like Skyflow LLM Privacy Vault to make de-identifying your documentation straightforward. Next, the pipeline chunks the documents into pieces, then finally calculates vectors for the text chunks and stores them in FAISS, an open source vector store. VerbaGPT uses FAISS because it is fast and straightforward to use from Python and LangChain. AWS also has numerous vector stores to choose from for a more enterprise-level content creation solution, including Amazon Neptune, Amazon Relational Database Service (Amazon RDS) for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon Kendra, Amazon OpenSearch Service, and Amazon DocumentDB (with MongoDB compatibility). The following diagram illustrates the embedding generation pipeline.

When chunking your documents, keep in mind that LangChain’s default splitting strategy can be aggressive. This can result in chunks of content that are so small that they lack meaningful context and result in worse output, because the LLM has to make (largely inaccurate) assumptions about the context, producing hallucinations. This issue is particularly noticeable in Markdown files, where procedures were fragmented, code blocks were divided, and chunks were often only single sentences. Skyflow created its own Markdown splitter to work more accurately with VerbaGPT’s RAG output content.
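A simplified sketch of such an embedding pipeline, assuming LangChain with Amazon Bedrock Titan embeddings and a stock Markdown-aware splitter (rather than Skyflow's custom splitter, which isn't public), might look like the following. The corpus path, model ID, chunk sizes, and index name are placeholder choices.

from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

# Load the (already de-identified) documentation corpus; the path is a placeholder
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Split into chunks large enough to keep meaningful context together
splitter = MarkdownTextSplitter(chunk_size=1500, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Embed the chunks with an Amazon Bedrock embedding model and store them in FAISS
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
vector_store = FAISS.from_documents(chunks, embeddings)

# Persist the index locally so the inference path can load it later
vector_store.save_local("faiss_index")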

Create a reusable, extensible prompt template

After you deploy your embedding pipeline and vector database, you can start intelligently prompting your LLM with a prompt template. VerbaGPT uses a system prompt that instructs the LLM how to behave and includes a directive to use content in the Context section to inform the LLM’s response.

The inference process queries the vector database with the user’s prompt, fetches the results above a certain similarity threshold, and includes the results in the system prompt. The solution then sends the system prompt and the user’s prompt to the LLM for inference.
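A hedged sketch of that retrieval step, reusing the FAISS index saved in the previous example and a placeholder similarity threshold, could look like this.

from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
# The deserialization flag is required by recent langchain_community versions
vector_store = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

user_prompt = "Create a how-to guide for creating a service account."

# Fetch candidate chunks with relevance scores and keep only those above a threshold
results = vector_store.similarity_search_with_relevance_scores(user_prompt, k=8)
context_chunks = [doc.page_content for doc, score in results if score >= 0.6]

# Assemble the system prompt with the retrieved context before calling the LLM
context_block = "\n".join(context_chunks)
system_prompt = (
    "You're an expert writer tasked with creating content according to the user's request.\n"
    f"Context: {context_block}"
)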

The following is a sample prompt for drafting with Contextual Composition that includes all the necessary components: a system prompt, a template, context, a working draft, and additional instructions:

System: """You're an expert writer tasked with creating content according to the user's request.
Use Template to structure your output and identify what kind of content should go in each section.
Use WorkingDraft as a base for your response.
Evaluate Context against Template to identify if there is any pertinent information.
If needed, update or refine WorkingDraft using the supplied Context.
Treat User input as additional instruction."""
---
Template: """Write a detailed how-to guide in Markdown using the following template:
# [Title]
This guide explains how to [insert a brief description of the task].
[Optional: Specify when and why your user might want to perform the task.]
...
"""
---
Context: [
  { "text": "To authenticate with Skyflow's APIs and SDKs, you need to create a service account. To create...", "metadata": { "source": "service-accounts.md" }},
  ...
]
---
WorkingDraft: ""
---
User: Create a how-to guide for creating a service account.

Create content templates

To round out the prompt template, you need to define content templates that match your desired output, such as a blog post, how-to guide, or press release. You can jumpstart this step by sourcing high-quality templates. Skyflow sourced documentation templates from The Good Docs Project. Then, we adapted the how-to and concept templates to align with internal styles and specific needs. We also adapted the templates for use in prompt templates by providing instructions and examples per section. By clearly and consistently defining the expected structure and intended content of each section, the LLM was able to output content in the formats needed, while being both informative and stylistically consistent with Skyflow’s brand.

Build an LLM gateway abstraction layer

Amazon Bedrock provides a single API to invoke a variety of FMs. Skyflow also wanted inference redundancy and fallback options in case VerbaGPT encountered Amazon Bedrock service limit exceeded errors. To that end, VerbaGPT routes all model calls through an LLM gateway that acts as an abstraction layer.

The main component of the gateway is the model catalog, which can return a LangChain llm model object for the specified model, updated to include any parameters. You can create this with a simple if/else statement like that shown in the following code:

from langchain.chains import LLMChain
from langchain_community.llms import Bedrock, CTransformers

prompt = ""   		# User input
prompt_template = ""   	# The LangChain-formatted prompt template object
rag_results = get_rag(prompt)   # Results from vector database

# Get chain-able model object and token limit.
def get_model(model: str, options: dict):
    if model == "claude-instant-v1":
        llm = Bedrock(
            model_id="anthropic.claude-instant-v1",
            model_kwargs={"max_tokens_to_sample": options["max_output_tokens"], "temperature": options["temperature"]}
        )
        token_limit = 100000

    elif model == "claude-v2.1":
        llm = Bedrock(
            model_id="anthropic.claude-v2.1",
            model_kwargs={"max_tokens_to_sample":  options["max_output_tokens"], "temperature": options["temperature"]}
        )
        token_limit = 200000

    elif model == "llama-2":
        config = {
            "context_length": 4096,
            "max_new_tokens": options["max_output_tokens"],
            "stop": [
                "Human:",
            ],
        }
        llm = CTransformers(
            model="TheBloke/Llama-2-7b-Chat-GGUF",
            model_file="llama-2-7b-chat.Q4_K_M.gguf",
            model_type="llama",
            config=config,
        )
        token_limit = 4096

    else:
        raise ValueError(f"Unsupported model: {model}")

    return llm, token_limit

llm, token_limit = get_model("claude-v2.1")

chain = LLMChain(
    llm=llm,
    prompt=prompt_template
)

response = chain.run({"input": prompt, "context":rag_results})

Because standard input formats are mapped into the function and all custom LLM object construction is handled within it, the rest of the code stays clean and works directly with LangChain’s llm object.

Build a frontend

The final step was to add a UI on top of the application to hide the inner workings of LLM calls and context. A simple UI is key for generative AI applications, so users can efficiently prompt the LLMs without worrying about the details unnecessary to their workflow. As shown in the solution architecture, VerbaGPT uses Streamlit to quickly build useful, interactive UIs that allow users to upload documents for additional context and draft new documents rapidly using Contextual Composition. Streamlit is Python based, which makes it straightforward for data scientists to be efficient at building UIs.
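A bare-bones sketch of such a Streamlit front end follows, with hypothetical stand-in functions for VerbaGPT's RAG and LLM-gateway internals; the widget labels and content types are illustrative only.

import streamlit as st

# Hypothetical stand-ins for VerbaGPT's internals (assumptions, not the real implementation)
def add_to_context(name: str, text: str) -> None:
    # Store uploaded text in session state; the real app would embed it for RAG
    st.session_state.setdefault("context", {})[name] = text

def draft_document(content_type: str, request: str) -> str:
    # The real app would run Contextual Composition through the LLM gateway here
    return f"*Draft {content_type} for:* {request}"

st.title("VerbaGPT")

# Let users upload documents that become additional context for drafting
uploaded = st.file_uploader("Upload context documents", accept_multiple_files=True)
if uploaded:
    for f in uploaded:
        add_to_context(f.name, f.read().decode("utf-8", errors="ignore"))

content_type = st.selectbox("Content type", ["How-to guide", "Concept", "Release notes"])
request = st.text_area("What should the draft cover?")

if st.button("Create AI draft") and request:
    with st.spinner("Drafting..."):
        draft = draft_document(content_type, request)
    st.markdown(draft)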

Results

By using the power of Amazon Bedrock for inferencing and Skyflow for data privacy and sensitive data de-identification, your organization can significantly speed up the production of accurate, secure technical documents, just like the solution shown in this post. Skyflow was able to use existing technical content and best-in-class templates to reliably produce drafts of different content types in minutes instead of days. For example, given a product requirements document (PRD) and an engineering design document, VerbaGPT can produce drafts for a how-to guide, conceptual overview, summary, release notes line item, press release, and blog post within 10 minutes. Normally, this would take multiple individuals from different departments multiple days each to produce.

The new content flow shown in the following figure moves generative AI to the front of all technical content Skyflow creates. During the “Create AI draft” step, VerbaGPT generates content in the approved style and format in just 5 minutes. Not only does this solve the blank page problem, but first drafts are also created with less interviewing and fewer requests for engineers to draft content, freeing them to add value through feature development instead.

The security measures Amazon Bedrock provides around prompts and inference aligned with Skyflow’s commitment to data privacy, and allowed Skyflow to use additional kinds of context, such as system logs, without the concern of compromising sensitive information in third-party systems.

As more people at Skyflow used the tool, they wanted additional content types available: VerbaGPT now has templates for internal reports from system logs, email templates from common conversation types, and more. Additionally, although Skyflow’s RAG context is clean, VerbaGPT is integrated with Skyflow LLM Privacy Vault to de-identify sensitive data in user inference inputs, maintaining Skyflow’s stringent standards of data privacy and security even while using the power of AI for content creation.

Skyflow’s journey in building VerbaGPT has drastically shifted content creation, and the toolkit wouldn’t be as robust, accurate, or flexible without Amazon Bedrock. The significant reduction in content creation time—from an average of around 3 weeks to as little as 5 days, and sometimes even a remarkable 3.5 days—marks a substantial leap in efficiency and productivity, and highlights the power of AI in enhancing technical content creation.

Conclusion

Don’t let your documentation lag behind your product development. Start creating your technical content in days instead of weeks, while maintaining the highest standards of data privacy and security. Learn more about Amazon Bedrock and discover how Skyflow can transform your approach to data privacy.

If you’re scaling globally and have privacy or data residency needs for your PII, PCI, PHI, or other sensitive data, reach out to your AWS representative to see if Skyflow is available in your region.


About the authors

Manny Silva is Head of Documentation at Skyflow and the creator of Doc Detective. Technical writer by day and engineer by night, he’s passionate about intuitive and scalable developer experiences and likes diving into the deep end as the 0th developer.

Jason Westra is a Senior Solutions Architect for AWS AI/ML startups. He provides guidance and technical assistance that enables customers to build scalable, highly available, secure AI and ML workloads in AWS Cloud.

Read More

Creativity Accelerated: New RTX-Powered AI Hardware and Software Announced at COMPUTEX

Creativity Accelerated: New RTX-Powered AI Hardware and Software Announced at COMPUTEX

NVIDIA launched NVIDIA Studio at COMPUTEX in 2019. Five years and more than 500 NVIDIA RTX-accelerated apps and games later, it’s bringing AI to even more creators with an array of new RTX technology integrations announced this week at COMPUTEX 2024.

Newly announced NVIDIA GeForce RTX AI laptops — including the ASUS ProArt PX13 and P16 and MSI Stealth 16 AI+ laptops — will feature dedicated RTX Tensor Cores to accelerate AI performance and power-efficient systems-on-a-chip with Windows 11 AI PC features. They join over 200 laptops already accelerated with RTX AI technology.

NVIDIA RTX Video, a collection of technologies including RTX Video Super Resolution and RTX Video HDR that enhance video content streamed in browsers like Google Chrome, Microsoft Edge and Mozilla Firefox, is coming to the free VLC Media Player. And for the first time in June, creators can enjoy these AI-enhanced video effects in popular creative apps like DaVinci Resolve and Wondershare Filmora.

DaVinci Resolve and Cyberlink PowerDirector are adding NVIDIA’s new H.265 Ultra-High-Quality (UHQ) mode, which uses the NVIDIA NVENC hardware encoder to increase high-efficiency video coding (HEVC) encoding efficiency by 10%.

NVIDIA RTX Remix, a modding platform for remastering classic games with RTX, will soon be made open source, allowing more modders to streamline how assets are replaced and scenes are relit. RTX Remix will also be made accessible via a new REST application programming interface (API) to connect the platform to other modding tools like Blender and Hammer.

Creative apps are continuing to adopt AI-powered NVIDIA DLSS for higher-quality ray-traced visuals in the viewport, with 3D modeling platform Womp being the latest to integrate DLSS 3.5 with Ray Reconstruction.

NVIDIA unveiled Project G-Assist, an RTX-powered AI-assistant technology demo that provides context-aware help for PC games and apps.

The new NVIDIA app beta update adds 120 frames per second AV1 video capture and one-click performance-tuning.

And the latest Game Ready Driver and NVIDIA Studio Driver are available for installation today.

Video Gets the AI Treatment

RTX Video is a collection of real-time, AI-based video enhancements — powered by RTX GPUs equipped with AI Tensor Cores — to dramatically improve video quality.

It includes RTX Video Super Resolution — an upscaling technology that removes compression artifacts and generates additional pixels to improve video sharpness and clarity up to 4K — and RTX Video HDR, which transforms standard dynamic range videos into stunning high-dynamic range on HDR10 displays.

NVIDIA has released the RTX Video software development kit, which allows app developers to add RTX Video effects to creator workflows.

Blackmagic Design’s DaVinci Resolve, a powerful video editing app with color correction, visual effects, graphics and audio post-production capabilities, will be one of the first to integrate RTX Video. The integration is being demoed on the COMPUTEX show floor.

Wondershare Filmora, a video editing app with AI tools and pro-level social media video editing features, will support RTX Video HDR, coming soon.

Wondershare Filmora will soon support RTX Video HDR.

VLC Media Player, an open-source, cross-platform media player, has added RTX Video HDR in its latest beta release, following the feature’s recently added support in Mozilla Firefox.

NVIDIA hardware encoders deliver a generational boost in encoding efficiency to HEVC. Performance tested on dual Xeon Gold-6140@2.3GHz running NVIDIA L4 Tensor Core GPUs with driver 520.65.

NVIDIA also released a new UHQ mode in NVENC, a dedicated hardware encoder on RTX GPUs, for the HEVC video compression standard (also known as H.265). The new mode increases compression by 10% without diminishing quality, making NVENC HEVC 34% more efficient than the typically used x264 Medium compression standard.

DaVinci Resolve and Cyberlink PowerDirector video editing software will be adding support for the new UHQ mode in their next updates. Stay tuned for official launch dates.

RTX Remix Open Sources Creator Toolkit

NVIDIA RTX Remix allows modders to easily capture game assets, automatically enhance materials with generative AI tools and create stunning RTX remasters with full ray tracing.

RTX Remix open beta recently added DLSS 3.5 support featuring Ray Reconstruction, an AI model that creates higher-quality images for intensive ray-traced games and apps.

Later this month, NVIDIA will make the RTX Remix Toolkit open source, allowing more modders to streamline how assets are replaced and scenes are relit. The company is also increasing the supported file formats for RTX Remix’s asset ingestor and bolstering RTX Remix’s AI Texture Tools with new models.

The RTX Remix toolkit is now completely open source.

NVIDIA is also making the capabilities of RTX Remix accessible via a new powerful REST API, allowing modders to livelink RTX Remix to other DCC tools such as Blender and modding tools such as Hammer. NVIDIA is also providing an SDK for the RTX Remix runtime to allow modders to deploy RTX Remix’s renderer into other applications and games beyond DirectX 8 and 9 classics.

Catch Some Rays

NVIDIA DLSS 3.5 with Ray Reconstruction enhances ray-traced image quality on NVIDIA RTX and GeForce RTX GPUs by replacing hand-tuned denoisers with an NVIDIA supercomputer-trained AI network that generates higher-quality pixels in between sampled rays.

Previewing content in the viewport, even with high-end hardware, can sometimes offer less-than-ideal image quality, as traditional denoisers require hand-tuning for every scene. With DLSS 3.5, the AI neural network recognizes a wide variety of scenes, producing high-quality preview images and drastically reducing time spent rendering.

The free browser-based 3D modeling platform Womp has added DLSS 3.5 to enhance interactive, photorealistic modeling in the viewport.

DLSS 3.5 with Ray Reconstruction unlocks sharper visuals in the viewport.

Chaos Vantage and D5 Render, two popular professional-grade 3D apps that feature real-time preview modes with ray tracing, have also seen drastic performance increases with DLSS 3.5 — up to a 60% boost from Ray Reconstruction and 4x from all DLSS technologies.

Tools That Accelerate AI Apps

The vast ecosystem of open-source AI models currently available is mostly made up of models pretrained for general purposes and run in data centers.

To create more effective app-specific AI tools that run on local PCs, NVIDIA has introduced the RTX AI Toolkit — an end-to-end workflow for the customization, optimization and deployment of AI models on RTX AI PCs.

Partners such as Adobe, Topaz and Blackmagic Design are integrating RTX AI Toolkit within their popular creative apps to accelerate AI performance on RTX PCs.

Developers can learn more on the NVIDIA Technical Blog.

Read More