Enhance image search experiences with Amazon Personalize, Amazon OpenSearch Service, and Amazon Titan Multimodal Embeddings in Amazon Bedrock

A variety of techniques have been used to return images relevant to search queries. The idea of creating a joint embedding space to facilitate image captioning or text-to-image search has long been of interest to machine learning (ML) practitioners and businesses. Contrastive Language–Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP) were the first two open source models that achieved near-human results on the task. More recently, however, there has been a trend to use the same techniques used to train powerful generative models to create multimodal models that map text and images to the same embedding space to achieve state-of-the-art results.

In this post, we show how to use Amazon Personalize in combination with Amazon OpenSearch Service and Amazon Titan Multimodal Embeddings from Amazon Bedrock to enhance a user’s image search experience by using learned user preferences to further personalize image searches in accordance with a user’s individual style.

Solution overview

Multimodal models are being used in text-to-image searches across a variety of industries. However, one area where these models fall short is in incorporating individual user preferences into their responses. A user searching for images of a bird, for example, could have many different desired results.

[Image grid: six example bird images in a variety of visual styles]

In an ideal world, we could learn a user’s preferences from the images they previously viewed, favorited, or downloaded, and use that history to return contextually relevant images in line with their recent interactions and style preferences.

Implementing the proposed solution includes the following high-level steps:

  1. Create embeddings for your images.
  2. Create a cluster for the embeddings.
  3. Store the embeddings and their clusters in a data store.
  4. Update the image interactions dataset with the image cluster.
  5. Create an Amazon Personalize personalized ranking solution.
  6. Serve user search requests.

Prerequisites

To implement the proposed solution, you should have the following:

  • An AWS account and familiarity with Amazon Personalize, Amazon SageMaker, OpenSearch Service, and Amazon Bedrock.
  • The Amazon Titan Multimodal Embeddings model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If Amazon Titan Multimodal Embeddings is enabled, the access status will show as Access granted, as shown in the following screenshot. You can enable access to the model by choosing Manage model access, selecting Amazon Titan Multimodal Embeddings G1, and then choosing Save Changes.

[Screenshot: the Model access page of the Amazon Bedrock console showing Access granted for Amazon Titan Multimodal Embeddings G1]

Create embeddings for your images

Embeddings are a mathematical representation of a piece of information such as a text or an image. Specifically, they are a vector or ordered list of numbers. This representation helps capture the meaning of the image or text in such a way that you can use it to determine how similar images or text are to each other by taking their distance from each other in the embedding space.

bird → [-0.020802604, -0.009943095, 0.0012887075, -0….
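For intuition, the similarity between two of these vectors can be measured with cosine similarity (or, for normalized embeddings, Euclidean distance). The following minimal sketch assumes you already have two embedding vectors as Python lists and is for illustration only:

import numpy as np

def cosine_similarity(vec_a, vec_b):
    # Cosine similarity between two embedding vectors
    a = np.asarray(vec_a, dtype=np.float32)
    b = np.asarray(vec_b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For example, the embedding of the text "bird" and the embedding of a bird image
# should score higher than embeddings of unrelated concepts:
# similarity = cosine_similarity(text_embedding, image_embedding)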

As a first step, you can use the Amazon Titan Multimodal Embeddings model to generate embeddings for your images. With the Amazon Titan Multimodal Embeddings model, we can use an actual bird image or text like “bird” as an input to generate an embedding. Furthermore, these embeddings will be close to each other when the distance is measured by an appropriate distance metric in a vector database.

The following code snippet shows how to generate embeddings for an image or a piece of text using Amazon Titan Multimodal Embeddings:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

class EmbedError(Exception):
    """Raised when embeddings generation fails."""
    pass

def generate_embeddings_with_titan(image=None, text=None):
    user_input = {}

    if image is not None:
        user_input["inputImage"] = image
    if text is not None:
        user_input["inputText"] = text

    if not user_input:
        raise ValueError("One user input of an image or a text is required")

    body = json.dumps(user_input)

    response = bedrock_runtime.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())

    embedding_error = response_body.get("message")

    if embedding_error is not None:
        raise EmbedError(f"Embeddings generation error: {embedding_error}")

    return response_body.get("embedding")

The image must be base64 encoded in order to create an embedding. For more information, see Amazon Titan Multimodal Embeddings G1. You can create this encoded version of your image for many image file types as follows:

import base64

with open(Image_Filepath + "/" + image, "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode('utf8')

The resulting input_image can be passed directly to the embedding function you defined earlier.
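Putting these pieces together, the following sketch loops over a local folder of images and collects an embedding for each one. The folder path, file filtering, and list names are illustrative assumptions to adapt to your own dataset:

import base64
import os

Image_Filepath = "./images"  # assumed local folder containing your images
image_names = []
image_embeddings_list = []

for image in sorted(os.listdir(Image_Filepath)):
    if not image.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    with open(Image_Filepath + "/" + image, "rb") as image_file:
        input_image = base64.b64encode(image_file.read()).decode('utf8')
    # Reuse the embedding function defined earlier
    image_embeddings_list.append(generate_embeddings_with_titan(image=input_image))
    image_names.append(image)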

Create a cluster for the embeddings

As a result of the previous step, the Amazon Titan Multimodal Embeddings model has produced a vector representation of each image. Because the goal is to create a more personalized image search influenced by the user’s previous interactions, you create a cluster out of the image embeddings to group similar images together. This is useful because it forces the downstream re-ranker, in this case an Amazon Personalize personalized ranking model, to learn user preferences for specific image styles as opposed to their preferences for individual images.

In this post, to create our image clusters, we use an algorithm made available through the fully managed ML service SageMaker, specifically the K-Means clustering algorithm. You can use any clustering algorithm that you are familiar with. K-Means clustering is a widely used method for clustering where the aim is to partition a set of objects into K clusters in such a way that the sum of the squared distances between the objects and their assigned cluster mean is minimized. The appropriate value of K depends on the data structure and the problem being solved. Make sure to choose the right value of K, because a small value can result in under-clustered data, and a large value can cause over-clustering.
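One common heuristic for choosing K is the elbow method: train clustering models for several candidate values of K and look for the point where the total within-cluster sum of squared distances stops decreasing sharply. The following sketch uses scikit-learn locally as a lightweight proxy for this experimentation; it is not part of the SageMaker training flow shown next, and the candidate K values are arbitrary examples:

import numpy as np
from sklearn.cluster import KMeans as SKKMeans

embeddings = np.asarray(image_embeddings_list, dtype=np.float32)

# Inertia is the within-cluster sum of squared distances for a given K
for k in [25, 50, 100, 200]:
    model = SKKMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    print(f"K={k}, inertia={model.inertia_:.2f}")
# Choose the K after which the inertia curve flattens out (the "elbow")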

The following code snippet is an example of how to create and train a K-Means cluster for image embeddings. In this example, the choice of 100 clusters is arbitrary—you should experiment to find a number that is best for your use case. The instance type represents the Amazon Elastic Compute Cloud (Amazon EC2) compute instance that runs the SageMaker K-Means training job. For detailed information on which instance types fit your use case, and their performance capabilities, see Amazon Elastic Compute Cloud instance types. For information about pricing for these instance types, see Amazon EC2 Pricing. For information about available SageMaker notebook instance types, see CreateNotebookInstance.

For most experimentation, you should use an ml.t3.medium instance. This is the default instance type for CPU-based SageMaker images, and is available as part of the AWS Free Tier.

import numpy as np
from sagemaker import KMeans

num_clusters = 100

kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
    output_path="s3://your_unique_s3bucket_name/",
    k=num_clusters,
    num_trials=num_clusters,
    epochs=10
)

kmeans.fit(kmeans.record_set(np.asarray(image_embeddings_list, dtype=np.float32)))
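After training, you need the cluster assignment for each image embedding. One way to obtain it, sketched below, is to deploy the trained model to a real-time endpoint and call predict. The instance type is an example, and the closest_cluster label access follows the standard SageMaker K-Means samples, so verify it against your SDK version:

import numpy as np

# Deploy the trained K-Means model to an endpoint (instance type is an example)
kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type="ml.m5.large")

records = kmeans_predictor.predict(np.asarray(image_embeddings_list, dtype=np.float32))
image_clusters = [int(r.label["closest_cluster"].float32_tensor.values[0]) for r in records]

# Delete the endpoint when you are done to avoid ongoing charges
# kmeans_predictor.delete_endpoint()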

Store embeddings and their clusters in a data store

As a result of the previous step, a vector representation for each image has been created and assigned to an image cluster by our clustering model. Now, you need to store this vector such that the other vectors that are nearest to it can be returned in a timely manner. This allows you to input a text such as “bird” and retrieve images that prominently feature birds.

Vector databases provide the ability to store and retrieve vectors as high-dimensional points. They add additional capabilities for efficient and fast lookup of nearest neighbors in the N-dimensional space. They are typically powered by nearest neighbor indexes and built with algorithms like the Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF) algorithms. Vector databases provide additional capabilities like data management, fault tolerance, authentication and access control, and a query engine.

AWS offers many services for your vector database requirements. OpenSearch Service is one example; it makes it straightforward for you to perform interactive log analytics, real-time application monitoring, website search, and more. For information about using OpenSearch Service as a vector database, see k-Nearest Neighbor (k-NN) search in OpenSearch Service.

For this post, we use OpenSearch Service as a vector database to store the embeddings. To do this, you need to create an OpenSearch Service cluster or use OpenSearch Serverless. Regardless of which approach you use for the cluster, you need to create a vector index. Indexing is the method by which search engines organize data for fast retrieval. To use a k-NN vector index for OpenSearch Service, you need to add the index.knn setting and add one or more fields of the knn_vector data type. This lets you search for points in a vector space and find the nearest neighbors for those points by Euclidean distance or cosine similarity, either of which is acceptable for Amazon Titan Multimodal Embeddings.

The following code snippet shows how to create an OpenSearch Service index with k-NN enabled to serve as a vector datastore for your embeddings:

def create_index(opensearch_client, index_name, vector_field_name):
    settings = {
      "settings": {
        "index": {
          "knn": True
        }
      },
      "mappings": {
        "properties": {
            vector_field_name: {
              "type": "knn_vector",
              "dimension": 1024,
              "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss",
                "parameters": {
                  "m": 32
                }
              }
            }
        }
      }
    }
    response = opensearch_client.indices.create(index=index_name, body=settings)
    return bool(response['acknowledged'])

The following code snippet shows how to store an image embedding into the OpenSearch Service index you just created:

    embedding_vector = {
        "name": image_name,
        "type": "Image",
        "embedding": image_embedding,
        "cluster": image_cluster
    }
    # opensearch_client is your Amazon OpenSearch Service cluster client
    opensearch_client.index(
        index=index_name,
        body=embedding_vector,
        id=str(index),
        refresh=True
    )
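Later, when serving a request, you can retrieve the nearest images for a query embedding with a k-NN search against this index. The following sketch assumes the index and field names used above; the query embedding comes from the same Amazon Titan Multimodal Embeddings function, this time called with text input:

def knn_search(opensearch_client, index_name, query_embedding, k=10):
    # k-NN query against the "embedding" knn_vector field defined at index creation
    query = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": k
                }
            }
        }
    }
    response = opensearch_client.search(index=index_name, body=query)
    # Each hit carries the stored image name, its cluster, and a relevance score
    return [
        (hit["_source"]["name"], hit["_source"]["cluster"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]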

Update the image interactions dataset with the image cluster

When creating an Amazon Personalize re-ranker, the item interactions dataset represents the user interaction history with your items. Here, the images represent the items and the interactions could consist of a variety of events, such as a user downloading an image, favoriting it, or even viewing a higher resolution version of it. For our use case, we train our recommender on the image clusters instead of the individual images. This gives the model the opportunity to recommend based on the cluster-level interactions and understand the user’s overall stylistic preferences as opposed to preferences for an individual image in the moment.

To do so, update the interactions dataset to use the image cluster instead of the image ID, and store the file in an Amazon Simple Storage Service (Amazon S3) bucket, at which point it can be imported into Amazon Personalize. A sketch of this substitution follows.
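The following sketch illustrates one way to perform the substitution with pandas. The interactions file name, its USER_ID, ITEM_ID, TIMESTAMP, and EVENT_TYPE columns (the schema Amazon Personalize expects), the image_names and image_clusters lists from the earlier steps, and the S3 location are all assumptions to adapt:

import pandas as pd

# Map each image name to the cluster it was assigned to
image_to_cluster = dict(zip(image_names, image_clusters))

interactions_df = pd.read_csv("interactions.csv")  # columns: USER_ID, ITEM_ID, TIMESTAMP, EVENT_TYPE
interactions_df["ITEM_ID"] = interactions_df["ITEM_ID"].map(image_to_cluster)

# Writing directly to S3 requires the s3fs package; alternatively, write locally and upload with boto3
interactions_df.to_csv("s3://your_unique_s3bucket_name/interactions_clusters.csv", index=False)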

Create an Amazon Personalize personalized ranking campaign

The Personalized-Ranking recipe generates personalized rankings of items. A personalized ranking is a list of recommended items that are re-ranked for a specific user. This is useful if you have a collection of ordered items, such as search results, promotions, or curated lists, and you want to provide a personalized re-ranking for each of your users. Refer to the following example available on GitHub for complete step-by-step instructions on how to create an Amazon Personalize recipe. The high-level steps are as follows:

  1. Create a dataset group.
  2. Prepare and import data.
  3. Create recommenders or custom resources.
  4. Get recommendations.

We create and deploy a personalized ranking campaign. First, you need to create a personalized ranking solution. A solution is a combination of a dataset group and a recipe, which is basically a set of instructions for Amazon Personalize to prepare a model to solve a specific type of business use case. Then you train a solution version and deploy it as a campaign.

The following code snippet shows how to create a Personalized-Ranking solution resource:

personalized_ranking_create_solution_response = personalize_client.create_solution(
    name = "personalized-image-reranker",
    datasetGroupArn = dataset_group_arn,
    recipeArn = personalized_ranking_recipe_arn
)
personalized_ranking_solution_arn = personalized_ranking_create_solution_response['solutionArn']

The following code snippet shows how to create a Personalized-Ranking solution version resource:

personalized_ranking_create_solution_version_response = personalize_client.create_solution_version(
    solutionArn = personalized_ranking_solution_arn
)

personalized_ranking_solution_version_arn = personalized_ranking_create_solution_version_response['solutionVersionArn']

The following code snippet shows how to create a Personalized-Ranking campaign resource:

create_campaign_response = personalize_client.create_campaign(
        name = "personalized-image-reranker-campaign",
        solutionVersionArn = personalized_ranking_solution_version_arn,
        minProvisionedTPS = 1
        )

personalized_ranking_campaign_arn = create_campaign_response['campaignArn']

Serve user search requests

Now our solution flow is ready to serve a user search request and provide personalized ranked results based on the user’s previous interactions. The search query will be processed as shown in the following diagram.

[Diagram: personalized image search architecture]

To set up personalized multimodal search, the following steps are performed:

  1. Multimodal embeddings are created for the image dataset.
  2. A clustering model is created in SageMaker, and each image is assigned to a cluster.
  3. The unique image IDs are replaced with cluster IDs in the image interactions dataset.
  4. An Amazon Personalize personalized ranking model is trained on the cluster interaction dataset.
  5. Separately, the image embeddings are added to an OpenSearch Service vector index.

The following workflow would be executed to process a user’s query:

  1. Amazon API Gateway calls an AWS Lambda function when the user enters a query.
  2. The Lambda function calls the same multimodal embedding function to generate an embedding of the query.
  3. A k-NN search is performed for the query embedding on the vector index.
  4. A personalized score for the cluster ID for each retrieved image is obtained from the Amazon Personalize personalized ranking model.
  5. The scores from OpenSearch Service and Amazon Personalize are combined through a weighted mean. The images are re-ranked and returned to the user.

The weights on each score can be tuned based on the available data, desired outcomes, and the desired degree of personalization vs. contextual relevance.

[Figure: personalized image search weighted score]
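The following sketch shows one way to implement this combination, pairing the k-NN scores from OpenSearch Service with a ranking from the Amazon Personalize campaign created earlier. The score normalization, the rank-to-score mapping, and the default weight are illustrative assumptions:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

def personalized_rerank(user_id, knn_hits, campaign_arn, personalize_weight=0.5):
    # knn_hits: list of (image_name, cluster_id, opensearch_score) from the k-NN search
    if not knn_hits:
        return []
    cluster_ids = list({str(cluster) for _, cluster, _ in knn_hits})

    response = personalize_runtime.get_personalized_ranking(
        campaignArn=campaign_arn,
        userId=str(user_id),
        inputList=cluster_ids
    )
    # Map each cluster to a simple rank-based personalization score in [0, 1]
    ranking = response["personalizedRanking"]
    cluster_scores = {
        item["itemId"]: 1.0 - idx / len(ranking) for idx, item in enumerate(ranking)
    }

    max_knn = max(score for _, _, score in knn_hits)
    return sorted(
        knn_hits,
        key=lambda hit: (1 - personalize_weight) * hit[2] / max_knn
        + personalize_weight * cluster_scores.get(str(hit[1]), 0.0),
        reverse=True,
    )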

To see what this looks like in practice, let’s explore a few examples. In our example dataset, all users would, in the absence of any personalization, receive the following images if they search for “cat”.

[Image grid: six generic cat images returned without personalization]

However, a user who has a history of viewing the following images (let’s call them comic-art-user) clearly has a certain style preference that isn’t addressed by the majority of the previous images.

[Image grid: six comic-art-style images previously viewed by comic-art-user]

By combining Amazon Personalize with the vector database capabilities of OpenSearch Service, we are able to return the following results for cats to our user:

[Image grid: six comic-art-style cat images returned for comic-art-user]

In the following example, a user has been viewing or downloading the following images (let’s call them neon-punk-user).

[Image row: three neon-punk-style images previously viewed by neon-punk-user]

They would receive the following personalized results instead of the mostly photorealistic cats that all users would receive absent any personalization.

[Image row: three neon-punk-style cat images returned for neon-punk-user]

Finally, a user viewed or downloaded the following images (let’s call them origami-clay-user).

[Image row: three origami- and clay-style images previously viewed by origami-clay-user]

They would receive the following images as their personalized search results.

[Image row: three origami- and clay-style cat images returned for origami-clay-user]

These examples illustrate how the search results have been influenced by the users’ previous interactions with other images. By combining the power of Amazon Titan Multimodal Embeddings, OpenSearch Service vector indexing, and Amazon Personalize personalization, we are able to deliver each user relevant search results in alignment with their style preferences as opposed to showing all of them the same generic search result.

Furthermore, because Amazon Personalize is capable of updating based on changes in the user style preference in real time, these search results would update as the user’s style preferences change, for example if they were a designer working for an ad agency who switched mid-browsing session to working on a different project for a different brand.

Clean up

To avoid incurring future charges, delete the resources created while building this solution:

  1. Delete the OpenSearch Service domain or OpenSearch Serverless collection.
  2. Delete the SageMaker resources.
  3. Delete the Amazon Personalize resources.

Conclusion

By combining the power of Amazon Titan Multimodal Embeddings, OpenSearch Service vector indexing and search capabilities, and Amazon Personalize ML recommendations, you can boost the user experience with more relevant items in their search results by learning from their previous interactions and preferences.

For more details on Amazon Titan Multimodal Embeddings, refer to Amazon Titan Multimodal Embeddings G1 model. For more details on OpenSearch Service, refer to Getting started with Amazon OpenSearch Service. For more details on Amazon Personalize, refer to the Amazon Personalize Developer Guide.


About the Authors

Maysara Hamdan is a Partner Solutions Architect based in Atlanta, Georgia. Maysara has over 15 years of experience in building and architecting Software Applications and IoT Connected Products in Telecom and Automotive Industries. In AWS, Maysara helps partners in building their cloud practices and growing their businesses. Maysara is passionate about new technologies and is always looking for ways to help partners innovate and grow.

Eric Bolme is a Specialist Solution Architect with AWS based on the East Coast of the United States. He has 8 years of experience building out a variety of deep learning and other AI use cases and focuses on Personalization and Recommendation use cases with AWS.

End-to-end LLM training on instance clusters with over 100 nodes using AWS Trainium

Llama is Meta AI’s large language model (LLM), with variants ranging from 7 billion to 70 billion parameters. Llama uses a transformers-based decoder-only model architecture, which specializes at language token generation. To train a model from scratch, a dataset containing trillions of tokens is required. The Llama family is one of the most popular LLMs. However, training Llama models can be technically challenging, prolonged, and costly.

In this post, we show you how to accelerate the full pre-training of LLM models by scaling up to 128 trn1.32xlarge nodes, using a Llama 2-7B model as an example. We share best practices for training LLMs on AWS Trainium, scaling the training on a cluster with over 100 nodes, improving efficiency of recovery from system and hardware failures, improving training stability, and achieving convergence. We demonstrate that the quality of Llama 2-7B trained on Trainium is of comparable quality to the open source version on multiple tasks, ranging from multi-task language understanding, math reasoning, to code generation. We also demonstrate the scaling benefits of Trainium.

What makes distributed training across over 100 nodes so challenging?

Training large-scale LLMs requires distributed training across over 100 nodes, and getting elastic access to large clusters of high-performance compute is difficult. Even if you manage to get the required accelerated compute capacity, it’s challenging to manage a cluster of over 100 nodes, maintain hardware stability, and achieve model training stability and convergence. Let’s look at these challenges one by one and how we address them with Trainium clusters during the end-to-end training:

  • Distributed training infrastructure efficiency and scalability – Training LLMs is both computation and memory intensive. In this post, we show you how to enable the different parallel training algorithms on Trainium and select the best hyperparameters to achieve the highest throughput of Llama 2-7B on the Trainium cluster. We also demonstrate the implementations of other memory and computation optimization techniques such as coalescing layers and data type selection on Trainium. Empirically, we have proven that Trainium clusters can reduce costs by up to 46% compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances.
  • Efficient hardware and system recovery – End-to-end LLM training at this scale will inevitably encounter hardware or system failures. We demonstrate how to efficiently enable checkpoint saving and automatically recover using the NeuronX Distributed library. Empirically, we demonstrate that with automatic failure recovery, the effective utilization of hardware computing hours reaches 98.81% compared to 77.83% with a manual recovery method.
  • Training stability and convergence – Finally, frequent occurrence of spikes of loss functions in pre-training deep neural networks such as Llama 2 can lead to catastrophic divergence. Due to the large computation cost required for training LLMs, we want to reduce loss function spikes, improve training stability, and achieve convergence of training. We demonstrate best practices and implementation of techniques such as scaled initialization, gradient clipping, and cache management on Trainium clusters to achieve this. We also show how to monitor and debug for training stability.

Llama 2-7B pre-training setup

In this section, we discuss the steps for setting up Llama 2-7B pre-training.

Infrastructure

Setting up the Llama 2-7B infrastructure consists of the following components:

  • EC2 cluster – The training cluster includes 128 trn1.32xlarge instances (nodes), totaling 2048 Trainium accelerators. The networking among the instances is connected through 8×100 Gbps EFAs. We mounted 56 TB Amazon FSx storage for immediate data storage and checkpoint saving and loading. The raw training data was saved on Amazon Simple Storage Service (Amazon S3) buckets.
  • Orchestration – We first trained the Llama 2-7B from scratch using a trn1.32xlarge cluster that is managed through Amazon Elastic Kubernetes Service (Amazon EKS). For details about the setup procedure, refer to Train Llama2 with AWS Trainium on Amazon EKS. We followed the same procedure but set up the cluster at a much larger scale with 128 trn1.32xlarge instances.
  • Container build – We used a custom Docker image that was built based on the following training containers and included the Llama 2-7B training source files. We stored the custom Docker image in an Amazon Elastic Container Registry (Amazon ECR) registry and deployed it in EKS pods. The following diagram shows the architecture of the cluster and container setup.

Data preparation

The original format of the training dataset contains a large number of compressed files. To use this dataset, we first converted them into a format compatible with the Hugging Face dataset package. We used the Apache Arrow format (the default storage format for datasets) to combine all data into a single file and a single block of a file. This method significantly reduces load times for TB-sized datasets compared to the default method of loading many separate files.

We first downloaded the preprocessed training dataset, a small subset of the full dataset that contains 12 trillion tokens, using a special EC2 instance with 20–30 TB of memory. The data download script is as follows:

    import os
     
    # Cache and tmpdir can be large. Make sure ~/ has enough disk space.
    os.environ["HF_DATASETS_CACHE"] = "~/dataset/cache"
    os.environ["TMPDIR"] = "~/dataset/tmpdir"
     
    import datasets
    from datasets import load_dataset
     
    save_path = "~/<data path>/arrow"
    save_path = os.path.expanduser(save_path)
    os.makedirs(save_path, exist_ok=True)
     
    raw_datasets = load_dataset("togethercomputer/<1T data file name>", 'default', num_proc=448)
    raw_datasets["train"].save_to_disk(
        save_path,
        num_shards=1,
        num_proc=448,
    )

The dataset is processed for optimized storage and access:

    import pyarrow as pa
    import time
     
    a = time.time()
    stream = pa.memory_map("~/<data path>/arrow/train.arrow")
    stream = pa.ipc.open_stream(stream)
    table = stream.read_all()
    print("completed step 1 in seconds: ", time.time() - a)
     
    ca = table["text"]
    l = ca.to_pylist()
    schema = pa.schema({"text": pa.large_string()})
    arr = pa.array(l, type=pa.large_string())
     
    with pa.OSFile("~/<data path>/arrow/train.arrow", "wb") as sink:
        with pa.ipc.new_stream(sink, schema=schema) as writer:
            batch = pa.record_batch([arr], schema=schema)
            writer.write(batch)
    print("completed step 2 in seconds: ", time.time() - a)

On the same instance, we cleaned up the dataset and uploaded the clean dataset to an S3 bucket. We then used a 128 trn1.32xlarge cluster to perform tokenization and packaging (such as dynamically filling sequences and applying masking mechanisms) online during training. Compared with offline packaging methods, this online method saves tremendous development time and computing resources, especially for multiple experiments that use different large datasets and tokenizers.

Model hyperparameters

We adopted the same training hyperparameters as Llama models. Specifically, we used a cosine learning rate scheduler with the same maximum learning rate of 3e-4 and the same minimum learning rate of 3e-5. We followed the same linear warmup of 2,000 steps. The following figure shows a plot of the overall learning rate scheduler.
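The following sketch reproduces this schedule (linear warmup followed by cosine decay from the maximum to the minimum learning rate); the total decay step count follows the hyperparameter table later in this post:

import math

def llama2_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, decay_steps=480_000):
    # Linear warmup to max_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (decay_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))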

We used the AdamW optimizer with β1 = 0.9 and β2 = 0.95. We used a weight decay value of 0.1 for all parameters, including normalization weights. For training stability, gradient-norm clipping of 1.0 was applied. For a different model setup, such as Llama 3, these parameters need to be tuned for optimal performance.

Distributed training infrastructure efficiency and scalability

During the training, we applied general optimization techniques, such as activation checkpointing, model and data parallelism, and computation and communication overlapping in Trainium through the Neuron SDK, as well as some unique enhancements such as BF16 with stochastic rounding. In this section, we list the key features and configurations used in our model pre-training to improve training efficiency.

Model and data parallelism

Neuron supports tensor parallelism (TP), pipeline parallelism (PP), sequence parallelism (SP), and data parallelism (DP). For the 7B model with 4,096 sequence length, we found that a TP degree of 8, PP degree of 1, SP degree of 8, and DP degree of 512 yields the highest training throughput. On a trn1.32xlarge instance cluster, this leads to having four model copies per instance.

We used a global batch size of 1,024 sequences with a maximum sequence length of 4,096 tokens. Each step covered about 4 million tokens. The gradient accumulation step is 2, which resulted in the actual batch size per Neuron core being 1. The following figure illustrates the data parallelism and tensor parallelism we applied in the training.
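As a quick sanity check of these numbers, the following sketch recomputes the tokens per step and the per-core batch size from the parallelism settings:

seq_len = 4096
global_batch_size = 1024
dp_degree = 512
grad_accum_steps = 2

tokens_per_step = global_batch_size * seq_len  # 4,194,304, or roughly 4 million tokens
per_core_batch = global_batch_size // (dp_degree * grad_accum_steps)  # 1 sequence per Neuron core
print(tokens_per_step, per_core_batch)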

Neuron Distributed library

AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium-based instances. It includes the compiler, runtime, and profiling tools. It supports a variety of data types, including FP32, BF16, FP16, and stochastic rounding. The Neuron SDK enables tensor parallelism, pipeline parallelism, and data parallelism distributed strategies through the NeuronX Distributed library. This allows trade-offs between preserving the high accuracy of trained models and training efficiency in throughput and memory consumption. We applied the following features in the training process:

  • Selective activation checkpointing – We used selective activation checkpointing to improve training efficiency. It has a slightly higher memory cost than full activation checkpointing, but increases the overall training throughput.
  • BF16 with stochastic rounding – We compared three precision settings: BF16, BF16 with SR, and mixed precision training. Empirically, we found that BF16 with SR showed the same convergence behavior as mixed precision training, with higher training throughput and lower memory footprint; whereas the training loss of BF16 diverged. Therefore, we chose BF16 with SR in our pre-training exercise.
  • Coalescing layers with the same inputs – We coalesced linear layers with the same inputs to reduce the communication in tensor and sequence parallelism, and improve the efficiency of matrix operations. Specifically, the Q, K, and V layers in an attention block are coalesced, and the two linear projections layers in SwiGLU are also coalesced. This optimization technique is generic to LLMs. The following are the example code snippets:

q_proj, k_proj, v_proj were merged into qkv_proj

            if not self.config.separate_qkv and self.num_heads == self.num_key_value_heads and self.config.kv_shared_group_size == 1:
                qkv_states = self.qkv_proj(hidden_states)
                query_states, key_states, value_states = qkv_states.split(self.split_size, dim=2)
            elif self.config.qkv_linear:
                query_states, key_states, value_states = self.qkv_proj(hidden_states)
            else:
                query_states = self.q_proj(hidden_states)
                key_states = self.k_proj(hidden_states)
                value_states = self.v_proj(hidden_states)

gate_proj, up_proj were merged into gate_up_proj

gate_proj, up_proj = self.gate_up_proj(x).split(self.split_size, dim=2)
  • Compiler optimization – We used the compiling flag --distribution-strategy=llm-training to enable the compiler to perform optimizations applicable to LLM training runs that shard parameters, gradients, and optimizer states across data parallel workers. We also used --model-type=transformer, which performs optimizations specific to transformer models. We set the Neuron environment variable NEURON_FUSE_SOFTMAX=1 to enable compiler optimizations on custom lowering for Softmax operation. Finally, we used NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3 to reduce training latency with asynchronous runs. This overlaps some runs of accelerators and host (CPU).
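As a rough illustration, these settings might be applied in the training launch code along the following lines. This is a sketch, assuming the compiler flags are passed through the NEURON_CC_FLAGS environment variable and that the variables are set before the Neuron runtime is initialized:

import os

# Neuron compiler flags for LLM-training sharding and transformer-specific optimizations
os.environ["NEURON_CC_FLAGS"] = "--distribution-strategy=llm-training --model-type=transformer"

# Enable the custom lowering for the Softmax operation
os.environ["NEURON_FUSE_SOFTMAX"] = "1"

# Overlap accelerator and host execution with asynchronous runs
os.environ["NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS"] = "3"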

The following table summarizes all hyperparameters used in our pre-training exercise.

| Category | Hyperparameter | Trn – NxD |
| --- | --- | --- |
| Optimization parameters | Seq_len | 4096 |
|  | Precision | bf16 |
|  | GBS | 1024 |
|  | Learning rate | 3.00E-04 |
|  | min_lr | 3.00E-05 |
|  | Weight decay | 0.1 |
|  | grad_clip | 1 |
|  | LR scheduler | cosine |
|  | Warmup steps | 2000 |
|  | Constant steps | 0 |
|  | AdamW (beta1, beta2) | (0.9, 0.95) |
|  | AdamW eps | 1.00E-05 |
| Distributed parameters | Number of nodes | 128 |
|  | TP | 8 |
|  | PP | 1 |
|  | DP | 512 |
|  | GBS | 1024 |
|  | Per-Neuron BS | 1 |
|  | Gradient accumulation steps | 2 |
|  | Sequence parallel | Yes |
| Steps | LR decay steps | 480,000 |
|  | Training steps | 500,000 |

Hardware and system recovery

Training a billion-parameter LLM often requires training on a cluster with over 100 nodes, running for multiple days or even weeks. The following are best practices for checking and monitoring cluster health and for efficiently recovering from hardware and system failures:

  • Health sanity check and monitoring – It’s important to monitor the health of the computing nodes. In the initial setup, we first performed a thorough check using the Neuron standard test library to make sure the networking bandwidth performed as expected. During the training, the process can be interrupted due to hardware failures, communication timeouts, and so on. We used Amazon EKS settings to monitor the behavior of the computing nodes. If a node or the network goes bad, a warning message is sent; the cluster then stops all the instances and restarts with the health sanity check.
  • Efficient recovery with Neuron automatic fault recovery – To improve the efficiency of fault recovery, NeuronX Distributed supports checkpoint saving and loading. Particularly, it optimizes the checkpoint saving time by supporting asynchronous checkpoint saving. To reduce the overhead of manual intervention, NeuronX Distributed provides an API that automatically loads the latest saved checkpoint before failures and restarts the training. Those APIs are important for achieving high system uptime and therefore finishing end-to-end training. With the automatic node failure recovery and resuming methods, the effective utilization of hardware computing hours reached 98.81% compared to 77.83% with the manual recovery method. The comparison was based on another experimental training run (over 600 billion tokens) without automatic fault recovery, and we observed an average of 20% lower system up time.

Training stability and convergence

During the training process, we found that the training convergence depends on initialization, weight normalization, and gradient synchronization, which can be constantly monitored during the training. The stability depends on reducing frequent distributed file system access. In this section, we discuss the best practices we exercised to improve numeric stability and achieve convergence of the model.

Initialization

We used a scaled initialization strategy for initializing model parameters. Specifically, the initial standard deviation of output layers in attention blocks and MLP layers was scaled by the square root of layer numbers. Similar to what is discussed in the following whitepaper, we found better numerical stability and convergence with smaller initial variance on deeper layers. Additionally, all parameters were initialized on CPU and offloaded to Trainium. The following figure shows that without the scaled initialization (plotted in green and black), the training loss diverged after 22,000–23,000 steps. In contrast, the training loss (plotted in yellow) converges after enabling the scaled initialization. The default initialization is replaced by this code:

scaled_init_method = partial(
    _init_normal,
    config.initializer_range / math.sqrt(2.0 * config.num_hidden_layers)
)
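The _init_normal helper and the partial import are not shown in the snippet above. A minimal, hypothetical sketch of what such a helper could look like, assuming a plain normal initializer, is the following; the actual implementation in your training code may differ:

import math
import torch
from functools import partial

def _init_normal(std, weight):
    # Draw weights from a zero-mean normal distribution with the given standard deviation
    return torch.nn.init.normal_(weight, mean=0.0, std=std)

# partial(...) binds the scaled standard deviation, producing a one-argument initializer
# that can be applied to each output-layer weight tensor.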

Gradient synchronization with all-reduce

The gradient all-reduce in torch/xla normalizes the global gradient by world_size instead of data parallelism degrees. When we applied hybrid parallelism including both model parallelism (tensor parallelism and pipeline parallelism) and data parallelism, the world_size was larger than the data parallelism degree. This led to divergence issues because of the incorrect gradient normalization. To fix this, we modified the gradient normalization with a bucket_allreduce_gradients based on data parallelism degrees in NeuronX Distributed. The recommended way is to use neuronx_distributed.parallel_layers.grads.bucket_allreduce_gradients.

Neuron persistent cache on a local worker

When we set up the training cluster, all nodes in the 128 trn1.32xlarge instances shared the same file system, using Amazon FSx for storing data, checkpoints, logs, and so on. Storing the Neuron persistent cache generated from the model compilation on Amazon FSx caused a communication bottleneck because those cached graphs are frequently checked by all Trainium devices in the cluster. Such bottlenecks led to a communication timeout and affected training stability. Therefore, we instead stored Neuron persistent caches (compiled graph binary) in the root volume of each local worker.

Training stability monitoring

During the training, we monitored the training loss, L2-norm of gradients, and L2-norm of parameters for debugging the training stability.

Monitoring the training loss curve gives us the first high-level stability signal. We used TensorBoard to monitor the training loss curve and validation loss curve, as shown in the following figure. The entire model was trained on 1.8 trillion tokens. We observed that the training loss decreases fast for the initial 250 billion tokens and enters a log-linear decrease afterwards.

Monitoring the gradient norm and parameter norms

We monitored the gradient norm as an early signal of divergence. Rapid growth of the gradient norm (more than three times growth from the lowest value) or persistent spikes (benign spikes should return to normal values within a few iterations) can lead to divergence issues. In our training, we observed a stable gradient norm trend even with BF16, as illustrated in the following figure.

The spikes in our gradient norm often last for a single step and don’t impact the overall training convergence. Specifically, we first tracked a running average (r) of the gradient norm over a window of 20 steps to smooth out the natural fluctuations due to batching. We defined a gradient spike as occurring when the current gradient norm is higher than r + 0.1. Next, we tracked the number of steps it took for the gradient norm to return to less than r + 0.1. In over 86% of cases, the spike deviated from the running average for only a single step, as shown in the following figure.
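The following sketch illustrates this spike-tracking logic over a stream of logged gradient norms; the window size and threshold match the values described above, and the input is assumed to be a simple list of per-step gradient norms:

from collections import deque

def track_gradient_spikes(grad_norms, window=20, threshold=0.1):
    # Returns a list of (spike_start_step, duration_in_steps)
    history = deque(maxlen=window)
    spikes = []
    spike_start = None
    for step, g in enumerate(grad_norms):
        if len(history) == window:
            running_avg = sum(history) / window
            if g > running_avg + threshold:
                if spike_start is None:
                    spike_start = step
            elif spike_start is not None:
                spikes.append((spike_start, step - spike_start))
                spike_start = None
        history.append(g)
    return spikes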

Finally, we also monitored the parameter norm. This metric is a good way to monitor convergence during the initialization stage. For this setup, the initial values are around 1,600, which is expected based on empirical training results from other hardware.

Training results

In this section, we present the results for model quality evaluation and throughput scalability.

Model quality evaluation

The whole training process takes a few weeks. With the saved pre-training model, we benchmarked the model quality based on different tasks and compared it with OpenLlama 2-7B. The following table benchmarks the accuracy over a variety of tasks: MMLU, BBH, common reasoning, world knowledge, reading comprehension, math, and code. For OpenLLaMA 2, we used the available pre-trained weights and evaluated using the same evaluation pipeline as our pre-trained model. Overall, the model trained on Trn1 shows better or comparable accuracy for all tasks except common reasoning.

| Task | Shots | Metric | Llama 2-7B on Trn1 | OpenLlama-2 |
| --- | --- | --- | --- | --- |
| MMLU | 5 | accuracy | 41.318 (3.602) | 41.075 (3.611) |
| BBH | 3 | multiple_choice_grade | 36.565 (1.845) | 35.502 (1.861) |
| Common Reasoning | 0 | accuracy | 56.152 (1.194) | 56.893 (1.195) |
|  |  | accuracy_norm | 59.455 (1.206) | 61.262 (1.19) |
| World Knowledge | 5 | average exact match | 38.846 (0.534) | 37.023 (0.52) |
| Reading Comprehension | 0 | accuracy | 72.508 (0.781) | 72.416 (0.782) |
| Math | 8 | accuracy | 9.401 (0.804) | 5.231 (0.613) |
| Code | 0 | pass@1 | 7.62 | 9.06 |
|  |  | pass@10 | 19.83 | 23.58 |
|  |  | pass@100 | 34.15 | 40.24 |

We also verified that the model accuracy keeps increasing by training more tokens in the dataset. For comparison, we tracked the model accuracy using saved intermediate checkpoints for different tasks, as shown in the following figures.

The first figure shows the model accuracy for world knowledge.

The following figure shows the model accuracy for common reasoning.

The following figure shows the model accuracy for math.

We observed that the accuracy increases with more training tokens for different tasks.

The model quality could be further improved with fine-tuning for specific tasks based on domain specific dataset.

Throughput scalability

In addition to the model quality, we checked the training throughput scaling and got more than 90% scaling efficiency for Llama 2-70B for 64 instances, as shown in the following figure. The Llama 2-7B scaling efficiency is slightly lower because the model size is relatively small for a cluster at this scale.

Clean up

To clean up all the provisioned resources for this post, use the following code and the cleanup script described in Train Llama2 with AWS Trainium on Amazon EKS:

./cleanup.sh

Conclusion

This post showed an end-to-end training example for the Llama 2-7B model on a dataset of up to 1.8 trillion tokens using a cluster of 128 trn1.32xlarge instances. We discussed best practices to overcome the challenges associated with this type of large model training: hardware stability and recovery, model training stability and convergence, and throughput optimization. The saved training model demonstrated good model quality on the general tasks and showed a strong cost benefit from training on purpose-built Trainium accelerators. To learn more about the model architectures supported for training on Trainium and access tutorials, refer to Training Samples/Tutorials.

Reference

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium, https://arxiv.org/pdf/2404.10630


About the Authors

Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI field. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Fei Chen has 15 years’ industry experiences of leading teams in developing and productizing AI/ML at internet scale. At AWS, she leads the worldwide solution teams in Advanced Compute, including AI accelerators, HPC, IoT, visual and spatial compute, and the emerging technology focusing on technical innovations (AI and generative AI) in the aforementioned domains.

Haozheng Fan is a software engineer at AWS. He is interested in large language models (LLMs) in production, including pre-training, fine-tuning, and evaluation. His works span from framework application level to hardware kernel level. He currently works on LLM training on novel hardware, with a focus on training efficiency and model quality.

Hao Zhou is a Research Scientist with Amazon SageMaker. Before that, he worked on developing machine learning methods for fraud detection for Amazon Fraud Detector. He is passionate about applying machine learning, optimization, and generative AI techniques to various real-world problems. He holds a PhD in Electrical Engineering from Northwestern University.

Yida Wang is a principal scientist in the AWS AI team of Amazon. His research interest is in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. The mission is to bridge the high-level models from various frameworks and low-level hardware platforms including CPUs, GPUs, and AI accelerators, so that different models can run in high performance on different devices.

Jun (Luke) Huan is a Principal Scientist at AWS AI Labs. Dr. Huan works on AI and data science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated 11 PhD students. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI startup, and worked as the CEO and Chief Scientist from 2019–2021. Before joining the industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015–2018, he worked as a program director at the US NSF, in charge of its big data program.

Fine-tune large multimodal models using Amazon SageMaker

Large multimodal models (LMMs) integrate multiple data types into a single model. By combining text data with images and other modalities during training, multimodal models such as Claude3, GPT-4V, and Gemini Pro Vision gain more comprehensive understanding and improved ability to process diverse data types. The multimodal approach allows models to handle a wider range of real-world tasks that involve both text and non-text inputs. In this way, multimodality helps overcome the restrictions of pure text models. LMMs have the potential to profoundly impact various industries, such as healthcare, business analysis, autonomous driving, and so on.

However, a general-purpose language model can only process relatively simple visual tasks such as answering basic questions about an image or generating short captions. This is primarily due to the lack of access to detailed pixel-level information, object segmentation data, and other granular annotations that would allow the model to precisely understand and reason about the various elements, relationships, and context within an image. Without this fine-grained visual understanding, the language model is constrained to more superficial, high-level analysis and generation capabilities related to images. Fine-tuning LMMs on domain-specific data can significantly improve their performance for targeted tasks. The prospect of fine-tuning open source multimodal models like LLaVA is highly appealing because of their cost effectiveness, scalability, and impressive performance on multimodal benchmarks. For those seeking flexible and economical solutions, the ability to use and customize these powerful models holds immense potential.

In this blog post, we demonstrate how to fine-tune and deploy the LLaVA model on Amazon SageMaker. The source code is available in this GitHub repository.

LLaVA overview

LLaVA is trained end-to-end to enable general-purpose understanding across both visual and textual data. In the LLaVA model architecture, pre-trained language models such as Vicuna or LLaMA are combined with visual models such as CLIP’s visual encoder. The integration converts the visual features from images into a format that matches the language model’s embeddings through a projection layer.

LLaVA training happens in two stages, as shown in Figure 1 that follows. The first stage is pre-training, which uses image-text pairs to align the visual features with the language model’s embeddings. In this stage, the visual encoder and language model weights are kept frozen, and only the projection matrix is trained. The second stage is fine-tuning the whole model end-to-end. Here, the visual encoder’s weights are frozen, while the projection layer and language model are updated.

Figure 1: LLaVA architecture

Prepare data

When it comes to fine-tuning the LLaVA model for specific tasks or domains, data preparation is of paramount importance because having high-quality, comprehensive annotations enables the model to learn rich representations and achieve human-level performance on complex visual reasoning challenges. In this post, we focus on preparing an instruction dataset.

Data annotation

The dataset should contain image-text pairs that involve reasoning to answer questions about images. To help the model gain comprehensive understanding during the training process, text data should be enriched with contextual nuances. For example, instead of simply asking the model to describe the image, ask specific questions about the image that relate to its content.

To demonstrate LLaVA’s capabilities, we created a small synthetic dataset focused on understanding and interpreting infographics and charts. We used Amazon Bedrock and Python for this task. Specifically, we employed the Amazon Bedrock LLaMA2-70B model to generate text descriptions and question-answer pairs based on those descriptions. Subsequently, we used Python to generate different types of visual presentation such as pie charts and funnel charts based on the text descriptions. If you already have an existing dataset, this method can be used as a data augmentation technique to expand your dataset and potentially enhance the fine-tuning outcome. By creating synthetic examples of text descriptions, question-answer pairs, and corresponding charts, you can augment your dataset with multimodal examples tailored to your specific use case.
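The following sketch outlines one iteration of that generation loop: it asks a text model on Amazon Bedrock for a small data description and then renders it as a pie chart with matplotlib. The prompt, the model ID, and the assumption that the model returns parseable JSON are simplifications; the actual pipeline also generated question-answer pairs from each description:

import json

import boto3
import matplotlib.pyplot as plt

bedrock_runtime = boto3.client("bedrock-runtime")

# Ask the model for a small category/percentage breakdown (prompt and model ID are examples)
prompt = ("Invent a plausible distribution of daily screen time across four ranges. "
          "Return only a JSON object mapping each range label to a percentage.")
response = bedrock_runtime.invoke_model(
    modelId="meta.llama2-70b-chat-v1",
    body=json.dumps({"prompt": prompt, "max_gen_len": 256, "temperature": 0.7}),
)
generation = json.loads(response["body"].read())["generation"]
data = json.loads(generation)  # assumes the model returned valid JSON

# Render the description as a chart image for the training set
fig, ax = plt.subplots()
ax.pie(list(data.values()), labels=list(data.keys()), autopct="%1.0f%%")
ax.set_title("Distribution of daily screen time")
fig.savefig("screen_time.png")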

The dataset we created consists of image-text pairs, with each image being an infographic, chart, or other data visualization. The corresponding text is a series of questions about the infographic along with ground truth answers, formatted in a question-answer style intended to resemble how a human might ask the model about the information contained in the image. Some examples of generated questions for images as shown in Figure 2 include:

  • What is the percentage of people who spend less than 2 hours a day on screen time?
  • What proportion of people do not exercise at all weekly?
  • How many people are teachers?

Figure 2: Example charts in the training dataset (left is a pie chart of distribution of daily screen time, right is a funnel chart of occupation)

Data structure

These image-text pairs must be formatted in JSON lines (.jsonl) format, where each line is a training sample. An example training sample follows. Specifically, the id field is the unique identifier of a training sample, the image field specifies the name of the image, and the conversations field provides a question-and-answer pair.

{
  "id": "1",
  "image": "screen_time.png",
  "conversations": [
    {
      "from": "human",
      "value": "What is the percentage of people who spend less than 2 hours a day on screen time?"
    },
    {
      "from": "gpt",
      "value": "15%"
    }
  ]
}
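A small helper like the following can serialize your generated question-answer pairs into this format. The field names mirror the sample above, and the structure of the input records is an assumption to adapt to your own generation step:

import json

def write_llava_training_file(samples, output_path):
    # samples: list of dicts with keys "id", "image", "question", and "answer"
    with open(output_path, "w") as f:
        for s in samples:
            record = {
                "id": s["id"],
                "image": s["image"],
                "conversations": [
                    {"from": "human", "value": s["question"]},
                    {"from": "gpt", "value": s["answer"]},
                ],
            }
            f.write(json.dumps(record) + "\n")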

By training the model to answer in-depth and analytical questions about infographics it hasn’t seen before, we aim to strengthen the model’s ability to generalize its understanding of data visualizations and draw accurate insights.

Fine tune the model

After the data is prepared, we upload it to Amazon Simple Storage Service (Amazon S3) as the SageMaker training input. In configuring the SageMaker training job, we use the TrainingInput object to specify the input data location in Amazon S3 and define how SageMaker should handle it during training. In this case, input_mode='FastFile' indicates the use of S3 fast file mode, which is ideal for scenarios where the dataset is stored as individual files in S3. S3 fast file mode is also advantageous when working with large datasets or when fast access to data is critical for training performance.

from sagemaker.inputs import TrainingInput

training_input = TrainingInput(
    s3_data_type="S3Prefix",  # Available Options: S3Prefix | ManifestFile | AugmentedManifestFile
    s3_data=s3uri,
    distribution="FullyReplicated",  # Available Options: FullyReplicated | ShardedByS3Key
    input_mode="FastFile",
)

We will reuse the training script from LLaVA, which uses DeepSpeed for training efficiency. DeepSpeed is a library that helps train very large deep learning models faster and more efficiently. ZeRO, short for Zero Redundancy Optimizer, is a memory optimization technique in DeepSpeed that reduces the required memory footprint for data parallelism by partitioning optimization states and gradients across data-parallel processes, enabling larger model sizes and batch sizes within limited GPU memory. This allows you to train much larger models on the same hardware. ZeRO Stage 2 reduces memory usage by splitting the optimizer states and gradients across multiple processes. Each process only stores a part of these, reducing the memory needed per process. If you run into CUDA memory errors with this configuration, try the Stage 3 configuration instead. Stage 3 additionally partitions the model parameters and can offload them to the CPU, which slows training but might solve the memory issue. The training command follows. See the LLaVA: Large Language and Vision Assistant repository on GitHub for more details about the training parameters.

#!/bin/bash
# Set the prompt and model versions directly in the command
deepspeed /root/LLaVA/llava/train/train_mem.py \
    --deepspeed /root/LLaVA/scripts/zero2.json \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --bits 4 \
    --model_name_or_path /root/LLaVA/llava/llava-v1.5-7b \
    --version llava_llama_2 \
    --data_path /root/dataset/train/dataset.json \
    --validation_data_path /root/dataset/validation/dataset.json \
    --image_folder /root/dataset/images/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /root/LLaVA/llava/checkpoints/llama-2-7b-chat-task-qlora \
    --num_train_epochs 500 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "epoch" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

LLaVA allows you to fine-tune all parameters of the base model or use LoRA to tune a smaller number of parameters. LoRA’s strategy keeps the original pre-trained model backbone unchanged and adds new, easier-to-train layers. This allows quick adaptation to new tasks without retraining the whole network. You can use the lora_enable parameter to specify the fine-tuning method. For full parameter fine-tuning, ml.p4d.24xlarge is recommended, while ml.g5.12xlarge is sufficient for LoRA fine-tuning if LLaMA-13B language model is used.

The following code initializes a SageMaker Estimator using the HuggingFace SDK. It sets up a SageMaker training job to run the custom training script from LLaVA. This allows the script to be run within the SageMaker managed environment, benefiting from its scalability. Then we bring our own Docker container to run the SageMaker training job. You can download the Docker image from this code repo, where the dependencies of the training LLaVA model are installed. To learn more about how to adapt your own Docker container to work with SageMaker, see adapting your own training container.

huggingface_estimator = HuggingFace(
    entry_point="finetune-lora-piechart-QA.sh",
    source_dir="./LLaVA",
    instance_type=instance_type,
    instance_count=instance_count,
    py_version=PYTHON_VERSION,
    image_uri=CONTAINER_URI,
    role=ROLE,
    metric_definitions=metric_definitions,
    environment=environment,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    output_path=output_uri,
    checkpoint_s3_uri=checkpoint_uri,
)

For logging purposes, you can use metric definitions to extract key metrics from the training script’s printed logs and send them to Amazon CloudWatch. The following is an example metric definition that logs training loss at each epoch, the model’s learning rate, and training throughput.

metric_definitions = [
    {"Name": "loss", "Regex": "'loss': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "learning_rate", "Regex": "'learning_rate': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_runtime", "Regex": "'train_runtime': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_samples_per_second", "Regex": "'train_samples_per_second': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_steps_per_second", "Regex": "'train_steps_per_second': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_loss", "Regex": "'train_loss': ([0-9]+(.|e-)[0-9]+),?"},
]

Deploy and test

After the training job finishes, the fine-tuned model is uploaded to Amazon S3. You can then use the following code to deploy the model on SageMaker.

HF_TASK = "question-answering"
config = dict(HF_TASK=HF_TASK)
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_model_path,
    role=get_execution_role(),
    transformers_version=TRANSFORMERS_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    model_server_workers=1,
    env=config,
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count, instance_type=instance_type
)

For testing, provide an image and question pair and make an inference call against the SageMaker endpoint as follows:

prompt = "what is this chart about?"
data = {
    "image": http_img_path,
    "question": prompt,
    "temperature": 0.1,
}
output = predictor.predict(data)

Conclusion

Our exploration into fine-tuning the LLaVA visual language model on SageMaker for a custom visual question answering task has shed light on the advancements made in bridging the gap between textual and visual comprehension. LLaVA represents a significant step forward in multimodal AI, demonstrating the ability to jointly understand and reason about textual and visual information in a unified model. By using large-scale pre-training on image-text pairs, LLaVA has acquired robust visiolinguistic representations that can be effectively adapted to downstream tasks through fine-tuning. This enables LLaVA to excel at tasks that require deep comprehension of both modalities, such as visual question answering, image captioning, and multimodal information retrieval. However, the fine-tuning mechanism has limitations. In particular, adjusting the projection layer and language model while freezing the vision model presents challenges, such as the need for a large amount of training data and limited capability on demanding vision tasks. Confronting these challenges directly allows us to unlock the full potential of multimodal models, paving the way for more sophisticated applications.

Acknowledgement

The authors extend their gratitude to Manoj Ravi, Jenny Vega, and Santhosh Kuriakose for their insightful feedback and review of the post.

About the Authors

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families.

Jun Shi is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML infrastructure and applications. He has over a decade experience in the FinTech industry as software engineer.

Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Accelerate Mixtral 8x7B pre-training with expert parallelism on Amazon SageMaker

Mixture of Experts (MoE) architectures for large language models (LLMs) have recently gained popularity due to their ability to increase model capacity and computational efficiency compared to fully dense models. By utilizing sparse expert subnetworks that process different subsets of tokens, MoE models can effectively increase the number of parameters while requiring less computation per token during training and inference. This enables more cost-effective training of larger models within fixed compute budgets compared to dense architectures.

Despite their computational benefits, training and fine-tuning large MoE models efficiently presents some challenges. MoE models can struggle with load balancing if the tokens aren’t evenly distributed across experts during training, and some experts may become overloaded while others are under-utilized. MoE models have high memory requirements, because all expert parameters need to be loaded into memory even though only a subset is used for each input.

In this post, we highlight new features of the Amazon SageMaker model parallelism library (SMP) that enable efficient training of MoE models using expert parallelism. Expert parallelism is a type of parallelism that splits the experts of an MoE model across separate workers or devices, similar to how tensor parallelism partitions dense model layers. We demonstrate how to use these new SMP features by pre-training the 47-billion-parameter Mixtral 8x7B MoE model using expert parallelism. To learn more, refer to our GitHub repo and Expert parallelism.

Expert parallelism

The Mixtral 8x7B model has a sparse MoE architecture, containing eight expert subnetworks with around 7 billion parameters each. A trainable gate network called a router determines which input tokens are sent to which expert. With this architecture, the experts specialize in processing different aspects of the input data. The complete Mixtral 8x7B model has a total of 47 billion parameters, but only around 12.9 billion (two experts, for this model architecture) are activated for any given input token; this results in improved computational efficiency relative to a dense model of the same total size. To learn more about the MoE architecture in general, refer to Applying Mixture of Experts in LLM Architectures.
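As a rough sanity check on these numbers, the following sketch estimates the active parameter count per token from the approximate Mixtral 8x7B dimensions used later in this post. The per-expert formula and the shared-parameter remainder are simplifications, so the result only lands in the same ballpark as the ~12.9 billion figure cited above:

# Back-of-the-envelope estimate of active parameters per token for Mixtral 8x7B.
# Dimensions follow the model configuration used later in this post; the split into
# expert vs. shared parameters is an approximation.
hidden, intermediate, layers = 4096, 14336, 32
num_experts, active_experts = 8, 2
total_params = 47e9

# Each expert is a gated FFN with three hidden x intermediate weight matrices.
params_per_expert_per_layer = 3 * hidden * intermediate
all_expert_params = params_per_expert_per_layer * num_experts * layers
shared_params = total_params - all_expert_params  # attention, embeddings, routers, norms

active_params = params_per_expert_per_layer * active_experts * layers + shared_params
print(f"Expert parameters (all experts): {all_expert_params / 1e9:.1f}B")
print(f"Approximate active parameters per token: {active_params / 1e9:.1f}B")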

SMP adds support for expert parallelism

SMP now supports expert parallelism, which is essential to performant MoE model training. With expert parallelism, different expert subnetworks that comprise the MoE layers are placed on separate devices. During training, different data is routed to the different devices, with each device handling the computation for the experts it contains. By distributing experts across workers, expert parallelism addresses the high memory requirements of loading all experts on a single device and enables MoE training on a larger cluster. The following figure offers a simplified look at how expert parallelism works on a multi-GPU cluster.

The SMP library uses NVIDIA Megatron to implement expert parallelism and support training MoE models, and runs on top of PyTorch Fully Sharded Data Parallel (FSDP) APIs. You can keep using your PyTorch FSDP training code as is and activate SMP expert parallelism for training MoE models. SMP offers a simplified workflow where you need to specify the expert_parallel_degree parameter, which will evenly divide experts across the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, you can set the expert_parallel_degree to 2, 4, or 8. We recommend that you start with a small number and gradually increase it until the model fits in the GPU memory.

SMP’s expert parallelism is compatible with sharded data parallelism

SMP’s expert parallel implementation is compatible with sharded data parallelism, enabling more memory-efficient and faster training. To understand how this works, consider an MoE model in the following example with eight experts (N=8) training on a simple cluster with one node containing 4 GPUs.

SMP’s expert parallelism splits the MoE experts across GPUs. You control how many experts are instantiated on each device by using the expert_parallel_degree parameter. For example, if you set the degree to 2, SMP will assign half of the eight experts to each data parallel group. The degree value must be a factor of the number of GPUs in your cluster and the number of experts in your model. Data is dynamically routed to and from the GPU or GPUs hosting the selected expert using all-to-all GPU communication.

Next, sharded data parallelism partitions and distributes the experts as well as the non-MoE layers of the model, like attention or routers, across your cluster to reduce the memory footprint of the model. The hybrid_shard_degree parameter controls this. For example, a hybrid_shard_degree of 2 will shard the model states (including experts and non-MoE layers) across half of the GPUs in our cluster. The product of expert_parallel_degree and hybrid_shard_degree should not exceed the world size of the cluster. In the following example, hybrid_shard_degree * expert_parallel_degree = 4 is a valid configuration.
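These constraints are straightforward to check up front. The following is a minimal sketch (not part of the SMP API) that validates a candidate configuration such as the 8-expert, 4-GPU example above:

def validate_moe_parallel_config(world_size, num_experts, expert_parallel_degree, hybrid_shard_degree):
    """Illustrative sanity check of an expert-parallel / hybrid-shard configuration."""
    # The expert parallel degree must evenly divide both the expert count and the GPU count.
    assert num_experts % expert_parallel_degree == 0, "experts must divide evenly across the expert-parallel group"
    assert world_size % expert_parallel_degree == 0, "expert_parallel_degree must be a factor of the world size"
    # Sharding stacked on top of expert parallelism cannot exceed the cluster size.
    assert expert_parallel_degree * hybrid_shard_degree <= world_size, "degree product exceeds the world size"
    return num_experts // expert_parallel_degree  # experts hosted per expert-parallel rank

# The example in the text: 8 experts, 4 GPUs, expert_parallel_degree=2, hybrid_shard_degree=2
print(validate_moe_parallel_config(world_size=4, num_experts=8,
                                   expert_parallel_degree=2, hybrid_shard_degree=2))  # prints 4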

Solution overview

With the background out of the way, let’s dig into the components of our distributed training architecture. The following diagram illustrates the solution architecture.

In this example, we use SageMaker training jobs. With SageMaker training jobs, you can launch and manage clusters of high-performance instances with simple API calls. For example, you can use the SageMaker Estimator to specify the type and quantity of instances to use in your distributed systems with just a few lines of code. Later in this post, we use a cluster of two ml.p4d.24xlarge instances to train our model by specifying these parameters in our Estimator. To learn about SageMaker training jobs, see Train a Model with Amazon SageMaker.

In this post, we use the SMP library to efficiently distribute the workload across the cluster using hybrid sharded data parallelism and expert parallelism. In addition to these implementations, SMP offers many other performance-improving and memory-saving techniques, such as:

  • Mixed precision training and fp8 support for dense Llama models (which accelerates distributed training and takes advantage of the performance improvements on P5 instances)
  • Tensor parallelism composable with sharded data parallelism
  • Delayed parameter initialization
  • Activation checkpointing (a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass)

For the latest updates, refer to SageMaker model parallelism library v2.

Along with SMP, this example also uses the SageMaker distributed data parallel library (SMDDP). As you scale your workload and add instances to your cluster, the overhead of communication between instances also increases, which can lead to a drop in overall computational performance and training efficiency. This is where SMDDP helps. SMDDP includes optimized communication collectives such as AllGather that are designed for AWS network infrastructure. Because of this, SMDDP can outperform other more general communications libraries such as NCCL when training on SageMaker.

Together, the SMP and SMDDP libraries can accelerate large distributed training workloads by up to 20%. Additionally, these libraries are compatible with standard PyTorch APIs and capabilities, which makes it convenient to adapt any existing PyTorch FSDP training script to the SageMaker training platform and take advantage of the performance improvements that SMP and SMDDP provide. To learn more, see SageMaker model parallelism library v2 and Run distributed training with the SageMaker distributed data parallelism library.

In the following sections, we showcase how you can accelerate distributed training of the Hugging Face Transformers Mixtral 8x7B model on P4 instances using SMP and SMDDP.

Prerequisites

You need to complete some prerequisites before you can run the Mixtral notebook.

First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer to be used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request a minimum of 2 P4d instances ranging to a maximum of 8 P4d instances (depending on time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

  • P4 instances (ml.p4d.24xlarge) for training job usage: 2–8

It may take up to 24 hours for the quota increase to get approved.

Now that you’re ready to begin the process to pre-train the Mixtral model, we start with dataset preparation in the next step.

Prepare the dataset

We begin our tutorial by preparing the dataset. This covers loading the GLUE/SST2 dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:

  1. You first need to load the GLUE/SST2 dataset and split it into training and validation datasets:
    hyperparameters = {
        "cache_dir": "tmp",
        "dataset_config_name": "sst2",
        "dataset_name": "glue",
        "do_train": True,
        "do_eval": True,
    }
    
    raw_datasets = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
    )
    
    del raw_datasets["validation"]
    
    if "validation" not in raw_datasets.keys():
        validation_percentage = "10%"
    
        raw_datasets["validation"] = load_dataset(
            hyperparameters["dataset_name"],
            hyperparameters["dataset_config_name"],
            split=f"train[:{validation_percentage}]",
            cache_dir=hyperparameters["cache_dir"],
        )
    
        raw_datasets["train"] = load_dataset(
            hyperparameters["dataset_name"],
            hyperparameters["dataset_config_name"],
            split=f"train[{validation_percentage}:]",
            cache_dir=hyperparameters["cache_dir"],
        )

  2. Load the Mixtral-8x7B tokenizer from the Hugging Face Transformers library:
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1", **tokenizer_kwargs)

Next, you define two utility functions: tokenize_function() and group_texts(). The tokenize_function() runs the tokenizer on the text data. The group_texts() function concatenates all texts from the dataset and generates chunks of a block size that corresponds to the model’s input length (2048) for this example. By chunking the text data into smaller pieces, you make sure the model can process the entire dataset during training, even if some text examples are longer than the input length (2048).

  1. Define the functions with the following code:
    def tokenize_function(examples):
        ...
        
        output = tokenizer(examples[text_column_name])
        return output
    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        
        # Drop the small remainder so every chunk has exactly block_size tokens.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of block_size.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

  2. Call the preceding utility functions on your dataset to tokenize and generate chunks suitable for the model:
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=1, remove_columns=column_names)
    lm_datasets = tokenized_datasets.map(group_texts, batched=True)

  3. Prepare the training and validation datasets for SageMaker training by saving them as JSON files and constructing the S3 paths where these files will be uploaded:
    train_dataset = lm_datasets["train"]
    train_dataset.to_json("./training.json")
    training_dataset_location = f"s3://{default_bucket}/dataset/train/"
    
     
    eval_dataset = lm_datasets["validation"]
    eval_dataset.to_json("./validation.json")
    validation_dataset_location = f"s3://{default_bucket}/dataset/validation/"

  4. Finally, set up the data channels for SageMaker training by creating TrainingInput objects from the provided S3 bucket paths for the training and test/validation datasets:
    train = sagemaker.inputs.TrainingInput(
                s3_train_bucket, distribution="FullyReplicated", 
                s3_data_type="S3Prefix"
            )
    data_channels = {"train": train}
    
    test = sagemaker.inputs.TrainingInput(
                s3_test_bucket, distribution="FullyReplicated", 
                s3_data_type="S3Prefix"
            )
    data_channels["test"] = test

You’re now ready to run pre-training or fine-tuning on the dataset.

Pre-train Mixtral 8x7B with expert parallelism on SMP

To pre-train the Mixtral 8x7B model, complete the following steps:

  1. Initialize the script with torch.sagemaker.init() to activate the SMP library:
    import torch.sagemaker as tsm
    tsm.init()

  2. Import the MoEConfig class from the torch.sagemaker.transform API. We use the MoEConfig class to enable the model to use the SMP implementation of MoE:
    from torch.sagemaker.moe.moe_config import MoEConfig

  3. Create a model configuration for the Mixtral 8x7B model. This will be passed to AutoModelForCausalLM.from_config(model_config, attn_implementation="flash_attention_2") from the Hugging Face Transformers library to initialize the model with random weights. If you want to fine-tune, you can provide the path to the pre-trained weights instead of the model configuration.
    model_config = MixtralConfig(
                vocab_size=args.vocab_size, # 32000,
                hidden_size=args.hidden_width, # 4096,
                intermediate_size=args.intermediate_size, # 14336,
                num_hidden_layers=args.num_layers, # 32,
                num_attention_heads=args.num_heads, # 32,
                num_key_value_heads=args.num_key_value_heads, # 8,
                hidden_act="silu",
                max_position_embeddings=args.max_context_width, # 4096 * 32,
                initializer_range=args.initializer_range, # 0.02,
                rms_norm_eps=1e-5,
                use_cache=False,
                pad_token_id=None,
                bos_token_id=1,
                eos_token_id=2,
                tie_word_embeddings=False,
                rope_theta=1e6,
                sliding_window=args.sliding_window, # None,
                attention_dropout=0.0,
                num_experts_per_tok=args.num_experts_per_tok, # 2,
                num_local_experts=args.num_local_experts, # 8,
                output_router_logits=False,
                router_aux_loss_coef=0.001,
            )
           
    model = AutoModelForCausalLM.from_config(model_config, torch_dtype=dtype, attn_implementation="flash_attention_2")

In the example Jupyter Notebook, you use a create_model() function that invokes the AutoModelForCausalLM.from_config() function.

  1. Create the SMP MoE configuration class. The values used in the following code are supplied through the training estimator’s hyperparameters in subsequent steps. To learn more about the SMP MoEConfig class, see torch.sagemaker.moe.moe_config.MoEConfig.
    moe_config = MoEConfig(
                        smp_moe=args.use_smp_implementation > 0, #Whether to use the SMP-implementation of MoE. The default value is True.
                        random_seed=args.seed, # A seed number for the random operations in expert-parallel distributed modules. This seed will be added to the expert parallel rank to set the actual seed for each rank. It is unique for each expert parallel rank. The default value is 12345.
                        moe_load_balancing=args.moe_load_balancing, #Specify the load balancing type of the MoE router. Valid options are aux_loss, sinkhorn, balanced, and none. The default value is sinkhorn.
                        global_token_shuffle=args.global_token_shuffle > 0,  #Whether to shuffle tokens across EP ranks within the same expert parallel group. The default value is False
                        moe_all_to_all_dispatcher=args.moe_all_to_all_dispatcher > 0, #Whether to use all-to-all dispatcher for the communications in MoE. The default value is True.
                    )

  2. With the model and MoE configuration ready, you wrap the model with the SMP transform API and pass the MoE configuration. Here, the tsm.transform method adapts the model from Hugging Face format to SMP format. For more information, refer to torch.sagemaker.transform.
    model = tsm.transform(
            model, 
            config=moe_config,
        )

  3. Define the training hyperparameters, including the MoE configuration and other settings specific to the model and training setup:
    hyperparameters = {
        # MoE config
        "moe": 1,
        "moe_load_balancing": "sinkhorn",
        "moe_all_to_all_dispatcher": 1,
        "seed": 12345,
        #rest of hyperparameters
        ...
        "model_type": "mixtral",
        "sharding_strategy": "hybrid_shard",
        "delayed_param": 1, 
        "epochs": 100,
        "activation_checkpointing": 1,
        "beta1": 0.9,
        "bf16": 1,
        "fp8": 0,
        "checkpoint_dir": "/opt/ml/checkpoints",
        ...
        ...
        
    }

We enable delayed parameter initialization in SMP, which allows initializing large models on a meta device without attaching data. This can resolve limited GPU memory issues when you first load the model. This approach is particularly useful for training LLMs with tens of billions of parameters, where even CPU memory might not be sufficient for initialization.

SMP supports various routing strategies, including sinkhorn, balanced, and aux_loss. Each provides distinct load balancing approaches to achieve equitable token assignment among experts, thereby maintaining balanced workload distribution.

  1. Specify the parameters for expert_parallel_degree and hybrid_shard_degree:
    expert_parallel_degree = 2  # An integer in [1, world_size]
    hybrid_shard_degree = (
        8  # An integer in [0, world_size // expert_parallel_degree] and its default value is 0.
    )

Hybrid sharding is a memory saving technique between `FULL_SHARD` and `NO_SHARD`, with `FULL_SHARD` saving the most memory and `NO_SHARD` not saving any. This technique shards parameters within the hybrid shard degree (HSD) group and replicates parameters across groups. The HSD controls sharding across GPUs and can be set to an integer from 0 to `world_size`.

An HSD of 8 applies `FULL_SHARD` within a node and then replicates parameters across nodes because there are 8 GPUs in the nodes we are using. This results in reduced communication volume because expensive all-gathers and reduce-scatters are only done within a node, which can be more performant for medium-sized models. Generally, you want to use the smallest HSD that doesn’t cause out of memory (OOM) errors. If you’re experiencing OOM, try increasing the hybrid shard degree to reduce memory usage on each node.
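To make the grouping concrete, the following sketch (illustrative only, not SMP internals) shows how an HSD of 8 partitions the 16 GPUs of a two-node ml.p4d.24xlarge cluster into sharding groups; model states are sharded within each group and replicated across groups:

def hybrid_shard_groups(world_size, hybrid_shard_degree):
    # Each inner list is one sharding group: FULL_SHARD applies within the group,
    # and the groups hold replicas of each other.
    assert world_size % hybrid_shard_degree == 0
    return [list(range(start, start + hybrid_shard_degree))
            for start in range(0, world_size, hybrid_shard_degree)]

# Two ml.p4d.24xlarge instances = 16 GPUs; an HSD of 8 shards within a node and replicates across nodes.
print(hybrid_shard_groups(world_size=16, hybrid_shard_degree=8))
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]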

  1. With all the necessary configurations in place, you now create the PyTorch estimator function to encapsulate the training setup and launch the training job. We run the pre-training on the 2 ml.p4d.24xlarge instances, where each instance contains 8 A100 Nvidia GPUs:
    smp_estimator = PyTorch(
        entry_point="train.py",
        hyperparameters=hyperparameters,
        role=role,
        checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=hyperparameters["checkpoint_dir"],
        instance_type="ml.p4d.24xlarge",
        volume_size=400,
        instance_count=2,
        sagemaker_session=sagemaker_session,
        ...
        distribution={
            "torch_distributed": {
                "enabled": True,
            },
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "activation_loading_horizon": activation_loading_horizon,
                        "hybrid_shard_degree": hybrid_shard_degree,
                        "sm_activation_offloading": offload_activations,
                        "expert_parallel_degree": expert_parallel_degree,
                    },
                }
            },
        },
        py_version="py310",
        framework_version="2.2.0",
        output_path=s3_output_bucket,
    )

  2. Finally, launch the pre-training workload:
    smp_estimator.fit(inputs=data_channels)

Clean up

As part of cleanup, you can delete the SageMaker default bucket created to host the GLUE/SST2 dataset.

Conclusion

Training large MoE language models like the 47-billion-parameter Mixtral 8x7B can be challenging due to high computational and memory requirements. By using expert parallelism and sharded data parallelism from the SageMaker model parallelism library, you can effectively scale these MoE architectures across multiple GPUs and workers.

SMP’s expert parallelism implementation seamlessly integrates with PyTorch and the Hugging Face Transformers library, allowing you to enable MoE training using simple configuration flags without changing your existing model code. Additionally, SMP provides performance optimizations like hybrid sharding, delayed parameter initialization, and activation offloading and recomputation to further improve training efficiency.

For the complete sample to pre-train and fine-tune Mixtral 8x7B, see the GitHub repo.

Special thanks

Special thanks to Rahul Huilgol, Gautam Kumar, and Luis Quintela for their guidance and engineering leadership in developing this new capability.


About the Authors

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Teng Xu is a Software Development Engineer in the Distributed Training group in AWS AI. He enjoys reading.

Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.

Generating fashion product descriptions by fine-tuning a vision-language model with SageMaker and Amazon Bedrock

In the world of online retail, creating high-quality product descriptions for millions of products is a crucial but time-consuming task. Using machine learning (ML) and natural language processing (NLP) to automate product description generation has the potential to save manual effort and transform the way ecommerce platforms operate. One of the main advantages of high-quality product descriptions is the improvement in searchability. Customers can more easily locate products that have correct descriptions, because it allows the search engine to identify products that match not just the general category but also the specific attributes mentioned in the product description. For example, a product that has a description that includes words such as “long sleeve” and “cotton neck” will be returned if a consumer is looking for a “long sleeve cotton shirt.” Furthermore, having factual product descriptions can increase customer satisfaction by enabling a more personalized buying experience and improving the algorithms for recommending more relevant products to users, which raises the probability that users will make a purchase.

With the advancement of generative AI, we can use vision-language models (VLMs) to predict product attributes directly from images. Pre-trained image captioning or visual question answering (VQA) models perform well on describing everyday images but can’t capture the domain-specific nuances of ecommerce products needed to achieve satisfactory performance in all product categories. To solve this problem, this post shows you how to predict domain-specific product attributes from product images by fine-tuning a VLM on a fashion dataset using Amazon SageMaker, and then using Amazon Bedrock to generate product descriptions using the predicted attributes as input. So you can follow along, we’re sharing the code in a GitHub repository.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

You can use a managed service, such as Amazon Rekognition, to predict product attributes as explained in Automating product description generation with Amazon Bedrock. However, if you’re trying to extract specifics and detailed characteristics of your product or your domain (industry), fine-tuning a VLM on Amazon SageMaker is necessary.

Vision-language models

Since 2021, there has been a rise in interest in vision-language models (VLMs), which led to the release of solutions such as Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP). When it comes to tasks such as image captioning, text-guided image generation, and visual question answering, VLMs have demonstrated state-of-the-art performance.

In this post, we use BLIP-2, which was introduced in BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, as our VLM. BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model (LLM). We use a version of BLIP-2 that contains Flan-T5-XL as its LLM.

The following diagram illustrates the overview of BLIP-2:

Blip-2 architecture

Figure 1: BLIP-2 overview

The pre-trained version of the BLIP-2 model has been demonstrated in Build an image-to-text generative AI application using multimodality models on Amazon SageMaker and Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart. In this post, we demonstrate how to fine-tune BLIP-2 for a domain-specific use case.

Solution overview

The following diagram illustrates the solution architecture.

Solution architecture

Figure 2: High-level solution architecture

The high-level overview of the solution is:

  • An ML scientist uses SageMaker notebooks to process and split the data into training and validation datasets.
  • The datasets are uploaded to Amazon Simple Storage Service (Amazon S3) using the S3 client (a wrapper around an HTTP call).
  • The SageMaker client is then used to launch a SageMaker training job, again a wrapper for an HTTP call.
  • The training job manages copying the datasets from Amazon S3 to the training container, training the model, and saving its artifacts to Amazon S3.
  • Then, through another call of the SageMaker client, an endpoint is created, copying the model artifacts into the endpoint hosting container.
  • The inference workflow is then invoked through an AWS Lambda request, which first makes an HTTP request to the SageMaker endpoint, and then uses that to make another request to Amazon Bedrock.

In the following sections, we demonstrate how to:

  • Set up the development environment
  • Load and prepare the dataset
  • Fine-tune the BLIP-2 model to learn product attributes using SageMaker
  • Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker
  • Generate product descriptions from predicted product attributes using Amazon Bedrock

Set up the development environment

An AWS account is needed with an AWS Identity and Access Management (IAM) role that has permissions to manage resources created as part of the solution. For details, see Creating an AWS account.

We use Amazon SageMaker Studio with the ml.t3.medium instance and the Data Science 3.0 image. However, you can also use an Amazon SageMaker notebook instance or any integrated development environment (IDE) of your choice.

Note: Be sure to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, see Configure the AWS CLI.

An ml.g5.2xlarge instance is used for SageMaker Training jobs, and an ml.g5.2xlarge instance is used for SageMaker endpoints. Ensure sufficient capacity for this instance in your AWS account by requesting a quota increase if required. Also check the pricing of the on-demand instances.

You need to clone this GitHub repository to replicate the solution demonstrated in this post. First, launch the notebook main.ipynb in SageMaker Studio by selecting the Image as Data Science and Kernel as Python 3. Install all the required libraries listed in requirements.txt.

Load and prepare the dataset

For this post, we use the Kaggle Fashion Images Dataset, which contains 44,000 products with multiple category labels, descriptions, and high-resolution images. In this post, we want to demonstrate how to fine-tune a model to learn attributes such as fabric, fit, collar, pattern, and sleeve length of a shirt using the image and a question as inputs.

Each product is identified by an ID such as 38642, and there is a map to all the products in styles.csv. From here, we can fetch the image for this product from images/38642.jpg and the complete metadata from styles/38642.json. To fine-tune our model, we need to convert our structured examples into a collection of question and answer pairs. Our final dataset has the following format after processing for each attribute:

Id | Question | Answer
38642 | What is the fabric of the clothing in this picture? | Fabric: Cotton

After we process the dataset, we split it into training and validation sets, create CSV files, and upload the dataset to Amazon S3.
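The following is a minimal sketch of that processing step. The attribute list, metadata schema, and output file names are assumptions made for illustration (for example, the helper assumes each styles/<id>.json exposes attributes as a flat dictionary, which may differ from the raw Kaggle layout):

import json
import pandas as pd
from sklearn.model_selection import train_test_split

ATTRIBUTES = ["Fabric", "Fit", "Collar", "Pattern", "Sleeve Length"]  # assumed attribute set

def build_qa_rows(product_ids, styles_dir="styles"):
    rows = []
    for pid in product_ids:
        with open(f"{styles_dir}/{pid}.json") as f:
            metadata = json.load(f)  # assumed: attribute name -> attribute value
        for attribute in ATTRIBUTES:
            value = metadata.get(attribute)
            if value:
                rows.append({
                    "id": pid,
                    "question": f"What is the {attribute.lower()} of the clothing in this picture?",
                    "answer": f"{attribute}: {value}",
                })
    return pd.DataFrame(rows)

qa_df = build_qa_rows(product_ids=[38642])  # extend to every ID listed in styles.csv
train_df, val_df = train_test_split(qa_df, test_size=0.1, random_state=42)
train_df.to_csv("vqa_train.csv", index=False)
val_df.to_csv("vqa_validation.csv", index=False)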

Fine-tune the BLIP-2 model to learn product attributes using SageMaker

To launch a SageMaker Training job, we need the HuggingFace Estimator. SageMaker starts and manages all of the necessary Amazon Elastic Compute Cloud (Amazon EC2) instances for us, supplies the appropriate Hugging Face container, uploads the specified scripts, and downloads data from our S3 bucket to the container to /opt/ml/input/data.

We fine-tune BLIP-2 using the Low-Rank Adaptation (LoRA) technique, which adds trainable rank decomposition matrices to every Transformer layer while keeping the pre-trained model weights frozen. This technique can increase training throughput, reduce the required GPU memory by a factor of 3, and reduce the number of trainable parameters by a factor of 10,000. Despite using fewer trainable parameters, LoRA has been demonstrated to perform as well as or better than full fine-tuning.

We prepared entrypoint_vqa_finetuning.py which implements fine-tuning of BLIP-2 with the LoRA technique using Hugging Face Transformers, Accelerate, and Parameter-Efficient Fine-Tuning (PEFT). The script also merges the LoRA weights into the model weights after training. As a result, you can deploy the model as a normal model without any additional code.

from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration
 
model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-flan-t5-xl",
        device_map="auto",
        cache_dir="/tmp",
        load_in_8bit=True,
    )

config = LoraConfig(
    r=8, # Lora attention dimension.
    lora_alpha=32, # the alpha parameter for Lora scaling.
    lora_dropout=0.05, # the dropout probability for Lora layers.
    bias="none", # the bias type for Lora.
    target_modules=["q", "v"],
)

model = get_peft_model(model, config)
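A quick way to verify how small the trainable portion is after wrapping the model is PEFT’s built-in helper:

# Prints a summary of trainable parameters, total parameters, and trainable percentage
model.print_trainable_parameters()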

We reference entrypoint_vqa_finetuning.py as the entry_point in the Hugging Face Estimator.

from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'epochs': 10,
    'file-name': "vqa_train.csv",
}

estimator = HuggingFace(
    entry_point="entrypoint_vqa_finetuning.py",
    source_dir="../src",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge", 
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    hyperparameters = hyperparameters,
    base_job_name="VQA",
    sagemaker_session=sagemaker_session,
    output_path=f"{output_path}/models",
    code_location=f"{output_path}/code",
    volume_size=60,
    metric_definitions=[
        {'Name': 'batch_loss', 'Regex': 'Loss: ([0-9\.]+)'},
        {'Name': 'epoch_loss', 'Regex': 'Epoch Loss: ([0-9\.]+)'}
    ],
)

We can start our training job by calling the .fit() method and passing the Amazon S3 paths for our images and input file.

estimator.fit({"images": images_input, "input_file": input_file})

Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker

We deploy the fine-tuned BLIP-2 model to a SageMaker real-time endpoint using the Hugging Face Inference Container. You can also use the large model inference (LMI) container, which is described in more detail in Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart, which deploys a pre-trained BLIP-2 model. Here, we reference our fine-tuned model in Amazon S3 instead of the pre-trained model available in the Hugging Face hub. We first create the model and deploy the endpoint.

from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
   model_data=estimator.model_data,
   role=role,
   transformers_version="4.28",
   pytorch_version="2.0",
   py_version="py310",
   model_server_workers=1,
   sagemaker_session=sagemaker_session
)

endpoint_name = "endpoint-finetuned-blip2"
model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge", endpoint_name=endpoint_name )

When the endpoint status is InService, we can invoke the endpoint for the instructed vision-to-language generation task with an input image and a question as a prompt:

inputs = {
    "prompt": "What is the sleeve length of the shirt in this picture?",
    "image": image # image encoded in Base64
}
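You can send this payload with a SageMaker predictor attached to the endpoint. The following is a minimal sketch; the exact request and response shapes depend on the inference script bundled with the model:

from sagemaker.huggingface.model import HuggingFacePredictor

# Attach a JSON-serializing predictor to the already-deployed endpoint and send the payload
predictor = HuggingFacePredictor(endpoint_name=endpoint_name, sagemaker_session=sagemaker_session)
response = predictor.predict(inputs)
print(response)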

The output response looks like the following:

{"Sleeve Length": "Long Sleeves"}

Generate product descriptions from predicted product attributes using Amazon Bedrock

To get started with Amazon Bedrock, request access to the foundation models (they are not enabled by default). You can follow the steps in the documentation to enable model access. In this post, we use Anthropic’s Claude 3 Sonnet in Amazon Bedrock to generate product descriptions. Specifically, we use the model anthropic.claude-3-sonnet-20240229-v1:0 because it provides a good balance of performance and speed.

After creating the boto3 client for Amazon Bedrock, we create a prompt string that specifies that we want to generate product descriptions using the product attributes.

You are an expert in writing product descriptions for shirts. Use the data below to create product description for a website. The product description should contain all given attributes.
Provide some inspirational sentences, for example, how the fabric moves. Think about what a potential customer wants to know about the shirts. Here are the facts you need to create the product descriptions:
[Here we insert the predicted attributes by the BLIP-2 model]

The prompt and model parameters, including the maximum number of tokens used in the response and the temperature, are passed in the request body. The JSON response must be parsed before the generated text can be printed (see the sketch after the following code).

bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-west-2')

model_id = "anthropic.claude-3-sonnet-20240229-v1"

body = json.dumps(
    {"system": prompt, "messages": attributes_content, "max_tokens": 400, "temperature": 0.1, "anthropic_version": "bedrock-2023-05-31"}
)

response = bedrock.invoke_model(
    body=body,
    modelId=model_id,
    accept='application/json',
    contentType='application/json'
)
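The parsing step mentioned earlier might look like the following sketch; it assumes the Anthropic Messages API response format that Amazon Bedrock returns for Claude 3 models:

# Read and parse the streaming body, then pull the generated text out of the first content block
response_body = json.loads(response.get("body").read())
generated_description = response_body["content"][0]["text"]
print(generated_description)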

The generated product description response looks like the following:

"Classic Striped Shirt Relax into comfortable casual style with this classic collared striped shirt. With a regular fit that is neither too slim nor too loose, this versatile top layers perfectly under sweaters or jackets."

Conclusion

We’ve shown you how the combination of VLMs on SageMaker and LLMs on Amazon Bedrock present a powerful solution for automating fashion product description generation. By fine-tuning the BLIP-2 model on a fashion dataset using Amazon SageMaker, you can predict domain-specific and nuanced product attributes directly from images. Then, using the capabilities of Amazon Bedrock, you can generate product descriptions from the predicted product attributes, enhancing the searchability and personalization of ecommerce platforms. As we continue to explore the potential of generative AI, LLMs and VLMs emerge as a promising avenue for revolutionizing content generation in the ever-evolving landscape of online retail. As a next step, you can try fine-tuning this model on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.


About the Authors 

Antonia Wiebeler is a Data Scientist at the AWS Generative AI Innovation Center, where she enjoys building proofs of concept for customers. Her passion is exploring how generative AI can solve real-world problems and create value for customers. While she is not coding, she enjoys running and competing in triathlons.

Daniel Zagyva is a Data Scientist at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI, and machine learning operations.

Lun Yeh is a Machine Learning Engineer at AWS Professional Services. She specializes in NLP, forecasting, MLOps, and generative AI and helps customers adopt machine learning in their businesses. She graduated from TU Delft with a degree in Data Science & Technology.

Fotinos Kyriakides is an AI/ML Consultant at AWS Professional Services specializing in developing production-ready ML solutions and platforms for AWS customers. In his free time Fotinos enjoys running and exploring.

Create a multimodal assistant with advanced RAG and Amazon Bedrock

Retrieval Augmented Generation (RAG) models have emerged as a promising approach to enhance the capabilities of language models by incorporating external knowledge from large text corpora. However, despite their impressive performance in various natural language processing tasks, RAG models still face several limitations that need to be addressed.

Naive RAG models face limitations such as missing content, reasoning mismatch, and challenges in handling multimodal data. Although they can retrieve relevant information, they may struggle to generate complete and coherent responses when required information is absent, leading to incomplete or inaccurate outputs. Additionally, even with relevant information retrieved, the models may have difficulty correctly interpreting and reasoning over the content, resulting in inconsistencies or logical errors. Furthermore, effectively understanding and reasoning over multimodal data remains a significant challenge for these primarily text-based models.

In this post, we present a new approach named multimodal RAG (mmRAG) to tackle those existing limitations in greater detail. The solution intends to address these limitations for practical generative artificial intelligence (AI) assistant use cases. Additionally, we examine potential solutions to enhance the capabilities of large language models (LLMs) and visual language models (VLMs) with advanced LangChain capabilities, enabling them to generate more comprehensive, coherent, and accurate outputs while effectively handling multimodal data. The solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, providing a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution architecture

The mmRAG solution is based on a straightforward concept: to extract different data types separately, you generate text summarization using a VLM from different data types, embed text summaries along with raw data accordingly to a vector database, and store raw unstructured data in a document store. The query will prompt the LLM to retrieve relevant vectors from both the vector database and document store and generate meaningful and accurate answers.

The following diagram illustrates the solution architecture.

The architecture diagram depicts the mmRAG architecture that integrates advanced reasoning and retrieval mechanisms. It combines text, table, and image (including chart) data into a unified vector representation, enabling cross-modal understanding and retrieval. The process begins with diverse data extractions from various sources such as URLs and PDF files by parsing and preprocessing text, table, and image data types separately, while table data is converted into raw text and image data into captions.

These parsed data streams are then fed into a multimodal embedding model, which encodes the various data types into uniform, high dimensional vectors. The resulting vectors, representing the semantic content regardless of original format, are indexed in a vector database for efficient approximate similarity searches. When a query is received, the reasoning and retrieval component performs similarity searches across this vector space to retrieve the most relevant information from the vast integrated knowledge base.

The retrieved multimodal representations are then used by the generation component to produce outputs such as text, images, or other modalities. The VLM component generates vector representations specifically for textual data, further enhancing the system’s language understanding capabilities. Overall, this architecture facilitates advanced cross-modal reasoning, retrieval, and generation by unifying different data modalities into a common semantic space.

Developers can access mmRAG source codes on the GitHub repo.

Configure Amazon Bedrock with LangChain

You start by configuring Amazon Bedrock to integrate with various components from the LangChain Community library. This allows you to work with the core FMs. You use the BedrockEmbeddings class to create two different embedding models: one for text (embedding_bedrock_text) and one for images (embeddings_bedrock_image). These embeddings represent textual and visual data in a numerical format, which is essential for various natural language processing (NLP) tasks.

Additionally, you use the LangChain BedrockChat class to create two chat model instances backed by Amazon Bedrock: one based on Anthropic Claude 3 Haiku (chat_bedrock_claude3_haiku) and one based on Anthropic Claude 3 Sonnet (chat_bedrock_claude3_sonnet). These instances are used for advanced query reasoning, argumentation, and retrieval tasks. See the following code snippet:

from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.chat_models.bedrock import BedrockChat

embedding_bedrock_text = BedrockEmbeddings(client=boto3_bedrock, model_id="amazon.titan-embed-g1-text-02")
embeddings_bedrock_image = BedrockEmbeddings(client=boto3_bedrock, model_id="amazon.titan-embed-image-v1")

model_kwargs =  { 
    "max_tokens": 2048,
    "temperature": 0.0,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["nnn"],
}
chat_bedrock_claude3_haiku = BedrockChat(
        model_id="anthropic:claude-3-haiku-20240307-v1:0", 
        client=boto3_bedrock,
        model_kwargs=model_kwargs,
    )
 
chat_bedrock_claude3_sonnet = BedrockChat(
        model_id="anthropic.claude-3-sonnet-20240229-v1:0", 
        client=boto3_bedrock,
        model_kwargs=model_kwargs,
    )

Parse content from data sources and embed both text and image data

In this section, we explore how to harness the power of Python to parse text, tables, and images from URLs and PDFs efficiently, using two powerful packages: Beautiful Soup and PyMuPDF. Beautiful Soup, a library designed for web scraping, makes it straightforward to sift through HTML and XML content, allowing you to extract the desired data from web pages. PyMuPDF offers an extensive set of functionalities for interacting with PDF files, enabling you to extract not just text but also tables and images with ease. See the following code:

from bs4 import BeautifulSoup as Soup
import fitz

def parse_tables_images_from_urls(url:str):
    ...
     # Parse the HTML content using BeautifulSoup
    soup = Soup(response.content, 'html.parser')

    # Find all table elements
    tables = soup.find_all('table')
    # Find all image elements
    images = soup.find_all('img')
    ...
 
def parse_images_tables_from_pdf(pdf_path:str):
    ...
    pdf_file = fitz.open(pdf_path)

    # Iterate through each page
    for page_index in range(len(pdf_file)): 
        # Select the page
        page = pdf_file[page_index]

        # Search for tables on the page and convert each one to a DataFrame
        tables = page.find_tables().tables
        for table in tables:
            df = table.to_pandas()
        
        # Search for images on the page and extract the raw image bytes
        images = page.get_images()
        for image in images:
            xref = image[0]
            image_info = pdf_file.extract_image(xref)
            image_data = image_info["image"]
       ...

The following code snippets demonstrate how to generate image captions using Anthropic Claude 3 by invoking the bedrock_get_img_description utility function. Additionally, they showcase how to embed the image pixels along with the image caption using the Amazon Titan image embedding model amazon.titan-embed-image-v1 by calling the get_text_embedding function.

image_caption = bedrock_get_img_description(model_id, 
            prompt='''You are an expert at analyzing images in great detail. Your task is to carefully examine the provided 
                    image and generate a detailed, accurate textual description capturing all of the important elements and 
                    context present in the image. Pay close attention to any numbers, data, or quantitative information visible, 
                    and be sure to include those numerical values along with their semantic meaning in your description. 
                    Thoroughly read and interpret the entire image before providing your detailed caption describing the 
                    image content in text format. Strive for a truthful and precise representation of what is depicted''',
            image=image_byteio, 
            max_token=max_token, 
            temperature=temperature, 
            top_p=top_p, 
            top_k=top_k, 
            stop_sequences='Human:')    
            
image_sum_vectors = get_text_embedding(image_base64=image_base64, text_description=image_caption,  embd_model_id=embd_model_id)        

Embedding and vectorizing multimodality data

You can harness the capabilities of the newly released Anthropic Claude 3 Sonnet and Haiku models on Amazon Bedrock, combined with the Amazon Titan image embedding model and LangChain. This powerful combination allows you to generate comprehensive text captions for tables and images, seamlessly integrating them into your content. Additionally, you can store vectors, objects, raw image file names, and source documents in an Amazon OpenSearch Serverless vector store and object store. Use the following code snippets to create image captions by invoking the utility function bedrock_get_img_description, and to embed image pixels along with image captions using the Amazon Titan image embedding model amazon.titan-embed-image-v1 by calling the get_text_embedding function.

def get_text_embedding(image_base64=None, text_description=None,  embd_model_id:str="amazon.titan-embed-image-v1"):
    input_data = {}
    if image_base64 is not None:
        input_data["inputImage"] = image_base64
    if text_description is not None:
        input_data["inputText"] = text_description
    if not input_data:
        raise ValueError("At least one of image_base64 or text_description must be provided")
    body = json.dumps(input_data)
    response = boto3_bedrock.invoke_model(
        body=body,
        modelId=embd_model_id,
        accept="application/json",
        contentType="application/json"
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")
    
image_caption = bedrock_get_img_description(model_id, 
            prompt='''You are an expert at analyzing images in great detail. Your task is to carefully examine the provided 
                    image and generate a detailed, accurate textual description capturing all of the important elements and 
                    context present in the image. Pay close attention to any numbers, data, or quantitative information visible, 
                    and be sure to include those numerical values along with their semantic meaning in your description. 
                    Thoroughly read and interpret the entire image before providing your detailed caption describing the 
                    image content in text format. Strive for a truthful and precise representation of what is depicted''',
            image=image_byteio, 
            max_token=max_token, 
            temperature=temperature, 
            top_p=top_p, 
            top_k=top_k, 
            stop_sequences='Human:')    
            
image_sum_vectors = get_text_embedding(image_base64=image_base64, text_description=image_caption, embd_model_id=embd_model_id) 

You can consult the provided code examples for more information on how to embed multimodal and insert vector documents into the OpenSearch Serverless vector store. For more information about data access, refer to Data access control for Amazon OpenSearch Serverless.

# Form a data dictionary with image metadata, the raw image object store location, and Base64-encoded image data
document = {
    "doc_source": image_url,
    "image_filename": s3_image_path,
    "embedding": image_base64
}
# Parse out only the image name from the full temp path
filename = f"jsons/{image_path.split('/')[-1].split('.')[0]}.json"

# Writing the data dict into JSON data
with open(filename, 'w') as file:
    json.dump(document, file, indent=4)

#Load all json files from the temp directory  
loader = DirectoryLoader("./jsons", glob='**/*.json', show_progress=False, loader_cls=TextLoader)

#loader = DirectoryLoader("./jsons", glob='**/*.json', show_progress=True, loader_cls=JSONLoader, loader_kwargs = {'jq_schema':'.content'})
new_documents = loader.load()
new_docs = text_splitter.split_documents(new_documents)
   
# Insert into AOSS
new_docsearch = OpenSearchVectorSearch.from_documents(
    new_docs,
    bedrock_embeddings,
    opensearch_url=host,
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name=new_index_name,
    engine="faiss",
)

Advanced RAG with fusion and decomposition

Fusion in RAG presents an innovative search strategy designed to transcend the limitations of conventional search techniques, aligning more closely with the complex nature of human inquiries. This initiative elevates the search experience by integrating multi-faceted query generation and using Reciprocal Rank Fusion for an enhanced re-ranking of search outcomes. This approach offers a more nuanced and effective way to navigate the vast expanse of available information, catering to the intricate and varied demands of users’ searches.
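To make the re-ranking step concrete, the following toy sketch applies the RRF formula, score = sum over result lists of 1 / (rank + k), to two ranked result lists; the full retrieval implementation appears later in this section:

def rrf_scores(ranked_lists, k=60):
    # Accumulate 1 / (rank + k) for every ranked list a document appears in.
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# doc_a is ranked first by both query variants, so it receives the highest fused score.
print(rrf_scores([["doc_a", "doc_b", "doc_c"], ["doc_a", "doc_d", "doc_b"]]))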

The following diagram illustrates this workflow.

We use the Anthropic Claude 3 Sonnet and Haiku models, which possess the capability to process visual and language data, which enables them to handle the query decomposition (Haiku) and answer fusion (Sonnet) stages effectively. The following code snippet demonstrates how to create a retriever using OpenSearch Serverless:

from langchain.vectorstores import OpenSearchVectorSearch

vector_store = OpenSearchVectorSearch(
    opensearch_url="https://{}.{}.aoss.amazonaws.com".format(<collection_id>, <my_region>),
    index_name=<index_name>,
    embedding_function=embd)
retriever = vector_store.as_retriever()

The combination of decomposition and fusion intends to address the limitations of the chain-of-thought (CoT) method in language models. It involves breaking down complex problems into simpler, sequential sub-problems, where each sub-problem builds upon the solution of the previous one. This technique significantly enhances the problem-solving abilities of language models in areas such as symbolic manipulation, compositional generalization, and mathematical reasoning.

The RAG-decomposition approach, which uses the decomposition step (see the following code), underscores the potential of a technique called least-to-most prompting. This technique not only improves upon existing methods but also paves the way for more advanced, interactive learning frameworks for language models. The ultimate goal is to move towards a future where language models can learn from bidirectional conversations, enabling more effective reasoning and problem-solving capabilities.

# Decomposition
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt_rag = hub.pull("rlm/rag-prompt")
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries semantically related to: {question} \n
Output (5 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)
generate_queries_decomposition = (prompt_decomposition | llm_bedrock | StrOutputParser() | (lambda x: x.split("\n")))
questions = generate_queries_decomposition.invoke({"question": question})

from langchain.load import dumps, loads

def reciprocal_rank_fusion(results: list[list], k=60):

    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Retrieve the current score of the document, if any
            previous_score = fused_scores[doc_str]
            # Update the score of the document using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results
    
def retrieve_and_rag(question, prompt_rag, sub_question_generator_chain):
    sub_questions = sub_question_generator_chain.invoke({"question": question})
    # Initialize a list to hold RAG chain results
    rag_results = []
    for sub_question in sub_questions:
        # Retrieve documents for each sub-question with reciprocal reranking
        retrieved_docs = retrieval_chain_rag_fusion.invoke({"question": sub_question})
        # Use the retrieved documents and sub-question in the RAG chain
        answer = (prompt_rag
            | chat_bedrock
            | StrOutputParser()
            ).invoke({"context": retrieved_docs, "question": sub_question})
        rag_results.append(answer)
    return rag_results, sub_questions
    
def format_qa_pairs(questions, answers):
    """Format Q and A pairs"""

    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    return formatted_string.strip()

# Prompt for synthesizing a final answer from the Q+A pairs
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the question: {question}
"""
prompt_fusion = ChatPromptTemplate.from_template(template)
final_rag_chain = (prompt_fusion | llm_bedrock | StrOutputParser())

# Decomposition and reciprocal reranking
retrieval_chain_rag_fusion = generate_queries_decomposition | retriever.map() | reciprocal_rank_fusion

# Run retrieval and RAG for each sub-question, then synthesize the final answer
answers, questions = retrieve_and_rag(question, prompt_rag, generate_queries_decomposition)
context = format_qa_pairs(questions, answers)
final_rag_chain.invoke({"context": context, "question": question})

The RAG process is further enhanced by the reciprocal re-ranker, which helps ensure the retrieved results are both relevant and semantically aligned with the user’s intended query. This multimodal retrieval approach operates seamlessly across vector databases and object stores, providing a more efficient, accurate, and contextually aware search mechanism.

Multimodal retrieval

The mmRAG architecture enables the system to understand and process multimodal queries, retrieve relevant information from various sources, and generate multimodal answers by combining textual, tabular, and visual information in a unified manner. The following diagram highlights the data flows from queries to answers by using an advanced RAG and a multimodal retrieval engine powered by a multimodal embedding model (amazon.titan-embed-image-v1), an object store (Amazon S3), and a vector database (OpenSearch Serverless). For tables, the system retrieves relevant table locations and metadata, and computes the cosine similarity between the multimodal embedding and the vectors representing the table and its summary. Similarly, for images, the system retrieves relevant image locations and metadata, and computes the cosine similarity between the multimodal embedding and the vectors representing the image and its caption.

# Connect to the AOSS with given host and index name
docsearch = OpenSearchVectorSearch(
    index_name=index_name,  # TODO: use the same index-name used in the ingestion script
    embedding_function=bedrock_embeddings,
    opensearch_url=host,  # TODO: e.g. use the AWS OpenSearch domain instantiated previously
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    engine="faiss",
)

# Query for images with text
query = "What is the math and reasoning score MMMU (val) for Anthropic Claude 3 Sonnet ?"
t2i_results = docsearch.similarity_search_with_score(query, k=3)  # our search query  # return 3 most relevant docs

# Or Query AOSS with image aka image-to-image
with open(obj_image_path, "rb") as image_file:
    image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode('utf8')
    image_vectors = get_image_embedding(image_base64=image_base64)
    i2i_results = docsearch.similarity_search_with_score_by_vector(image_vectors, k=3)  # our search query  # return 3 most relevant docs

The following screenshot illustrates the improved accuracy and comprehensive understanding of the user’s query with multimodality capability. The mmRAG approach is capable of grasping the intent behind the query, extracting relevant information from the provided chart, and estimating the overall costs, including the estimated output token size. Furthermore, it can perform mathematical calculations to determine the cost difference. The output includes the source chart and a link to its original location.

Use cases and limitations

Amazon Bedrock offers a comprehensive set of generative AI models for enhancing content comprehension across various modalities. By using the latest advancements in VLMs, such as Anthropic Claude 3 Sonnet and Haiku, as well as the Amazon Titan image embedding model, Amazon Bedrock enables you to expand your document understanding beyond text to include tables, charts, and images. The integration of OpenSearch Serverless provides enterprise-grade vector storage and approximate k-NN search capabilities, enabling efficient retrieval of relevant information. With advanced LangChain decomposition and fusion techniques, you can use multi-step querying across different LLMs to improve accuracy and gain deeper insights. This powerful combination of cutting-edge technologies allows you to unlock the full potential of multimodal content comprehension, enabling you to make informed decisions and drive innovation across various data sources.

The reliance on visual language models and image embedding models for comprehensive and accurate image captions has its limitations. Although these models excel at understanding visual and textual data, the multi-step query decomposition, reciprocal ranking, and fusion processes involved can lead to increased inference latency. This makes such solutions less suitable for real-time applications or scenarios that demand instantaneous responses. However, these solutions can be highly beneficial in use cases that require higher accuracy and can tolerate less time-sensitive responses, allowing for more detailed and accurate analysis of complex visual and textual data.

Conclusion

In this post, we discussed how you can use multimodal RAG to address limitations in multimodal generative AI assistants. We invite you to explore mmRAG and take advantage of the advanced features of Amazon Bedrock. These powerful tools can assist your business in gaining deeper insights, making well-informed decisions, and fostering innovation driven by more accurate data. Ongoing research efforts are focused on developing an agentic and graph-based pipeline to streamline the processes of parsing, ingestion, and retrieval. These approaches hold the promise of enhancing the reliability and reusability of the mmRAG system.

Acknowledgement

The authors would like to express their sincere gratitude to Nausheen Sayed, Karen Twelves, Li Zhang, Sophia Shramko, Mani Khanuja, Santhosh Kuriakose, and Theresa Perkins for their comprehensive reviews.


About the Authors

Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Changsha Ma is a generative AI Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting for good food, mentoring college students for entrepreneurship, and spending time with friends and family.

Julianna Delua is a Principal Specialist for AI/ML and generative AI. She serves financial services industry customers, including those in Capital Markets, Fintech, and Payments. Julianna enjoys helping businesses turn new ideas into solutions and transform their organizations with AI-powered solutions.

Read More

How 20 Minutes empowers journalists and boosts audience engagement with generative AI on Amazon Bedrock

How 20 Minutes empowers journalists and boosts audience engagement with generative AI on Amazon Bedrock

This post is co-written with Aurélien Capdecomme and Bertrand d’Aure from 20 Minutes.

With 19 million monthly readers, 20 Minutes is a major player in the French media landscape. The media organization delivers useful, relevant, and accessible information to an audience that consists primarily of young and active urban readers. Every month, nearly 8.3 million 25–49-year-olds choose 20 Minutes to stay informed. Established in 2002, 20 Minutes consistently reaches more than a third (39 percent) of the French population each month through print, web, and mobile platforms.

As 20 Minutes’s technology team, we’re responsible for developing and operating the organization’s web and mobile offerings and driving innovative technology initiatives. For several years, we have been actively using machine learning and artificial intelligence (AI) to improve our digital publishing workflow and to deliver a relevant and personalized experience to our readers. With the advent of generative AI, and in particular large language models (LLMs), we have now adopted an AI by design strategy, evaluating the application of AI for every new technology product we develop.

One of our key goals is to provide our journalists with a best-in-class digital publishing experience. Our newsroom journalists work on news stories using Storm, our custom in-house digital editing experience. Storm serves as the front end for Nova, our serverless content management system (CMS). These applications are a focus point for our generative AI efforts.

In 2023, we identified several challenges where we see the potential for generative AI to have a positive impact. These include new tools for newsroom journalists, ways to increase audience engagement, and a new way to ensure advertisers can confidently assess the brand safety of our content. To implement these use cases, we rely on Amazon Bedrock.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon Web Services (AWS) through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

This blog post outlines various use cases where we’re using generative AI to address digital publishing challenges. We dive into the technical aspects of our implementation and explain our decision to choose Amazon Bedrock as our foundation model provider.

Identifying challenges and use cases

Today’s fast-paced news environment presents both challenges and opportunities for digital publishers. At 20 Minutes, a key goal of our technology team is to develop new tools for our journalists that automate repetitive tasks, improve the quality of reporting, and allow us to reach a wider audience. Based on this goal, we have identified three challenges and corresponding use cases where generative AI can have a positive impact.

The first use case is to use automation to minimize the repetitive manual tasks that journalists perform as part of the digital publishing process. The core work of developing a news story revolves around researching, writing, and editing the article. However, when the article is complete, supporting information and metadata must be defined, such as an article summary, categories, tags, and related articles.

While these tasks can feel like a chore, they are critical to search engine optimization (SEO) and therefore the audience reach of the article. If we can automate some of these repetitive tasks, this use case has the potential to free up time for our newsroom to focus on core journalistic work while increasing the reach of our content.

The second use case is how we republish news agency dispatches at 20 Minutes. Like most news outlets, 20 Minutes subscribes to news agencies, such as the Agence France-Presse (AFP) and others, that publish a feed of news dispatches covering national and international news. 20 Minutes journalists select stories relevant to our audience and rewrite, edit, and expand on them to fit the editorial standards and unique tone our readership is used to. Rewriting these dispatches is also necessary for SEO, as search engines rank duplicate content low. Because this process follows a repeatable pattern, we decided to build an AI-based tool to simplify the republishing process and reduce the time spent on it.

The third and final use case we identified is to improve transparency around the brand safety of our published content. As a digital publisher, 20 Minutes is committed to providing a brand-safe environment for potential advertisers. Content can be classified as brand-safe or not brand-safe based on its appropriateness for advertising and monetization. Depending on the advertiser and brand, different types of content might be considered appropriate. For example, some advertisers might not want their brand to appear next to news content about sensitive topics such as military conflicts, while others might not want to appear next to content about drugs and alcohol.

Organizations such as the Interactive Advertising Bureau (IAB) and the Global Alliance for Responsible Media (GARM) have developed comprehensive guidelines and frameworks for classifying the brand safety of content. Based on these guidelines, data providers such as the IAB and others conduct automated brand safety assessments of digital publishers by regularly crawling websites such as 20minutes.fr and calculating a brand safety score.

However, this brand safety score is site-wide and doesn’t break down the brand safety of individual news articles. Given the reasoning capabilities of LLMs, we decided to develop an automated per-article brand safety assessment based on industry-standard guidelines to provide advertisers with a real-time, granular view of the brand safety of 20 Minutes content.

Our technical solution

At 20 Minutes, we’ve been using AWS since 2017, and we aim to build on top of serverless services whenever possible.

The digital publishing frontend application Storm is a single-page application built using React and Material Design and deployed using Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront. Our CMS backend Nova is implemented using Amazon API Gateway and several AWS Lambda functions. Amazon DynamoDB serves as the primary database for 20 Minutes articles. New articles and changes to existing articles are captured using DynamoDB Streams, which invokes processing logic in AWS Step Functions and feeds our search service based on Amazon OpenSearch Service.

We integrate Amazon Bedrock using AWS PrivateLink, which allows us to create a private connection between our Amazon Virtual Private Cloud (VPC) and Amazon Bedrock without traversing the public internet.

20 Minutes architecture diagram

When working on articles in Storm, journalists have access to several AI tools implemented using Amazon Bedrock. Storm is a block-based editor that allows journalists to combine multiple blocks of content, such as title, lede, text, image, social media quotes, and more, into a complete article. With Amazon Bedrock, journalists can use AI to generate an article summary suggestion block and place it directly into the article. We use a single-shot prompt with the full article text in context to generate the summary.
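The following sketch shows what such a single-shot summarization call can look like with the Amazon Bedrock Runtime API. The model ID, Region, prompt wording, and inference parameters are illustrative assumptions, not the exact configuration used in Storm.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-3")  # Region is an assumption

def suggest_summary(article_text: str) -> str:
    # Single-shot prompt: the full article text is passed in context, no examples needed
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "temperature": 0.2,
        "messages": [
            {
                "role": "user",
                "content": f"Summarize the following article in three short sentences, in French:\n\n{article_text}",
            }
        ],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model choice
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

The returned text is placed into a suggestion block that the journalist can accept, edit, or discard.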

Storm CMS also gives journalists suggestions for article metadata. This includes recommendations for appropriate categories, tags, and even in-text links. These references to other 20 Minutes content are critical to increasing audience engagement, as search engines rank content with relevant internal and external links higher.

To implement this, we use a combination of Amazon Comprehend and Amazon Bedrock to extract the most relevant terms from an article’s text and then perform a search against our internal taxonomic database in OpenSearch. Based on the results, Storm provides several suggestions of terms that should be linked to other articles or topics, which users can accept or reject.
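As a rough sketch, this flow can look like the following; the index name, field name, language code, and score threshold are assumptions for illustration.

import boto3

comprehend = boto3.client("comprehend")

def suggest_links(article_text, opensearch_client, index_name="taxonomy"):  # index name is an assumption
    # Extract candidate terms from the (French) article text; truncate to stay within API size limits
    response = comprehend.detect_key_phrases(Text=article_text[:5000], LanguageCode="fr")
    terms = [kp["Text"] for kp in response["KeyPhrases"] if kp["Score"] > 0.9]

    # Look up each term in the internal taxonomic database in OpenSearch
    suggestions = []
    for term in terms:
        hits = opensearch_client.search(
            index=index_name,
            body={"query": {"match": {"label": term}}, "size": 3},  # "label" field is an assumption
        )
        suggestions.extend(hit["_source"] for hit in hits["hits"]["hits"])
    return suggestions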

20 Minutes summary generation feature

News dispatches become available in Storm as soon as we receive them from our partners such as AFP. Journalists can browse the dispatches and select them for republication on 20minutes.fr. Every dispatch is manually reworked by our journalists before publication. To do so, journalists first invoke a rewrite of the article by an LLM using Amazon Bedrock. For this, we use a low-temperature single-shot prompt that instructs the LLM not to reinterpret the article during the rewrite, and to keep the word count and structure as similar as possible. The rewritten article is then manually edited by a journalist in Storm like any other article.

To implement our new brand safety feature, we process every new article published on 20minutes.fr. Currently, we use a single-shot prompt that includes both the article text and the IAB brand safety guidelines in context to get a sentiment assessment from the LLM. We then parse the response, store the sentiment, and make it publicly available for each article to be accessed by ad servers.
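A sketch of this assessment follows; the prompt wording, guideline text, and model ID are placeholders, and the response parsing in production depends on the exact output format requested from the model.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def assess_brand_safety(article_text, iab_guidelines):
    # Single-shot prompt with both the article and the brand safety guidelines in context
    prompt = (
        "Using the brand safety guidelines below, assess whether the article is "
        "brand-safe and name the relevant content category.\n\n"
        f"Guidelines:\n{iab_guidelines}\n\nArticle:\n{article_text}"
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "temperature": 0,  # deterministic output for a classification-style task
        "messages": [{"role": "user", "content": prompt}],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model choice
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    # The assessment text is parsed and stored downstream, then exposed to ad servers
    return result["content"][0]["text"]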

Lessons learned and outlook

When we started working on generative AI use cases at 20 Minutes, we were surprised at how quickly we were able to iterate on features and get them into production. Thanks to the unified Amazon Bedrock API, it’s easy to switch between models for experimentation and find the best model for each use case.

For the use cases described above, we use Anthropic’s Claude in Amazon Bedrock as our primary LLM because of its overall high quality and, in particular, its quality in recognizing French prompts and generating French completions. Because 20 Minutes content is almost exclusively French, these multilingual capabilities are key for us. We have found that careful prompt engineering is a key success factor and we closely adhere to Anthropic’s prompt engineering resources to maximize completion quality.

Even without relying on approaches like fine-tuning or retrieval-augmented generation (RAG) to date, we can implement use cases that deliver real value to our journalists. Based on data collected from our newsroom journalists, our AI tools save them an average of eight minutes per article. With around 160 pieces of content published every day, this is already a significant amount of time that can now be spent reporting the news to our readers, rather than performing repetitive manual tasks.

The success of these use cases depends not only on technical efforts, but also on close collaboration between our product, engineering, newsroom, marketing, and legal teams. Together, representatives from these roles make up our AI Committee, which establishes clear policies and frameworks to ensure the transparent and responsible use of AI at 20 Minutes. For example, every use of AI is discussed and approved by this committee, and all AI-generated content must undergo human validation before being published.

We believe that generative AI is still in its infancy when it comes to digital publishing, and we look forward to bringing more innovative use cases to our platform this year. We’re currently working on deploying fine-tuned LLMs using Amazon Bedrock to accurately match the tone and voice of our publication and further improve our brand safety analysis capabilities. We also plan to use Bedrock models to tag our existing image library and provide automated suggestions for article images.

Why Amazon Bedrock?

Based on our evaluation of several generative AI model providers and our experience implementing the use cases described above, we selected Amazon Bedrock as our primary provider for all our foundation model needs. The key reasons that influenced this decision were:

  1. Choice of models: The market for generative AI is evolving rapidly, and the AWS approach of working with multiple leading model providers ensures that we have access to a large and growing set of foundational models through a single API.
  2. Inference performance: Amazon Bedrock delivers low-latency, high-throughput inference. With on-demand and provisioned throughput, the service can consistently meet all of our capacity needs.
  3. Private model access: We use AWS PrivateLink to establish a private connection to Amazon Bedrock endpoints without traversing the public internet, ensuring that we maintain full control over the data we send for inference.
  4. Integration with AWS services: Amazon Bedrock is tightly integrated with AWS services such as AWS Identity and Access Management (IAM) and the AWS Software Development Kit (AWS SDK). As a result, we were able to quickly integrate Bedrock into our existing architecture without having to adapt any new tools or conventions.

Conclusion and outlook

In this blog post, we described how 20 Minutes is using generative AI on Amazon Bedrock to empower our journalists in the newsroom, reach a broader audience, and make brand safety transparent to our advertisers. With these use cases, we’re using generative AI to bring more value to our journalists today, and we’ve built a foundation for promising new AI use cases in the future.

To learn more about Amazon Bedrock, start with Amazon Bedrock Resources for documentation, blog posts, and more customer success stories.


About the authors

Aurélien Capdecomme is the Chief Technology Officer at 20 Minutes, where he leads the IT development and infrastructure teams. With over 20 years of experience in building efficient and cost-optimized architectures, he has a strong focus on serverless strategy, scalable applications and AI initiatives. He has implemented innovation and digital transformation strategies at 20 Minutes, overseeing the complete migration of digital services to the cloud.

Bertrand d’Aure is a software developer at 20 Minutes. An engineer by training, he designs and implements the backend of 20 Minutes applications, with a focus on the software used by journalists to create their stories. Among other things, he is responsible for adding generative AI features to the software to simplify the authoring process.

Dr. Pascal Vogel is a Solutions Architect at Amazon Web Services. He collaborates with enterprise customers across EMEA to build cloud-native solutions with a focus on serverless and generative AI. As a cloud enthusiast, Pascal loves learning new technologies and connecting with like-minded customers who want to make a difference in their cloud journey.

Read More

Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker

Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker

In the rapidly evolving landscape of artificial intelligence (AI), the rise of generative AI models has ushered in a new era of personalized and intelligent experiences. Organizations are increasingly using the power of these language models to drive innovation and enhance their services, from natural language processing to content generation and beyond.

Using generative AI models in the enterprise environment, however, requires taming their intrinsic power and enhancing their skills to address specific customer needs. In cases where an out-of-the-box model is missing knowledge of domain- or organization-specific terminologies, a custom fine-tuned model, also called a domain-specific large language model (LLM), might be an option for performing standard tasks in that domain or micro-domain. BloombergGPT is an example of an LLM that was trained from scratch to have a better understanding of highly specialized vocabulary found in the financial domain. In the same sense, domain specificity can be addressed through fine-tuning at a smaller scale. Customers are fine-tuning generative AI models based on domains including finance, sales, marketing, travel, IT, HR, procurement, healthcare and life sciences, customer service, and many more. Additionally, independent software vendors (ISVs) are building secure, managed, multi-tenant, end-to-end generative AI platforms with models that are customized and personalized based on their customer’s datasets and domains. For example, Forethought introduced SupportGPT, a generative AI platform for customer support.

As the demands for personalized and specialized AI solutions grow, businesses often find themselves grappling with the challenge of efficiently managing and serving a multitude of fine-tuned models across diverse use cases and customer segments. With the need to serve a wide range of AI-powered use cases, from resume parsing and job skill matching to domain-specific email generation and natural language understanding, these businesses are often left with the daunting task of managing hundreds of fine-tuned models, each tailored to specific customer needs or use cases. The complexities of this challenge are compounded by the inherent scalability and cost-effectiveness concerns that come with deploying and maintaining such a diverse model ecosystem. Traditional approaches to model serving can quickly become unwieldy and resource intensive, leading to increased infrastructure costs, operational overhead, and potential performance bottlenecks.

Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage and switching cost for hosting independent instances for different tasks. LoRA (Low-Rank Adaptation) is an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows for quick task switching when deployed as a service by sharing the vast majority of the model parameters.

In this post, we explore a solution that addresses these challenges head-on using LoRA serving with Amazon SageMaker. By using the new performance optimizations of LoRA techniques in SageMaker large model inference (LMI) containers along with inference components, we demonstrate how organizations can efficiently manage and serve their growing portfolio of fine-tuned models, while optimizing costs and providing seamless performance for their customers.

The latest SageMaker LMI container offers unmerged-LoRA inference, sped up with our LMI-Dist inference engine and OpenAI style chat schema. To learn more about LMI, refer to LMI Starting Guide, LMI handlers Inference API Schema, and Chat Completions API Schema.

New LMI features for serving LoRA adapters at scale on SageMaker

There are two ways LoRA adapters can be served on the supported engines:

  • Merged LoRA – This applies the adapter by modifying the base model in place. It has zero added latency while running, but has a cost to apply or unapply the merge. It works best for cases with only a few adapters. It is best for single-adapter batches, and doesn’t support multi-adapter batches.
  • Unmerged LoRA – This alters the model operators to factor in the adapters without changing the base model. It has a higher inference latency for the additional adapter operations. However, it does support multi-adapter batches. It works best for use cases with a large number of adapters.

The new LMI container offers out-of-the-box integration and abstraction with SageMaker for hosting multiple unmerged LoRA adapters with higher performance (low latency and high throughput). The LMI container offers two backends for serving LoRA adapters: the LMI-Dist backend (recommended) and the vLLM backend. Both backends are based on the open source vLLM library, which in turn uses S-LoRA and Punica techniques, but the LMI-Dist backend provides an additional optimized continuous (rolling) batching implementation. You are not required to configure these libraries separately; the LMI container provides the higher-level abstraction through the vLLM and LMI-Dist backends. We recommend you start with the LMI-Dist backend because it has additional performance optimizations related to continuous (rolling) batching.

S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes unified paging. Unified paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead.

Punica is designed to efficiently serve multiple LoRA models on a shared GPU cluster. It achieves this by following three design guidelines:

  • Consolidating multi-tenant LoRA serving workloads to a small number of GPUs to increase overall GPU utilization
  • Enabling batching for different LoRA models to improve performance and GPU utilization
  • Focusing on the decode stage performance, which is the predominant factor in the cost of model serving

Punica uses a new CUDA kernel design called Segmented Gather Matrix-Vector Multiplication (SGMV) to batch GPU operations for concurrent runs of multiple LoRA models, significantly improving GPU efficiency in terms of memory and computation. Punica also implements a scheduler that routes requests to active GPUs and migrates requests for consolidation, optimizing GPU resource allocation. Overall, Punica achieves high throughput and low latency in serving multi-tenant LoRA models on a shared GPU cluster. For more information, read the Punica whitepaper.

The following figure shows the multi LoRA adapter serving stack of the LMI container on SageMaker.

As shown in the preceding figure, the LMI container provides the higher-level abstraction through the vLLM and LMI-Dist backends to serve LoRA adapters at scale on SageMaker. As a result, you’re not required to configure the underlying libraries (S-LORA, Punica, or vLLM) separately. However, there might be cases where you want to control some of the performance driving parameters depending on your use case and application performance requirements. The following are the common configuration options the LMI container provides to tune LoRA serving. For more details on configuration options specific to each backend, refer to vLLM Engine User Guide and LMI-Dist Engine User Guide.

  • option.enable_lora – This config enables support for LoRA adapters.
  • option.max_loras – This config determines the maximum number of LoRA adapters that can be run at once. GPU memory is allocated for that number of adapters.
  • option.max_lora_rank – This config determines the maximum rank allowed for a LoRA adapter. Set this value to the maximum rank of your adapters. Setting a larger value enables more adapters at a greater memory usage cost.
  • option.lora_extra_vocab_size – This config determines the maximum additional vocabulary that can be added through a LoRA adapter.
  • option.max_cpu_loras – This config determines the maximum number of LoRA adapters to cache in CPU memory. All others are evicted to disk.
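For reference, the following sketch expresses these settings as LMI container environment variables, following the same OPTION_<NAME> naming convention used in the deployment code later in this post. The values are illustrative and should be tuned to your adapters and available GPU memory.

# LoRA-related LMI settings expressed as container environment variables (illustrative values)
lora_env = {
    "OPTION_ENABLE_LORA": "true",           # turn on LoRA adapter support
    "OPTION_MAX_LORAS": "4",                # adapters that can run at once in GPU memory
    "OPTION_MAX_LORA_RANK": "64",           # set to the maximum rank across your adapters
    "OPTION_LORA_EXTRA_VOCAB_SIZE": "256",  # extra vocabulary an adapter may add
    "OPTION_MAX_CPU_LORAS": "8",            # adapters cached in CPU memory before eviction to disk
}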

Design patterns for serving fine-tuned LLMs at scale

Enterprises grappling with the complexities of managing generative AI models often encounter scenarios where a robust and flexible design pattern is crucial. One common use case involves a single base model with multiple LoRA adapters, each tailored to specific customer needs or use cases. This approach allows organizations to use a foundational language model while maintaining the agility to fine-tune and deploy customized versions for their diverse customer base.

Single-base model with multiple fine-tuned LoRA adapters

An enterprise offering a resume parsing and job skill matching service may use a single high-performance base model, such as Mistral 7B. The Mistral 7B base model is particularly well-suited for job-related content generation tasks, such as creating personalized job descriptions and tailored email communications. Mistral’s strong performance in natural language generation and its ability to capture industry-specific terminology and writing styles make it a valuable asset for such an enterprise’s customers in the HR and recruitment space. By fine-tuning Mistral 7B with LoRA adapters, enterprises can make sure the generated content aligns with the unique branding, tone, and requirements of each customer, delivering a highly personalized experience.

Multi-base models with multiple fine-tuned LoRA adapters

On the other hand, the same enterprise may use the Llama 3 base model for more general natural language processing tasks, such as resume parsing, skills extraction, and candidate matching. Llama 3’s broad knowledge base and robust language understanding capabilities enable it to handle a wide range of documents and formats, making sure their services can effectively process and analyze candidate information, regardless of the source. By fine-tuning Llama 3 with LoRA adapters, such enterprises can tailor the model’s performance to specific customer requirements, such as regional dialects, industry-specific terminology, or unique data formats. By employing a multi-base model, multi-adapter design pattern, enterprises can take advantage of the unique strengths of each language model to deliver a comprehensive and highly personalized job profile-to-candidate resume matching service. This approach allows enterprises to cater to the diverse needs of their customers, making sure each client receives tailored AI-powered solutions that enhance their recruitment and talent management processes.

Effectively implementing and managing these design patterns, where multiple base models are coupled with numerous LoRA adapters, is a key challenge that enterprises must address to unlock the full potential of their generative AI investments. A well-designed and scalable approach to model serving is crucial in delivering cost-effective, high-performance, and personalized experiences to customers.

Solution overview

The following sections outline the coding steps to deploy a base LLM, TheBloke/Llama-2-7B-Chat-fp16, with LoRA adapters on SageMaker. It involves preparing a compressed archive with the base model files and LoRA adapter files, uploading it to Amazon Simple Storage Service (Amazon S3), selecting and configuring the SageMaker LMI container to enable LoRA support, creating a SageMaker endpoint configuration and endpoint, defining an inference component for the model, and sending inference requests specifying different LoRA adapters like Spanish (“es”) and French (“fr”) in the request payload to use those fine-tuned language capabilities. For more information on deploying models using SageMaker inference components, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

To showcase multi-base models with their LoRA adapters, we add another base model, mistralai/Mistral-7B-v0.1, and its LoRA adapter to the same SageMaker endpoint, as shown in the following diagram.

Prerequisites

You need to complete some prerequisites before you can run the notebook: an AWS account with access to SageMaker, an AWS Identity and Access Management (IAM) role with permissions to create SageMaker models and endpoints, and a Hugging Face access token (used later in this post when deploying the Mistral base model).

Upload your LoRA adapters to Amazon S3

To prepare the LoRA adapters, create an adapters.tar.gz compressed archive containing the LoRA adapters directory. The adapters directory should contain subdirectories for each of the LoRA adapters, with each adapter subdirectory containing the adapter_model.bin file (the adapter weights) and the adapter_config.json file (the adapter configuration). We typically obtain these adapter files by using the PeftModel.save_pretrained() method from the Peft library. After you assemble the adapters directory with the adapter files, you compress it into an adapters.tar.gz archive and upload it to an S3 bucket for deployment or sharing. We include the LoRA adapters in the adapters directory as follows:

|- model_dir
    |- adapters/
        |--- <adapter_1>/
        |--- <adapter_2>/
        |--- ...
        |--- <adapter_n>/
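If you fine-tune your own adapters instead of downloading existing ones, a minimal sketch of producing the adapter files with the PEFT library looks like the following; the adapter ID and output path are placeholders.

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-fp16")
# Load a fine-tuned LoRA adapter (hypothetical adapter ID) on top of the base model
peft_model = PeftModel.from_pretrained(base_model, "my-org/llama-2-7b-chat-es-adapter")
# Writes adapter_config.json and the adapter weights into the adapters directory
peft_model.save_pretrained("llama-lora-multi-adapter/adapters/es")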

Download LoRA adapters, compress them, and upload the compressed file to Amazon S3:

snapshot_download("UnderstandLing/llama-2-7b-chat-es", local_dir="llama-lora-multi-adapter/adapters/es", local_dir_use_symlinks=False)
snapshot_download("UnderstandLing/llama-2-7b-chat-fr", local_dir="llama-lora-multi-adapter/adapters/fr", local_dir_use_symlinks=False)
snapshot_download("UnderstandLing/llama-2-7b-chat-ru", local_dir="llama-lora-multi-adapter/adapters/ru", local_dir_use_symlinks=False)
!tar czvf adapters.tar.gz -C llama-lora-multi-adapter .
s3_code_artifact_accelerate = sess.upload_data("adapters.tar.gz", model_bucket, s3_code_prefix)

Select an LMI container and configure it to enable LoRA

SageMaker provides optimized containers for LMI that support different frameworks for model parallelism, allowing the deployment of LLMs across multiple GPUs. For this post, we employ the DeepSpeed container, which encompasses frameworks such as DeepSpeed and vLLM, among others. See the following code:

from sagemaker import image_uris

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)

env_generation = {"OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "2",
                  "OPTION_ROLLING_BATCH": "lmi-dist",
                  "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
                  "OPTION_DTYPE": "fp16",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

Create a SageMaker endpoint configuration

Create an endpoint configuration using the appropriate instance type. Set ContainerStartupHealthCheckTimeoutInSeconds to account for the time taken to download the LLM weights from Amazon S3 or the model hub, and the time taken to load the model on the GPUs:

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        },
    ],
)

Create a SageMaker endpoint

Create a SageMaker endpoint based on the endpoint configuration defined in the previous step. You use this endpoint to host the inference component (model) and make invocations.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)

Create a SageMaker inference component (model)

Now that you have created a SageMaker endpoint, let’s create our model as an inference component. The SageMaker inference component enables you to deploy one or more foundation models (FMs) on the same SageMaker endpoint and control how many accelerators and how much memory is reserved for each FM. See the following code:

model_name = sagemaker.utils.name_from_base("lmi-llama2-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env_generation,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)

prefix = sagemaker.utils.unique_name_from_base("lmi-llama2-7b")
inference_component_name = f"{prefix}-inference-component"

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 1200,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Make inference requests using different LoRA adapters

With the endpoint and inference model ready, you can now send requests to the endpoint using the LoRA adapters you fine-tuned for Spanish and French languages. The specific LoRA adapter is specified in the request payload under the "adapters" field. We use "es" for the Spanish language adapter and "fr" for the French language adapter, as shown in the following code:

# Testing Spanish (es) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Piensa en una excusa creativa para decir que no necesito ir a la fiesta."],
                     "adapters": ["es"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

# Testing French (fr) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

# Testing Russian (ru) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Придумайте креативное "],
                     "parameters": params,
                     "adapters": ["ru"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Add another base model and inference component and its LoRA adapter

Let’s add another base model and its LoRA adapter to the same SageMaker endpoint for multi-base models with multiple fine-tuned LoRA adapters. The code is very similar to the previous code for creating the Llama base model and its LoRA adapter.

Configure the SageMaker LMI container to host the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)

my_hf_token = "<YOUR_HuggingFacePersonalAccessToken_HERE>"

env_generation = {"HF_TOKEN": my_hf_token,
                  "OPTION_MODEL_ID": "mistralai/Mistral-7B-v0.1",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "2",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

Create a new SageMaker model and inference component for the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

model_name2 = sagemaker.utils.name_from_base("lmi-mistral-7b")

create_model_response = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)

prefix2 = sagemaker.utils.unique_name_from_base("lmi-mistral-7b")
inference_component_name2 = f"{prefix2}-inference-component"

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name2,
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Invoke the same SageMaker endpoint for the newly created inference component for the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

# Testing French (fr) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Clean up

Delete the SageMaker inference components, models, endpoint configuration, and endpoint to avoid incurring unnecessary costs:

sm_client.delete_inference_component(InferenceComponentName=inference_component_name)
sm_client.delete_inference_component(InferenceComponentName=inference_component_name2)
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
sm_client.delete_model(ModelName=model_name2)

Conclusion

The ability to efficiently manage and serve a diverse portfolio of fine-tuned generative AI models is paramount if you want your organization to deliver personalized and intelligent experiences at scale in today’s rapidly evolving AI landscape. With the inference capabilities of SageMaker LMI coupled with the performance optimizations of LoRA techniques, you can overcome the challenges of multi-tenant fine-tuned LLM serving. This solution enables you to consolidate AI workloads, batch operations across multiple models, and optimize resource utilization for cost-effective, high-performance delivery of tailored AI solutions to your customers. As demand for specialized AI experiences continues to grow, we’ve shown how the scalable infrastructure and cutting-edge model serving techniques of SageMaker position AWS as a powerful platform for unlocking generative AI’s full potential. To start exploring the benefits of this solution for yourself, we encourage you to use the code example and resources we’ve provided in this post.


About the authors

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Vivek Gangasani is an AI/ML Startup Solutions Architect for Generative AI startups at AWS. He helps emerging GenAI startups build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Read More

Mixtral 8x22B is now available in Amazon SageMaker JumpStart

Mixtral 8x22B is now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the Mixtral-8x22B large language model (LLM), developed by Mistral AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Mixtral-8x22B model.

What is Mixtral 8x22B

Mixtral 8x22B is Mistral AI’s latest open-weights model and sets a new standard for performance and efficiency of available foundation models, as measured by Mistral AI across standard industry benchmarks. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39 billion active parameters out of 141 billion, offering cost-efficiency for its size. Continuing with Mistral AI’s belief in the power of publicly available models and broad distribution to promote innovation and collaboration, Mixtral 8x22B is released under Apache 2.0, making the model available for exploring, testing, and deploying. Mixtral 8x22B is an attractive option for customers selecting among publicly available models who prioritize quality, and for those wanting higher quality than mid-sized models, such as Mixtral 8x7B and GPT-3.5 Turbo, while maintaining high throughput.

Mixtral 8x22B provides the following strengths:

  • Multilingual native capabilities in English, French, Italian, German, and Spanish languages
  • Strong mathematics and coding capabilities
  • Capable of function calling that enables application development and tech stack modernization at scale
  • 64,000-token context window that allows precise information recall from large documents

About Mistral AI

Mistral AI is a Paris-based company founded by seasoned researchers from Meta and Google DeepMind. During his time at DeepMind, Arthur Mensch (Mistral CEO) was a lead contributor on key LLM projects such as Flamingo and Chinchilla, while Guillaume Lample (Mistral Chief Scientist) and Timothée Lacroix (Mistral CTO) led the development of LLaMa LLMs during their time at Meta. The trio are part of a new breed of founders who combine deep technical expertise and operating experience working on state-of-the-art ML technology at the largest research labs. Mistral AI has championed small foundational models with superior performance and commitment to model development. They continue to push the frontier of artificial intelligence (AI) and make it accessible to everyone with models that offer unmatched cost-efficiency for their respective sizes, delivering an attractive performance-to-cost ratio. Mixtral 8x22B is a natural continuation of Mistral AI’s family of publicly available models that include Mistral 7B and Mixtral 8x7B, also available on SageMaker JumpStart. More recently, Mistral launched commercial enterprise-grade models, with Mistral Large delivering top-tier performance and outperforming other popular models with native proficiency across multiple languages.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment. You can now discover and deploy Mixtral-8x22B with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, providing data encryption at rest and in-transit.

SageMaker also adheres to standard security frameworks such as ISO27001 and SOC1/2/3 in addition to complying with various regulatory requirements. Compliance frameworks like the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), and Payment Card Industry Data Security Standard (PCI DSS) are supported to make sure data handling, storage, and processing meet stringent security standards.

SageMaker JumpStart availability is dependent on the model; Mixtral-8x22B v0.1 is currently supported in the US East (N. Virginia) and US West (Oregon) AWS Regions.

Discover models

You can access Mixtral-8x22B foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

From the SageMaker JumpStart landing page, you can search for “Mixtral” in the search box. You will see search results showing Mixtral 8x22B Instruct, various Mixtral 8x7B models, and Dolphin 2.5 and 2.7 models.

You can choose the model card to view details about the model, such as its license, the data used to train it, and how to use it. You will also find the Deploy button, which you can use to deploy the model and create an endpoint.

SageMaker provides seamless logging, monitoring, and auditing for deployed models through native integrations with services like AWS CloudTrail, which provides insight into API calls, and Amazon CloudWatch, which collects metrics, logs, and event data to give visibility into the model’s resource utilization.
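
For example, the following is a minimal sketch of pulling invocation counts for a deployed endpoint from CloudWatch with boto3; the endpoint and variant names are placeholders for whatever your deployment creates.

import boto3
from datetime import datetime, timedelta, timezone

# Placeholder names; replace with the endpoint and variant created by your deployment
ENDPOINT_NAME = "my-mixtral-endpoint"
VARIANT_NAME = "AllTraffic"

cloudwatch = boto3.client("cloudwatch")

# Sum of invocations over the last hour, in 5-minute buckets
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": ENDPOINT_NAME},
        {"Name": "VariantName", "Value": VARIANT_NAME},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)
print(response["Datapoints"])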

Deploy a model

Deployment starts when you choose Deploy. After deployment finishes, an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in your preferred notebook editor in SageMaker Studio. This requires an AWS Identity and Access Management (IAM) role and policy attached to it to restrict model access. Additionally, if you choose to deploy the model endpoint within SageMaker Studio, you will be prompted to choose an instance type, initial instance count, and maximum instance count. The ml.p4d.24xlarge and ml.p4de.24xlarge instance types are the only instance types currently supported for Mixtral 8x22B Instruct v0.1.

To deploy using the SDK, we start by selecting the Mixtral-8x22B Instruct model, specified by the model_id with value huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1. You can deploy the selected model on SageMaker with the following code; other JumpStart models can be deployed the same way by using their respective model IDs.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1")
predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel.
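
For example, the following is a minimal sketch of overriding the defaults. The specific values shown (instance type, instance count, and accept_eula) are illustrative assumptions that depend on your account limits and the model’s license terms.

from sagemaker.jumpstart.model import JumpStartModel

# Illustrative overrides; adjust to your account limits and requirements
model = JumpStartModel(
    model_id="huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1",
    instance_type="ml.p4d.24xlarge",
)
predictor = model.deploy(
    initial_instance_count=1,
    accept_eula=True,  # may be required depending on the model's end-user license agreement
)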

After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {"inputs": "Hello!"} 
predictor.predict(payload)

Example prompts

You can interact with a Mixtral-8x22B model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide example prompts.

Mixtral-8x22b Instruct

The instruction-tuned version of Mixtral-8x22B accepts formatted instructions where conversation roles must start with a user prompt and alternate between user instruction and assistant (model answer). The instruction format must be strictly respected; otherwise, the model will generate sub-optimal outputs. The template used to build a prompt for the Instruct model is defined as follows:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

<s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS), whereas [INST] and [/INST] are regular strings.

The following code shows how you can format the prompt in instruction format:

from typing import Dict, List

def format_instructions(instructions: List[Dict[str, str]]) -> List[str]:
    """Format instructions where conversation roles must alternate user/assistant/user/assistant/..."""
    prompt: List[str] = []
    for user, answer in zip(instructions[::2], instructions[1::2]):
        prompt.extend(["<s>", "[INST] ", (user["content"]).strip(), " [/INST] ", (answer["content"]).strip(), "</s>"])
    prompt.extend(["<s>", "[INST] ", (instructions[-1]["content"]).strip(), " [/INST] ","</s>"])
    return "".join(prompt)


def print_instructions(prompt: str, response: List[Dict[str, str]]) -> None:
    """Print the prompt and the model's generated text with bold headers."""
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}\n\n{bold}> Output{unbold}\n{response[0]['generated_text']}\n")
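
As a quick check of the format_instructions helper above, the following shows the string it produces for a single user turn (the expected output appears in the comment):

conversation = [{"role": "user", "content": "What is the capital of France?"}]
print(format_instructions(conversation))
# <s>[INST] What is the capital of France? [/INST] </s>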

Summarization prompt

You can use the following code to get a response for a summarization:

instructions = [{"role": "user", "content": """Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression - To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.
"""}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 1500}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

The following is an example of the expected output:

> Input
<s>[INST] Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression - To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction. [/INST] </s>
> Output
<s>[INST] Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression - To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction. [/INST] </s>Contextual compression is a technique used to manage the issue of context overflow in information retrieval. This method involves compressing and filtering the retrieved documents to align with the query's context, ensuring that only relevant information is processed. This is achieved through a two-step process: a base retriever fetches the initial set of documents, and a document compressor refines these documents by either reducing their content or excluding them entirely based on relevance.

The contextual compression retriever significantly improves the efficiency of Retrieval-Augmented Generation (RAG) applications by extracting and utilizing only the essential information from a large pool of data. This approach addresses the problem of information overload and irrelevant data processing, leading to improved response quality, cost-effective operations, and a smoother retrieval process. In essence, contextual compression acts as a filter that tailors the information to the specific query, making it an indispensable tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.

Multilingual translation prompt

You can use the following code to get a response for a multilingual translation:

Prompt

instructions = [{"role": "user", "content": """
<You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]

1.	Customer: "I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?"
2.	Customer: "I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop's performance isn't as expected. Can I have a technician look into it, or should I consider returning it?"
3.	Customer: "My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?"
4.	Customer: "I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven't received it. Could you assist in determining where my package might be?"
5.	Customer: "I'm trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?" 
"""}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2000, "do_sample": True}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

The following is an example of the expected output:

> Input
<s>[INST] <You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]


1. Customer: "I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?"
2. Customer: "I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop's performance isn't as expected. Can I have a technician look into it, or should I consider returning it?"
3. Customer: "My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?"
4. Customer: "I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven't received it. Could you assist in determining where my package might be?"
5. Customer: "I'm trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?" [/INST] </s>
> Output
<s>[INST] <You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]


1. Customer: "I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?"
2. Customer: "I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop's performance isn't as expected. Can I have a technician look into it, or should I consider returning it?"
3. Customer: "My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?"
4. Customer: "I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven't received it. Could you assist in determining where my package might be?"
5. Customer: "I'm trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?" [/INST] </s>

French:

1. Client : "J'ai récemment commandé un casque audio sans fil, mais j'ai reçu un modèle différent. Quelles sont les étapes à suivre pour recevoir le produit correct que j'ai commandé ?"
2. Client : "J'ai acheté un ordinateur portable personnalisable le mois dernier et j'ai opté pour des mises à niveau spécifiques. Cependant, les performances de l'ordinateur portable ne sont pas à la hauteur de mes attentes. Puis-je avoir un technicien qui vérifie cela, ou devrais-je envisager de le retourner ?"
3. Client : "Ma commande pour un sac à main de designer devait inclure un portefeuille assorti dans le cadre d'une offre promotionnelle, mais le portefeuille ne se trouvait pas dans le paquet. Comment puis-je résoudre ce problème ?"
4. Client : "Je vois que les informations de suivi de ma commande de batterie de cuisine en céramique indiquent qu'elle a été livrée, mais je ne l'ai pas reçue. Pourriez-vous m'aider à déterminer où se trouve mon colis ?"
5. Client : "J'essaie d'acheter un miroir antique de votre collection vintage, mais le site continue de me donner une erreur lorsque j'essaie de passer à la caisse. Existe-t-il un autre moyen de finaliser mon achat ?"

German:

1. Kunde: "Ich habe kürzlich ein Set kabelloser Kopfhörer bestellt, aber ich habe ein anderes Modell erhalten. Welche Schritte sollte ich unternehmen, um das richtige Produkt zu erhalten, das ich bestellt habe?"
2. Kunde: "Ich habe letzten Monat einen anpassbaren Laptop gekauft und habe mich für spezifische Upgrades entschieden. Allerdings entspricht die Leistung des Laptops nicht meinen Erwartungen. Kann ich einen Techniker hinzuziehen lassen oder sollte ich eine Rückgabe in Erwägung ziehen?"
3. Kunde: "Meine Bestellung für eine Designer-Handtasche sollte inklusive eines passenden Portemonnaies als Teil einer Werbeaktion sein, aber das Portemonnaie war nicht im Paket. Wie kann dieses Problem gelöst werden?"
4. Kunde: "Ich sehe, dass die Sendungsverfolgungsinformationen für meine Bestellung von Keramik-Kochgeschirr anzeigen, dass es geliefert wurde, aber ich habe es nicht erhalten. Könnten Sie mir dabei helfen, festzustellen, wo sich mein Paket befindet?"
5. Kunde: "Ich versuche ein Antikspiegel aus Ihrer Vintage-Kollektion zu kaufen, aber die Website gibt mir jedes Mal einen Fehler, wenn ich versuche, den Bestellvorgang abzuschließen. Gibt es einen anderen Weg, meinen Kauf abzuschließen?"

Spanish:

1. Cliente: "Recientemente ordené un conjunto de audífonos inalámbricos, pero recibí un modelo diferente. ¿Cuáles son los pasos que debo seguir para recibir el producto correcto que ordené?"
2. Cliente: "Compré una computadora personalizable el mes pasado y opté por actualizaciones específicas. Sin embargo, el rendimiento de la computadora no está a la altura de mis expectativas. ¿Puedo tener un técnico que revise esto, o debería considerar devolverla?"
3. Cliente: "Mi pedido de un bolso de diseñador debería haber incluido una billetera a juego como parte de una oferta promocional, pero la billetera no estaba en el paquete. ¿Cómo puedo resolver este problema?"
4. Cliente: "Veo que la información de seguimiento de mi pedido de utensilios de cocina de cerámica indica que ha sido entregado, pero aún no lo he recibido. ¿Podría ayudarme a determinar dónde se encuentra mi paquete?"
5. Cliente: "Estoy tratando de comprar un espejo antiguo de su colección de vintage, pero el sitio sigue dándome un error cada vez que trato de realizar el pago. ¿Hay otra forma de completar mi compra?"

Code generation

You can use the following code to get a response for code generation:

instructions = [{"role": "user", "content": "Write a program on Dijkstra's algorithm using a priority queue in python:"}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 1000}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

You get the following output:

> Input
<s>[INST] Write a program on Dijkstra's algorithm using a priority queue in python: [/INST] </s>

> Output
<s>[INST] Write a program on Dijkstra's algorithm using a priority queue in python: [/INST] </s>Here is a Python implementation of Dijkstra's algorithm using a priority queue (heapq module):
```python
import heapq

def dijkstra(graph, start):
    # Initialize the priority queue with the starting vertex and its distance (0)
    priority_queue = [(0, start)]
    # Initialize the distances dictionary with infinite distances for all vertices except the starting vertex
    distances = {vertex: float('inf') for vertex in graph}
    distances[start] = 0
    # Initialize the previous dictionary to keep track of the shortest path
    previous = {vertex: None for vertex in graph}

    while priority_queue:
        # Get the vertex with the smallest distance from the priority queue
        current_distance, current_vertex = heapq.heappop(priority_queue)

        # If the current distance is greater than the stored distance, skip this vertex
        if current_distance > distances[current_vertex]:
            continue

        # Update the distances of the adjacent vertices
        for neighbor, weight in graph[current_vertex].items():
            distance = current_distance + weight
            # If the new distance is smaller than the stored distance, update the distance and previous vertex
            if distance < distances[neighbor]:
                distances[neighbor] = distance
                previous[neighbor] = current_vertex
                # Add the neighbor to the priority queue
                heapq.heappush(priority_queue, (distance, neighbor))

    return distances, previous

# Example usage:
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'A': 1, 'C': 2, 'D': 5},
    'C': {'A': 4, 'B': 2, 'D': 1},
    'D': {'B': 5, 'C': 1}
}

distances, previous = dijkstra(graph, 'A')
print(distances)  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```
In this implementation, the `graph` parameter is a dictionary where the keys are the vertices and the values are dictionaries that map adjacent vertices to their weights. The `start` parameter is the starting vertex for the algorithm. The function returns two dictionaries: `distances` and `previous`. The `distances` dictionary maps each vertex to its shortest distance from the starting vertex, and the `previous` dictionary maps each vertex to its previous vertex in the shortest path.

Reasoning and math

You can use the following code to get a response for reasoning and math:


instructions = [{"role": "user", "content": """Sarah went to a car shop to buy a car from Fred for $20,000 in 2024. She plans to sell the car but it depreciates by 2% each year after she has bought it. She went to a dealer in which that dealer told her that the car has only depreciated by 1.4% each year. After 7 years of using the car, Sarah decides to sell it directly to another person.

How much did Sarah sell the car for and what year is it? Explain the steps before answering. It's ok to make some assumptions as you come to your answer.

"""}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2000, "do_sample": True}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

You get the following output:

<s>[INST] Sarah went to a car shop to buy a car from Fred for $20,000 in 2024. She plans to sell the car but it depreciates by 2% each year after she has bought it. She went to a dealer in which that dealer told her that the car has only depreciated by 1.4% each year. After 7 years of using the car, Sarah decides to sell it directly to another person. 

How much did Sarah sell the car for and what year is it? Explain the steps before answering. It's ok to make some assumptions as you come to your answer. [/INST] </s>To solve this problem, we need to calculate the value of the car after 7 years, assuming two different annual depreciation rates: 2% and 1.4%.

First, let's calculate the value of the car after 7 years assuming a 2% annual depreciation rate:

Year 1: $20,000 - ($20,000 * 2%) = $19,600
Year 2: $19,600 - ($19,600 * 2%) = $19,208
Year 3: $19,208 - ($19,208 * 2%) = $18,821.60
Year 4: $18,821.60 - ($18,821.60 * 2%) = $18,442.22
Year 5: $18,442.22 - ($18,442.22 * 2%) = $18,071.62
Year 6: $18,071.62 - ($18,071.62 * 2%) = $17,709.76
Year 7: $17,709.76 - ($17,709.76 * 2%) = $17,356.58

So, assuming a 2% annual depreciation rate, the value of the car after 7 years would be approximately $17,356.58.

Next, let's calculate the value of the car after 7 years assuming a 1.4% annual depreciation rate:

Year 1: $20,000 - ($20,000 * 1.4%) = $19,728
Year 2: $19,728 - ($19,728 * 1.4%) = $19,463.23
Year 3: $19,463.23 - ($19,463.23 * 1.4%) = $19,205.80
Year 4: $19,205.80 - ($19,205.80 * 1.4%) = $18,955.66
Year 5: $18,955.66 - ($18,955.66 * 1.4%) = $18,712.73
Year 6: $18,712.73 - ($18,712.73 * 1.4%) = $18,476.95
Year 7: $18,476.95 - ($18,476.95 * 1.4%) = $18,248.33

So, assuming a 1.4% annual depreciation rate, the value of the car after 7 years would be approximately $18,248.33.

Now, since Sarah sells the car directly to another person, we can assume that she sold it for the higher of the two values. Therefore, she sold the car for $18,248.33.

As for the year in which she sold the car, since we know that Sarah purchased the car in 2024 and used it for 7 years, we can calculate the year as follows:

2024 + 7 = 2031

Therefore, Sarah sold the car in the year 2031.

So, Sarah sold the car for $18,248.33 in the year 2031.

Clean up

After you’re done running the notebook, delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Mixtral-8x22B in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

Now that you are aware of Mistral AI and their Mixtral 8x22B models, we encourage you to deploy an endpoint on SageMaker to perform inference testing and try out responses for yourself. Refer to the following resources for more information:


About the Authors

Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers and acquisitions. Marco is based in Seattle, WA, and enjoys writing, reading, exercising, and building applications in his free time.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last-mile delivery.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services including model offerings from top tier foundation model providers.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his master’s from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Read More

Building Generative AI prompt chaining workflows with human in the loop

Building Generative AI prompt chaining workflows with human in the loop

Generative AI is a type of artificial intelligence (AI) that can be used to create new content, including conversations, stories, images, videos, and music. Like all AI, generative AI works by using machine learning models—very large models that are pretrained on vast amounts of data called foundation models (FMs). FMs are trained on a broad spectrum of generalized and unlabeled data. They’re capable of performing a wide variety of general tasks with a high degree of accuracy based on input prompts. Large language models (LLMs) are one class of FMs. LLMs are specifically focused on language-based tasks such as summarization, text generation, classification, open-ended conversation, and information extraction.

FMs and LLMs, even though they’re pre-trained, can continue to learn from data inputs or prompts during inference. This means that you can develop comprehensive outputs through carefully curated prompts. A prompt is the information you pass into an LLM to elicit a response. This includes task context, data that you pass to the model, conversation and action history, instructions, and even examples. The process of designing and refining prompts to get specific responses from these models is called prompt engineering.

While LLMs are good at following instructions in the prompt, as a task gets complex, they’re known to drop tasks or perform a task not at the desired accuracy. LLMs can handle complex tasks better when you break them down into smaller subtasks. This technique of breaking down a complex task into subtasks is called prompt chaining. With prompt chaining, you construct a set of smaller subtasks as individual prompts. Together, these subtasks make up the overall complex task. To accomplish the overall task, your application feeds each subtask prompt to the LLM in a pre-defined order or according to a set of rules.

While Generative AI can create highly realistic content, including text, images, and videos, it can also generate outputs that appear plausible but are verifiably incorrect. Incorporating human judgment is crucial, especially in complex and high-risk decision-making scenarios. This involves building a human-in-the-loop process where humans play an active role in decision making alongside the AI system.

In this blog post, you will learn about prompt chaining, how to break a complex task into multiple tasks to use prompt chaining with an LLM in a specific order, and how to involve a human to review the response generated by the LLM.

Example overview

To illustrate this example, consider a retail company that allows purchasers to post product reviews on their website. By responding promptly to those reviews, the company demonstrates its commitment to customers and strengthens customer relationships.

Figure 1: Customer review and response

The example application in this post automates the process of responding to customer reviews. For most reviews, the system auto-generates a reply using an LLM. However, if the review or LLM-generated response contains uncertainty around toxicity or tone, the system flags it for a human reviewer. The human reviewer then assesses the flagged content to make the final decision about the toxicity or tone.

The application uses event-driven architecture (EDA), a powerful software design pattern that you can use to build decoupled systems by communicating through events. As soon as the product review is created, the review receiving system uses Amazon EventBridge to send an event that a product review is posted, along with the actual review content. The event starts an AWS Step Functions workflow. The workflow runs through a series of steps including generating content using an LLM and involving human decision making.
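
The following is a minimal sketch of how the review receiving system might publish that event using boto3. The event bus name, source, and detail-type strings are illustrative assumptions; an EventBridge rule matching the detail type would then target the Step Functions state machine.

import json

import boto3

events = boto3.client("events")

# Publish the "new review posted" event along with the review content
events.put_events(
    Entries=[
        {
            "EventBusName": "product-reviews-bus",       # illustrative bus name
            "Source": "com.example.review-receiving",    # illustrative source
            "DetailType": "NEW_REVIEW_POSTED",
            "Detail": json.dumps({
                "review_id": "12345",
                "review_text": "The blender leaks from the base after two uses.",
            }),
        }
    ]
)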

Figure 2: Review workflow

The process of generating a review response includes evaluating the toxicity of the review content, identifying sentiment, generating a response, and involving a human approver. This naturally fits into a workflow type of application because it’s a single process containing multiple sequential steps along with the need to manage state between steps. Hence the example uses Step Functions for workflow orchestration. Here are the steps in the review response workflow.

  1. Detect whether the review content has any harmful information using the Amazon Comprehend DetectToxicContent API. The API responds with a toxicity score between 0 and 1 that represents the overall confidence of the detection, with scores closer to 1 indicating higher toxicity (a sketch of this call appears after this list).
  2. If the toxicity of the review is in the range of 0.4 – 0.6, send the review to a human reviewer to make the decision.
  3. If the toxicity of the review is greater than 0.6 or the reviewer finds the review harmful, publish a HARMFUL_CONTENT_DETECTED message.
  4. If the toxicity of the review is less than 0.4 or the reviewer approves the review, find the sentiment of the review first and then generate the response to the review comment. Both tasks are achieved using a generative AI model.
  5. Repeat the toxicity detection through the Comprehend API for the LLM-generated response.
  6. If the toxicity of the LLM-generated response is in the range of 0.4 – 0.6, send the LLM-generated response to a human reviewer.
  7. If the LLM-generated response is found to be non-toxic, publish a NEW_REVIEW_RESPONSE_CREATED event.
  8. If the LLM-generated response is found to be toxic, publish a RESPONSE_GENERATION_FAILED event.
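
The following is a minimal sketch of the toxicity check in step 1, calling the Amazon Comprehend DetectToxicContent API through boto3 and applying the thresholds above. The AWS Region and sample review text are illustrative assumptions.

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def get_toxicity_score(review_text: str) -> float:
    """Return the overall toxicity confidence score (0-1) for a single review."""
    response = comprehend.detect_toxic_content(
        TextSegments=[{"Text": review_text}],
        LanguageCode="en",
    )
    # ResultList has one entry per text segment; Toxicity is the overall score
    return response["ResultList"][0]["Toxicity"]

score = get_toxicity_score("The product arrived broken and support was useless.")
if score > 0.6:
    decision = "HARMFUL_CONTENT_DETECTED"
elif score >= 0.4:
    decision = "SEND_TO_HUMAN_REVIEWER"
else:
    decision = "GENERATE_RESPONSE"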

Figure 3: product review evaluation and response workflow

Getting started

Use the instructions in the GitHub repository to deploy and run the application.

Prompt chaining

Prompt chaining simplifies the problem for the LLM by dividing single, detailed, and monolithic tasks into smaller, more manageable tasks. Some, but not all, LLMs are good at following all the instructions in a single prompt. The simplification results in writing focused prompts for the LLM, leading to a more consistent and accurate response. The following is a sample ineffective single prompt.

Read the below customer review, filter for harmful content and provide your thoughts on the overall sentiment in JSON format. Then construct an email response based on the sentiment you determine and enclose the email in JSON format. Based on the sentiment, write a report on how the product can be improved.

To make it more effective, you can split the prompt into multiple subtasks (a minimal sketch of chaining them appears after this list):

  1. Filter for harmful content
  2. Get the sentiment
  3. Generate the email response
  4. Write a report
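
The following is a minimal sketch of chaining those subtasks as separate prompts. The model ID, prompts, and sample review are illustrative assumptions, and any model endpoint that accepts a text prompt can be substituted for the call_llm helper.

import boto3

# Illustrative model choice; substitute whichever model you have access to
MODEL_ID = "mistral.mistral-large-2402-v1:0"
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def call_llm(prompt: str) -> str:
    """Send one focused prompt to the model and return the generated text."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

review = "The blender leaks from the base after two uses and support never replied."

# Subtask 1: filter for harmful content
toxicity_verdict = call_llm(
    f"Does this review contain harmful or abusive content? Answer YES or NO.\n\n{review}"
)

# Subtask 2: get the sentiment
sentiment = call_llm(
    f"Classify the sentiment of this review as POSITIVE, NEGATIVE, or NEUTRAL.\n\n{review}"
)

# Subtask 3: generate the email response, chaining in the sentiment from subtask 2
email_reply = call_llm(
    f"The review below has {sentiment.strip()} sentiment. "
    f"Write a short, polite email reply to the customer.\n\n{review}"
)

# Subtask 4: write a report on how the product could be improved
report = call_llm(
    f"Based on this review, write a brief report on how the product could be improved.\n\n{review}"
)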

You can even run some of the tasks in parallel. By breaking down to focused prompts, you achieve the following benefits:

  • You speed up the entire process. You can handle tasks in parallel, use different models for different tasks, and send responses back to the user rather than waiting for the model to process a larger prompt for a considerably longer time.
  • Better prompts provide better output. With focused prompts, you can engineer the prompts by adding additional relevant context, thus improving the overall reliability of the output.
  • You spend less time developing. Prompt engineering is an iterative process. Debugging LLM calls for a detailed prompt and refining the larger prompt for accuracy both require significant time and effort. Smaller tasks enable you to experiment and refine through successive iterations.

Step Functions is a natural fit to build prompt chaining because it offers multiple different ways to chain prompts: sequentially, in parallel, and iteratively by passing the state data from one state to another. Consider the situation where you have built the product review response prompt chaining workflow and now want to evaluate the responses from different LLMs to find the best fit using an evaluation test suite. The evaluation test suite consists of hundreds of test product reviews, a reference response to the review, and a set of rules to evaluate the LLM response against the reference response. You can automate the evaluation activity using a Step Functions workflow. The first task in the workflow asks the LLM to generate a review response for the product review. The second task then asks the LLM to compare the generated response to the reference response using the rules and generate an evaluation score. Based on the evaluation score for each review, you can decide if the LLM passes your evaluation criteria or not. You can use the map state in Step Functions to run the evaluations for each review in your evaluation test suite in parallel. See this repository for more prompt chaining examples.

Human in the loop

Involving human decision making in the example allows you to improve the accuracy of the system when the toxicity of the content cannot be determined to be either safe or harmful. You can implement human review within the Step Functions workflow using Wait for a Callback with the Task Token integration. When you use this integration with any supported AWS SDK API, the workflow task generates a unique token and then pauses until the token is returned. You can use this integration to include human decision making, call a legacy on-premises system, wait for completion of long running tasks, and so on.

"Wait for human approval for product review": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:{region}:{account}:function:human-approval-helper-product-review-response-automation-stage",
        "Payload": {
          "review_text.$": "$$.Execution.Input.review_text",
          "token.$": "$$.Task.Token",
          "api_url": "https://{apiID}.execute-api.{region}.amazonaws.com/dev"
        }
      }
}

In the sample application, the send email for approval task includes a wait for the callback token. It invokes an AWS Lambda function with a token and waits for the token. The Lambda function builds an email message along with the link to an Amazon API Gateway URL. Lambda then uses Amazon Simple Notification Service (Amazon SNS) to send an email to a human reviewer. The reviewer reviews the content and either accepts or rejects the message by selecting the appropriate link in the email. This action invokes the Step Functions SendTaskSuccess API. The API sends back the task token and a status message of whether to accept or reject the review. Step Functions receives the token, resumes the send email for approval task and then passes control to the choice state. The choice state decides whether to go through acceptance or rejection of the review based on the status message.
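
The following is a minimal sketch of the callback side of this flow: a Lambda function behind the approval link that returns the task token with the SendTaskSuccess API. The query string parameter names and response body are illustrative assumptions.

import json

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # Illustrative parameter names passed through the API Gateway link in the email
    params = event["queryStringParameters"]
    token = params["token"]
    decision = params["decision"]  # for example, "approve" or "reject"

    # Return the task token so the paused Step Functions task resumes,
    # passing the reviewer's decision as the task output
    sfn.send_task_success(
        taskToken=token,
        output=json.dumps({"review_status": decision}),
    )
    return {"statusCode": 200, "body": "Your decision has been recorded."}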

Figure 4: Human-in-the-loop workflow

Event-driven architecture

EDA enables building extensible architectures. You can add consumers at any time by subscribing to the event. For example, consider moderating images and videos attached to a product review in addition to the text content. You also need to write code to delete the images and videos if they are found harmful. You can add a consumer, the image moderation system, to the NEW_REVIEW_POSTED event without making any code changes to the existing event consumers or producers. Development of the image moderation system and of the review response system that deletes harmful images can proceed in parallel, which in turn improves development velocity.

When the image moderation workflow finds toxic content, it publishes a HARMFUL_CONTENT_DETECTED event. The event can be processed by a review response system that decides what to do with it. By decoupling systems through events, you gain many advantages, including improved development velocity, variable scaling, and fault tolerance.

Figure 5: Event-driven workflow

Cleanup

Use the instructions in the GitHub repository to delete the sample application.

Conclusion

In this blog post, you learned how to build a generative AI application with prompt chaining and a human-review process. You learned how both techniques improve the accuracy and safety of a generative AI application. You also learned how event-driven architectures along with workflows can integrate existing applications with generative AI applications.

Visit Serverless Land for more Step Functions workflows.


About the authors

Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.

Uma Ramadoss is a Principal Solutions Architect at Amazon Web Services, focused on Serverless and Integration Services. She is responsible for helping customers design and operate event-driven cloud-native applications using services like Lambda, API Gateway, EventBridge, Step Functions, and SQS. Uma has hands-on experience leading enterprise-scale serverless delivery projects and possesses strong working knowledge of event-driven, microservice, and cloud architecture.

Read More