Build a multimodal social media content generator using Amazon Bedrock

In today’s digital age, social media has revolutionized the way brands interact with their consumers, creating a need for dynamic and engaging content that resonates with their target audience. There’s growing competition for consumer attention in this space; content creators and influencers face constant challenges to produce new, engaging, and brand-consistent content. The challenges come from three key factors: the need for rapid content production, the desire for personalized content that is both captivating and visually appealing and reflects the unique interests of the consumer, and the necessity for content that is consistent with a brand’s identity, messaging, aesthetics, and tone.

Traditionally, content creation has been a time-consuming process involving multiple steps such as ideation, research, writing, editing, design, and review. This slow cycle does not fit the rapid pace of social media.

Generative AI offers new possibilities to address this challenge and can be used by content teams and influencers to enhance their creativity and engagement while maintaining brand consistency. More specifically, multimodal capabilities of large language models (LLMs) allow us to create the rich, engaging content spanning text, images, audio, and video formats that are omnipresent in advertising, marketing, and social media content. With recent advancements in vision LLMs, creators can use visual input, such as reference images, to start the content creation process. Image similarity search and text semantic search further enhance the process by quickly retrieving relevant content and context.

In this post, we walk you through a step-by-step process to create a social media content generator app using vision, language, and embedding models (Anthropic’s Claude 3, Amazon Titan Image Generator, and Amazon Titan Multimodal Embeddings) through the Amazon Bedrock API and Amazon OpenSearch Serverless. Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. OpenSearch Serverless is a fully managed service that makes it easier to store vectors and other data types in an index and delivers sub-second query latency when searching billions of vectors and measuring semantic similarity.

Here’s how the proposed process for content creation works:

  1. First, the user (content team or marketing team) uploads a product image with a simple background (such as a handbag). Then, they provide natural language descriptions of the scene and enhancements they wish to add to the image as a prompt (such as “Christmas holiday decorations”).
  2. Next, Amazon Titan Image Generator creates the enhanced image based on the provided scenario.
  3. Then, we generate rich and engaging text that describes the image while aligning with brand guidelines and tone using Claude 3.
  4. After the draft (text and image) is created, our solution performs multimodal similarity searches against historical posts to find similar posts and gain inspiration and recommendations to enhance the draft post.
  5. Finally, based on the generated recommendations, the post text is further refined and provided to the user on the webpage. The following diagram illustrates the end-to-end new content creation process.

Solution overview

In this solution, we start with data preparation, where the raw datasets can be stored in an Amazon Simple Storage Service (Amazon S3) bucket. We provide a Jupyter notebook to preprocess the raw data and use the Amazon Titan Multimodal Embeddings model to convert the image and text into embedding vectors. These vectors are then saved on OpenSearch Serverless as collections, as shown in the following figure.

Next is the content generation. The GUI webpage is hosted using a Streamlit application, where the user can provide an initial product image and a brief description of how they expect the enriched image to look. From the application, the user can also select the brand (which will link to a specific brand template later), choose the image style (such as photographic or cinematic), and select the tone for the post text (such as formal or casual).

After all the configurations are provided, the content creation process, shown in the following figure, is launched.

In stage 1, the solution retrieves the brand-specific template and guidelines from a CSV file. In a production environment, you could maintain the brand template table in Amazon DynamoDB for scalability, reliability, and maintenance. The user input is used to generate the enriched image with the Amazon Titan Image Generator. Together with all the other information, it’s fed into the Claude 3 model, which has vision capability, to generate the initial post text that closely aligns with the brand guidelines and the enriched image. At the end of this stage, the enriched image and initial post text are created and sent back to the GUI to display to users.
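The brand lookup in stage 1 can be sketched as follows. This is a minimal illustration, not the repository's code: the column names (`brand`, `tone`, `hashtag`, `brand_messaging`) and the sample rows are assumptions, and the actual CSV file may use a different layout.

```python
import csv
import io

# Hypothetical sample of a brand guideline table; the column names and rows
# are assumptions and may differ from the CSV file used by the solution.
SAMPLE_CSV = """brand,tone,hashtag,brand_messaging
Luxury Brand,formal,#LuxuryBrand,Timeless elegance and superior quality
Fast Fashion Brand,casual,#FastFashion,Trendy looks at accessible prices
"""


def load_brand_guideline(csv_file, brand):
    """Return the guideline row for `brand` as a dict, or raise KeyError."""
    for row in csv.DictReader(csv_file):
        if row["brand"].strip().lower() == brand.strip().lower():
            return row
    raise KeyError(f"No guideline found for brand: {brand}")


guideline = load_brand_guideline(io.StringIO(SAMPLE_CSV), "Luxury Brand")
print(guideline["tone"], guideline["hashtag"])
```

Swapping the CSV reader for a DynamoDB lookup would keep the same function signature, which is one reason to isolate the lookup behind a small helper like this.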

In stage 2, we combine the post text and image and use the Amazon Titan Multimodal Embeddings model to generate the embedding vector. Multimodal embedding models integrate information from different data types, such as text and images, into a unified representation. This enables searching for images using text descriptions, identifying similar images based on visual content, or combining both text and image inputs to refine search results. In this solution, the multimodal embedding vector is used to search and retrieve the top three similar historical posts from the OpenSearch vector store. The retrieved results are fed into Anthropic’s Claude 3 model to generate a caption, provide insights on why these historical posts are engaging, and offer recommendations on how the user can improve their post.
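As a rough sketch of how a combined text and image embedding request might be built, the following helper assembles the request body for the Titan Multimodal Embeddings model. The helper names `build_embedding_request` and `embed` are hypothetical, and the model ID is the one published for Titan Multimodal Embeddings G1 at the time of writing; verify both the ID and the field names against the current Amazon Bedrock documentation before relying on them.

```python
import base64
import json

# Model ID assumed for Amazon Titan Multimodal Embeddings G1; check the
# Amazon Bedrock console for the value available in your Region.
EMBEDDING_MODEL_ID = "amazon.titan-embed-image-v1"


def build_embedding_request(image_bytes=None, text=None, dimension=1024):
    """Build the JSON body for a text, image, or combined embedding request."""
    if image_bytes is None and text is None:
        raise ValueError("Provide image_bytes, text, or both")
    body = {"embeddingConfig": {"outputEmbeddingLength": dimension}}
    if text is not None:
        body["inputText"] = text
    if image_bytes is not None:
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps(body)


def embed(bedrock_runtime, image_bytes=None, text=None, dimension=1024):
    """Invoke the model with a bedrock-runtime client supplied by the caller,
    for example boto3.client("bedrock-runtime")."""
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        body=build_embedding_request(image_bytes, text, dimension),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


request = json.loads(build_embedding_request(image_bytes=b"abc", text="holiday bag"))
print(sorted(request.keys()))
```

Passing both the text and the image in one request yields a single vector for the post, which is what allows one kNN query to match on both modalities.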

In stage 3, based on the recommendations from stage 2, the solution automatically refines the post text and provides a final version to the user. The user has the flexibility to select the version they like and make changes before publishing. For the end-to-end content generation process, steps are orchestrated with the Streamlit application.

The whole process is shown in the following image:

Implementation steps

This solution has been tested in AWS Region us-east-1. However, it can also work in other Regions where the following services are available. Make sure you have the following set up before moving forward:

We use Amazon SageMaker Studio to generate historical post embeddings and save those embedding vectors to OpenSearch Serverless. Additionally, you will run the Streamlit app from the SageMaker Studio terminal to visualize and test the solution. Testing the Streamlit app in a SageMaker environment is intended for a temporary demo. For production, we recommend deploying the Streamlit app on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) services with proper security measures such as authentication and authorization.

We use the following models from Amazon Bedrock in the solution. Please see Model support by AWS Region and select the Region that supports all three models:

  • Amazon Titan Multimodal Embeddings Model
  • Amazon Titan Image Generator
  • Claude 3 Sonnet

Set up a JupyterLab space on SageMaker Studio

A JupyterLab space is a private or shared space within SageMaker Studio that manages the storage and compute resources needed to run the JupyterLab application.

To set up a JupyterLab space

  1. Sign in to your AWS account and open the AWS Management Console. Go to SageMaker Studio.
  2. Select your user profile and choose Open Studio.
  3. From Applications in the top left, choose JupyterLab.
  4. If you already have a JupyterLab space, choose Run. If you do not, choose Create JupyterLab Space to create one. Enter a name and choose Create Space.
  5. Change the instance to t3.large and choose Run Space.
  6. Within a minute, you should see that the JupyterLab space is ready. Choose Open JupyterLab.
  7. In the JupyterLab launcher window, choose Terminal.
  8. Run the following command on the terminal to download the sample code from GitHub:
    git clone https://github.com/aws-samples/Build-a-multimodal-social-media-content-generator-using-Amazon-Bedrock.git

Generate sample posts and compute multimodal embeddings

In the code repository, we provide some sample product images (bag, car, perfume, and candle) that were created using the Amazon Titan Image Generator model. Next, you can generate synthetic social media posts with the notebook synthetic-data-generation.ipynb by following the steps below. The generated post texts are saved in the metadata.jsonl file (if you prepared your own product images and post texts, you can skip this step). Then, compute multimodal embeddings for the pairs of images and generated texts. Finally, ingest the multimodal embeddings into a vector store on Amazon OpenSearch Serverless.

To generate sample posts

  1. In JupyterLab, choose File Browser and navigate to the folder social-media-generator/embedding-generation.
  2. Open the notebook synthetic-data-generation.ipynb.
  3. Choose the default Python 3 kernel and Data Science 3.0 image, then follow the instructions in the notebook.
  4. At this stage, you will have sample posts that are created and available in data_mapping.csv.
  5. Open the notebook multimodal_embedding_generation.ipynb. The notebook first creates the multimodal embeddings for the post-image pair. It then ingests the computed embeddings into a vector store on Amazon OpenSearch Serverless.
  6. At the end of the notebook, you should be able to perform a simple query to the collection as shown in the following example:
query_prompt = "christmas tree, holiday, bags"
similar_items = find_similar_items_from_query(
                    query_prompt = query_prompt, k=3, num_results=5, 
                    index_name=index_name, dataset = df, 
                    open_search_client = oss_client)
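For readers who want to see what the vector index behind such a query might look like, here is a minimal sketch of an index definition and a document builder. The field names `image_vector`, `file_name`, and `post_text` match those used in the search code later in this post; the HNSW method settings and the helper names are assumptions, and the notebook may configure the index differently.

```python
import json


def build_knn_index_body(dimension=1024):
    """Index definition with a kNN vector field plus the post metadata fields."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "image_vector": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    # Assumed method settings; adjust engine and space type as needed.
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                },
                "file_name": {"type": "text"},
                "post_text": {"type": "text"},
            }
        },
    }


def build_post_document(embedding, file_name, post_text):
    """Document stored for one historical post."""
    return {"image_vector": list(embedding), "file_name": file_name, "post_text": post_text}


index_body = build_knn_index_body()
doc = build_post_document([0.0] * 1024, "bag_01.png", "Sample post text")
print(json.dumps(index_body["mappings"]["properties"]["image_vector"]))

# With an opensearch-py client (assumed to be configured elsewhere):
#   oss_client.indices.create(index=index_name, body=index_body)
#   oss_client.index(index=index_name, body=doc)
```

The vector dimension has to match the embedding length requested from the embedding model, otherwise ingestion fails.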

The preparation steps are complete. If you want to try out the solution directly, you can skip to Run the solution with Streamlit App to quickly test the solution in your SageMaker environment. However, if you want a more detailed understanding of each step’s code and explanations, continue reading.

Generate a social media post (image and text) using FMs

In this solution, we use FMs through Amazon Bedrock for content creation. We start by enhancing the input product image using the Amazon Titan Image Generator model, which adds a dynamically relevant background around the target product.

The get_titan_ai_request_body function creates a JSON request body for the Titan Image Generator model, using its Outpainting feature. It accepts four parameters: outpaint_prompt (for example, “Christmas tree, holiday decoration” or “Mother’s Day, flowers, warm lights”), negative_prompt (elements to exclude from the generated image), mask_prompt (specifies areas to retain, such as “bag” or “car”), and image_str (the input image encoded as a base64 string).

The generate_image function requires model_id and body (the request body from get_titan_ai_request_body). It invokes the model using bedrock.invoke_model and returns the response containing the base64-encoded generated image.

Finally, the code snippet calls get_titan_ai_request_body with the provided prompts and input image string, then passes the request body to generate_image, resulting in the enhanced image.

import json
import random

def get_titan_ai_request_body(outpaint_prompt, negative_prompt, mask_prompt, image_str=None):
    # Use a random seed so that repeated calls produce different variations
    seed = random.randint(0, 2147483647)
    body = {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": outpaint_prompt,
            "negativeText": negative_prompt,
            "image": image_str,
            "maskPrompt": mask_prompt,
            "outPaintingMode": "PRECISE" # or DEFAULT
        },
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "quality": "premium",
            "cfgScale": 8,
            "seed": seed,
        }
    }
    return json.dumps(body)

def generate_image(model_id, body):
    """
    Args:
    model_id (str): The model ID to use.
    body (str) : The request body to use.
    Returns:
    image_bytes (bytes): The image generated by the model.
    """
    logger.info("Generating image with model %s", model_id)
    
    accept = "application/json"
    content_type = "application/json"
    
    response = bedrock.invoke_model(
        body=body, modelId=model_id, accept=accept, contentType=content_type
    )
    response_body = json.loads(response.get("body").read())
    return response_body

body = get_titan_ai_request_body(outpaint_prompt, negative_prompt, mask_prompt, image_str = image_str)
response = generate_image(model_id =MODEL_IMAGE, body = body)
image_enhanced = base64_to_image(response["images"][0])
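The snippet above relies on base64 conversion helpers that are not shown in this post. A minimal byte-level sketch is given below; the function names are hypothetical, and the repository's `base64_to_image` presumably returns an image object rather than raw bytes.

```python
import base64


def image_bytes_to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as the base64 string expected in the request body."""
    return base64.b64encode(image_bytes).decode("utf-8")


def base64_to_image_bytes(image_str: str) -> bytes:
    """Decode a base64 string from the model response back into raw image bytes."""
    return base64.b64decode(image_str)


# Round trip with placeholder bytes standing in for a real image file.
original = b"placeholder image bytes"
encoded = image_bytes_to_base64(original)
decoded = base64_to_image_bytes(encoded)
print(decoded == original)

# If Pillow is installed, the decoded bytes can be opened as an image:
#   from io import BytesIO
#   from PIL import Image
#   image = Image.open(BytesIO(decoded))
```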

The following images showcase the enhanced versions generated based on input prompts like “Christmas tree, holiday decoration, warm lights,” a selected position (such as bottom-middle), and a brand (“Luxury Brand”). These settings influence the output images. If the generated image is unsatisfactory, you can repeat the process until you achieve the desired outcome.

Next, generate the post text, taking into consideration the user inputs, brand guidelines (provided in the brand_guideline.csv file, which you can replace with your own data), and the enhanced image generated from the previous step.

The generate_text_with_claude function is the higher-level function that handles the image and text input, prepares the necessary data, and calls generate_vision_answer to interact with the Amazon Bedrock model (Claude 3 models) and receive the desired response. The generate_vision_answer function performs the core interaction with the Amazon Bedrock model, processes the model’s response, and returns it to the caller. Together, they enable generating text responses based on combined image and text inputs.

In the following code snippet, an initial post prompt is constructed using formatting placeholders for various elements such as role, product name, target brand, tone, hashtag, copywriting, and brand messaging. These elements are provided in the brand_guideline.csv file to make sure that the generated text aligns with the brand preferences and guidelines. This initial prompt is then passed to the generate_text_with_claude function, along with the enhanced image to generate the final post text.

def generate_vision_answer(bedrock:boto3.client, messages:list, model_id:str, claude_config:dict,system_prompt:str):
    """
    Generates a vision answer using the specified model and configuration.
    """
    # Use the Bedrock runtime client passed in by the caller instead of creating a new one
    body = {'messages': [messages], **claude_config, "system": system_prompt}

    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    response = json.loads(response['body'].read().decode('utf-8'))
    print("Claude vision answer OK")
    formatted_response = post_process_answer(response['content'][0]['text'])

    return formatted_response

def generate_text_with_claude(image, prompt):
    '''
    Generate text with Claude for post generation and historical posts analysis
    '''
    with BytesIO() as byte_io:
        image.save(byte_io, format="PNG")
        image_bytes = byte_io.getvalue()

    # The image is saved as PNG above, so the declared media type must match
    messages = {"role": "user", "content": [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(image_bytes).decode(),
            }
        },
        {"type": "text", "text": prompt}
    ]}

    claude_text = generate_vision_answer(bedrock, messages, MODEL_TEXT, CLAUDE_CONFIG, SYSTEM_PROMPT)   
    return claude_text

initial_post_prompt = PROMPT_TEXT.format(
                        role=role, product_name=product_input, target_brand=brand, 
                        tone=tone, hashtag = hashtag, copywriting= copywriting, 
                        brand_messageing = brand_messageing)
        
post_text = generate_text_with_claude(
                    image = image_enhanced, 
                    prompt=initial_post_prompt)
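The `PROMPT_TEXT` template itself is not shown in this post. The following is a hypothetical version that illustrates how the placeholders could be laid out; the wording and sample values are assumptions, and only the placeholder names are taken from the snippet above (including the spelling `brand_messageing`).

```python
# Hypothetical template; only the placeholder names come from the post.
PROMPT_TEXT = (
    "You are a {role}. Write a social media post for the product "
    "'{product_name}' from {target_brand}.\n"
    "Tone: {tone}\n"
    "Hashtags to include: {hashtag}\n"
    "Copywriting guidance: {copywriting}\n"
    "Brand messaging: {brand_messageing}\n"
    "Describe what is visible in the attached image and add emojis where appropriate."
)

# Illustrative values standing in for a row of the brand guideline table.
example_prompt = PROMPT_TEXT.format(
    role="social media copywriter",
    product_name="handbag",
    target_brand="Luxury Brand",
    tone="formal",
    hashtag="#LuxuryBrand #TimelessElegance",
    copywriting="Short sentences, no more than 60 words",
    brand_messageing="Timeless elegance and superior quality",
)
print(example_prompt)
```

Keeping the brand-specific values out of the template means a new brand only needs a new row in the guideline table, not a new prompt.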

The following example shows the generated post text. It provides a detailed description of the product, aligns well with the brand guidelines, and incorporates elements from the image (such as the Christmas tree). Additionally, we instructed the model to include hashtags and emojis where appropriate, and the results demonstrate that it followed the prompt instructions effectively.

Post text:

Elevate your style with Luxury Brand’s latest masterpiece. Crafted with timeless elegance and superior quality, this exquisite bag embodies unique craftsmanship. Indulge in the epitome of sophistication and let it be your constant companion for life’s grandest moments. 🎄✨ #LuxuryBrand #TimelessElegance #ExclusiveCollection

Retrieve and analyze the top three relevant posts

The next step involves using the generated image and text to search for the top three similar historical posts from a vector database. We use the Amazon Titan Multimodal Embeddings model to create embedding vectors, which are stored in Amazon OpenSearch Serverless. The relevant historical posts, which might have many likes, are displayed on the application webpage to give users an idea of what successful social media posts look like. Additionally, we analyze these retrieved posts and provide actionable improvement recommendations for the user. The following code snippet shows the implementation of this step.

The code defines two functions: find_similar_items and process_images. find_similar_items performs semantic search using the k-nearest neighbors (kNN) algorithm on the input image prompt. It computes a multimodal embedding for the image and query prompt, constructs an OpenSearch kNN query, runs the search, and retrieves the top matching images and post texts. process_images analyzes a list of similar images in parallel using multiprocessing. It generates analysis texts for the images by calling generate_text_with_claude with an analysis prompt, running the calls in parallel, and collecting the results.

In the snippet, find_similar_items is called to retrieve the top three similar images and post texts based on the input image and a combined query prompt. process_images is then called to generate analysis texts for the first three similar images in parallel, displaying the results simultaneously.

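def find_similar_items(image_bytes: str, query_prompt: str, k: int, num_results: int, index_name: str, dataset, open_search_client) -> tuple:
    """
    Main semantic search capability using kNN on the input image and query prompt.
    Args:
        image_bytes: input image used to compute the query embedding
        query_prompt: text combined with the image to compute the query embedding
        k: number of nearest neighbors to retrieve from the OpenSearch index
        num_results: maximum number of hits returned by the search
        index_name: index name in OpenSearch
    Returns:
        A tuple (images, texts) with the retrieved images and post texts
    """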
    query_emb = get_titan_multimodal_embedding(image_bytes=image_bytes, description = query_prompt, dimension=1024)["embedding"]

    body = {
        "size": num_results,
        "_source": {
            "exclude": ["image_vector"],
        },
        "query": {
            "knn": {
                "image_vector": {
                    "vector": query_emb,
                    "k": k,
                }
            }
        },
    }     
        
    res = open_search_client.search(index=index_name, body=body)
    images = []
    texts = []
    
    for hit in res["hits"]["hits"]:
        id_ = hit["_id"]
        file_name = hit["_source"]["file_name"]
        post_text = hit["_source"]["post_text"]
        image = get_image(file_name = file_name, dataset = dataset)

        image.name_and_score = f'{hit["_score"]}:{hit["_source"]["file_name"]}'
        images.append(image)

        texts.append(f"Post Text: {post_text}")
                    
    return images, texts

def process_images(_similar_items, PROMPT_ANALYSIS):
    pool = multiprocessing.Pool(processes=3)  # Create a pool of 3 worker processes
    args = [(image, PROMPT_ANALYSIS) for image in _similar_items[:3]]
    results = pool.starmap(generate_text_with_claude, args)  # Execute the function calls in parallel
    # Unpack the results
    analysis_text_0, analysis_text_1, analysis_text_2 = results
    # Close the pool and wait for the tasks to finish
    pool.close()
    pool.join()
    return analysis_text_0, analysis_text_1, analysis_text_2

similar_images, post_texts = find_similar_items(
                                    image_bytes=image_enhanced_bytes, query_prompt=text_input + " " + post_text,
                                    k=5, num_results=3, index_name=index_name, dataset=mapping_table,
                                    open_search_client=oss_client)

analysis_text_0, analysis_text_1, analysis_text_2 = process_images(similar_images, PROMPT_ANALYSIS)
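Because each analysis call mostly waits on a network response, a thread pool is a possible alternative to `multiprocessing.Pool`. The sketch below is a variation, not the post's implementation: `process_images_threaded` and the stub `fake_analyze` are hypothetical names, and in the application `analyze_fn` would be `generate_text_with_claude`.

```python
from concurrent.futures import ThreadPoolExecutor


def process_images_threaded(similar_items, prompt_analysis, analyze_fn, max_workers=3):
    """Run analyze_fn(image, prompt_analysis) for up to three images in parallel.

    Results are returned in the same order as the input images, and the
    function also works when fewer than three images were retrieved.
    """
    items = list(similar_items[:3])
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda image: analyze_fn(image, prompt_analysis), items))


def fake_analyze(image, prompt):
    """Stand-in for the model call so the sketch runs without AWS access."""
    return f"{prompt}: {image}"


results = process_images_threaded(["img_a", "img_b"], "Analyze", fake_analyze)
print(results)
```

Threads avoid the need to pickle images and clients across process boundaries, and returning a list rather than unpacking three values avoids an error when the search returns fewer hits than expected.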

An example of historical post retrieval and analysis is shown in the following screenshot. Post images are listed on the left. On the right, the full text content of each post is retrieved and displayed. We then use an LLM to generate a comprehensive scene description for the post image, which can serve as a prompt to inspire image generation. Next, the LLM generates automatic recommendations for improvement. In this solution, we use the Claude 3 Sonnet model for text generation.

As the final step, the solution incorporates the recommendations and refines the post text to make it more appealing and likely to attract more attention from social media users.
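One way the refinement step could be wired up is to fold the recommendations into a follow-up prompt. The helper below is a hypothetical sketch; the prompt wording is an assumption and is not the prompt used by the solution.

```python
def build_refine_prompt(post_text, recommendations):
    """Combine the draft post and the recommendations into one refinement prompt."""
    # Drop empty entries so the numbered list stays contiguous.
    cleaned = [rec.strip() for rec in recommendations if rec and rec.strip()]
    numbered = "\n".join(f"{i}. {rec}" for i, rec in enumerate(cleaned, start=1))
    return (
        "Here is a draft social media post:\n"
        f"{post_text.strip()}\n\n"
        "Recommendations derived from similar historical posts:\n"
        f"{numbered}\n\n"
        "Rewrite the draft so that it applies these recommendations while "
        "keeping the brand tone, hashtags, and product details unchanged."
    )


refine_prompt = build_refine_prompt(
    "Elevate your style with our latest bag.",
    ["Add a call to action", "", "Mention the holiday setting"],
)
print(refine_prompt)
```

The resulting string could then be passed to the same text generation function used for the initial post.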

Run the solution with Streamlit App

You can download the solution from this Git repository. Use the following steps to run the Streamlit application and quickly test out the solution in your SageMaker Studio environment.

  1. In SageMaker Studio, choose SageMaker Classic, then start an instance under your user profile.
  2. After you have the JupyterLab environment running, clone the code repository and navigate to the streamlit-app folder in a terminal:
    cd streamlit-app/
    sh setup.sh 
    sh run.sh 
    

  3. You will see a webpage link generated in the terminal, which will look similar to the following:

https://[USER-PROFILE-ID].studio.[REGION].sagemaker.aws/jupyter/default/proxy/8501/

  4. To check the status of the Streamlit application, run sh status.sh in the terminal.
  5. To shut down the application, run sh cleanup.sh.

With the Streamlit app downloaded, you can begin by providing initial prompts and selecting the products you want to retain in the image. You have the option to upload an image from your local machine, plug in your camera to take an initial product picture on the fly, or quickly test the solution by selecting a pre-uploaded image example. You can then optionally adjust the product’s location in the image by setting its position. Next, select the brand for the product. In the demo, we use the luxury brand and the fast fashion brand, each with its own preferences and guidelines. Finally, choose the image style. Choose Submit to start the process.

The application will automatically handle post-image and text generation, retrieve similar posts for analysis, and refine the final post. This end-to-end process can take approximately 30 seconds. If you aren’t satisfied with the result, you can repeat the process a few times. An end-to-end demo is shown below.

Inspiration from historical posts using image similarity search

If you find yourself lacking ideas for initial prompts to create the enhanced image, consider using a reverse search approach. During the retrieve and analyze posts step mentioned earlier, scene descriptions are also generated, which can serve as inspiration. You can modify these descriptions as needed and use them to generate new images and accompanying text. This method effectively uses existing content to stimulate creativity and enhance the application’s output.

In the preceding example, the top three similar images to our generated images show perfume pictures posted to social media by users. This insight helps brands understand their target audience and the environments in which their products are used. By using this information, brands can create dynamic and engaging content that resonates with their users. For instance, in the example provided, “a hand holding a glass perfume bottle in the foreground, with a scenic mountain landscape visible in the background,” is unique and visually more appealing than a dull picture of “a perfume bottle standing on a branch in a forest.” This illustrates how capturing the right scene and context can significantly enhance the attractiveness and impact of social media content.

Clean up

When you finish experimenting with this solution, use the following steps to clean up the AWS resources to avoid unnecessary costs:

  1. Navigate to the Amazon S3 console and delete the S3 bucket and data created for this solution.
  2. Navigate to the Amazon OpenSearch Service console, choose Serverless, and then select Collection. Delete the collection that was created for storing the historical post embedding vectors.
  3. Navigate to the Amazon SageMaker console. Choose Admin configurations and select Domains. Select your user profile and delete the running application from Spaces and Apps.

Conclusion

In this blog post, we introduced a multimodal social media content generator solution that uses FMs from Amazon Bedrock, such as the Amazon Titan Image Generator, Claude 3, and Amazon Titan Multimodal Embeddings. The solution streamlines the content creation process, enabling brands and influencers to produce engaging and brand-consistent content rapidly. You can try out the solution using this code sample.

The solution involves enhancing product images with relevant backgrounds using the Amazon Titan Image Generator, generating brand-aligned text descriptions through Claude 3, and retrieving similar historical posts using Amazon Titan Multimodal Embeddings. It provides actionable recommendations to refine content for better audience resonance. This multimodal AI approach addresses challenges in rapid content production, personalization, and brand consistency, empowering creators to boost creativity and engagement while maintaining brand identity.

We encourage brands, influencers, and content teams to explore this solution and use the capabilities of FMs to streamline their content creation processes. Additionally, we invite developers and researchers to build upon this solution, experiment with different models and techniques, and contribute to the advancement of multimodal AI in the realm of social media content generation.

See this announcement blog post for information about the Amazon Titan Image Generator and Amazon Titan Multimodal Embeddings model. For more information, see Amazon Bedrock and Amazon Titan in Amazon Bedrock.


About the Authors

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS, specialising in building GenAI applications with customers, including RAG and agent solutions. Her expertise spans GenAI, ASR, Computer Vision, NLP, and time series prediction models. Outside of work, she enjoys spending quality time with her family, getting lost in novels, and hiking in the UK’s national parks.

Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.

Elevate RAG for numerical analysis using Amazon Bedrock Knowledge Bases

In the realm of generative artificial intelligence (AI), Retrieval Augmented Generation (RAG) has emerged as a powerful technique, enabling foundation models (FMs) to use external knowledge sources for enhanced text generation.

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Knowledge Bases is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. However, RAG has had its share of challenges, especially when it is used for numerical analysis, as is the case when information is embedded in complex nested tables. Recent innovations in Amazon Bedrock Knowledge Bases help resolve this issue.

In this post, we explore how Amazon Bedrock Knowledge Bases address the use case of numerical analysis across a number of documents.

The power of RAG and its limitations

With RAG, an information retrieval component is introduced that utilizes the user input to first pull relevant information from a data source. The user query and the relevant information are both given to the large language model (LLM). The LLM uses the new knowledge and its training data to create better responses.

Although this approach holds a lot of promise for textual documents, the presence of non-textual elements, such as tables, poses a significant challenge. One issue is that the table structure by itself can be difficult to interpret when queried directly in PDF or Word documents. This can be addressed by transforming the data into a format such as text, markdown, or HTML.

Another issue relates to search, retrieval, and chunking of documents that contain tables. The first step in RAG is to chunk a document so you can transform that chunk of data into a vector for a meaningful representation of text. However, when you apply this method to a table, even if converted into a text format, there is a risk that the vector representation doesn’t capture all the relationships in the table. As a result, when you try to retrieve information, a lot of information is missed. Because this information isn’t retrieved, the LLM doesn’t provide accurate answers to your questions.

Amazon Bedrock Knowledge Bases provide three capabilities to resolve this issue:

  • Hybrid search – A hybrid search retrieves information based on semantic meaning through vector representations as well as through keywords. As a result, information on particular key fields that was being missed earlier using purely semantic search is retrieved, and the LLM is able to accurately provide the correct answers. For more information on Amazon Bedrock’s hybrid search capability, see Amazon Bedrock Knowledge Bases now supports hybrid search.
  • Chunking data in fixed sizes – You can specify a fixed size for the data that is eventually transformed into a vector. Small sizes imply smaller amounts of data and vice versa.
  • Retrieving a large number of chunks from your search results – These are the number of chunks retrieved as the result of your search. The greater the number of results retrieved, the more context provided to the LLM for an answer.

Using a combination of these features can enhance numerical analysis of information across multiple documents that contain data in tables. In the next section, we demonstrate this approach using a set of earnings documents from Amazon.
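The three capabilities map onto a small amount of configuration. The sketch below builds the relevant request fragments as plain dictionaries; the helper names and the numeric values (300 tokens, 20 percent overlap, 20 results) are illustrative assumptions, and the field names should be checked against the current Amazon Bedrock API reference.

```python
def build_fixed_size_chunking(max_tokens=300, overlap_percentage=20):
    """Chunking settings for a knowledge base data source (fixed-size strategy)."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }


def build_retrieve_kwargs(knowledge_base_id, query, number_of_results=20, hybrid=True):
    """Keyword arguments for a Retrieve call with hybrid search and a larger result count."""
    vector_config = {"numberOfResults": number_of_results}
    if hybrid:
        vector_config["overrideSearchType"] = "HYBRID"
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_config},
    }


chunking = build_fixed_size_chunking()
retrieve_kwargs = build_retrieve_kwargs("KB_ID_PLACEHOLDER", "What were net sales in Q2?")
print(retrieve_kwargs["retrievalConfiguration"])

# With boto3 (clients assumed to be configured elsewhere):
#   boto3.client("bedrock-agent-runtime").retrieve(**retrieve_kwargs)
#   The chunking dictionary is passed as vectorIngestionConfiguration when
#   creating the data source with the bedrock-agent client.
```

Smaller chunks keep individual table rows together with less surrounding noise, while a larger result count compensates by giving the LLM more of those chunks to reason over.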

Solution overview

The following diagram illustrates the high-level architecture of our solution for analyzing numerical documents.

The user call flow consists of the following steps:

  1. The process begins with the user uploading one or more documents. This action initiates the workflow.
  2. The Streamlit application, which is designed to facilitate user interaction, takes these uploaded documents and stores them in an Amazon Simple Storage Service (Amazon S3) bucket.
  3. After the documents are successfully copied to the S3 bucket, the event automatically invokes an AWS Lambda function.
  4. The Lambda function invokes the Amazon Bedrock knowledge base API to extract embeddings—essential data representations—from the uploaded documents. These embeddings are structured information that capture the core features and meanings of the documents.
  5. With the documents processed and stored, the GUI of the application becomes interactive. Users can now engage with the application by asking questions in natural language through the user-friendly interface.
  6. When a user submits a question, the application converts this query into query embeddings. These embeddings encapsulate the essence of the user’s question, which helps with retrieving the relevant context from the knowledge base.
  7. The Retrieve API queries your knowledge base and returns information retrieved directly from it. The RetrieveAndGenerate API uses the retrieved results to augment the foundation model (FM) prompt and returns the response.
  8. Using a hybrid search method that combines keyword-based and semantic-based techniques, the application searches its knowledge base for relevant information related to the user’s query. This search aims to find contextual answers that match both the explicit terms and the intended meaning of the question.
  9. When relevant context is identified, the application forwards this information (both the user’s query and the retrieved context) to the LLM module.
  10. The LLM module processes the provided query and context to generate a response.
  11. The application delivers the generated response back to the user through its GUI. This completes the loop of interaction, where the user’s initial query results in a comprehensive and contextually relevant response derived from the uploaded documents and the application’s knowledge base.
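The retrieval part of this flow can be sketched as follows. This is a minimal illustration, not the application's actual code: `retrieve_chunks` and `StubRetrieveClient` are hypothetical names, the knowledge base ID is a placeholder, and the stub only stands in for `boto3.client("bedrock-agent-runtime")` so the example runs without AWS credentials.

```python
def retrieve_chunks(client, knowledge_base_id, question, number_of_results=100):
    """Run a hybrid search against a knowledge base and return the chunk texts.

    `client` is assumed to behave like boto3.client("bedrock-agent-runtime").
    """
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": number_of_results,
                "overrideSearchType": "HYBRID",
            }
        },
    )
    # Each result carries the chunk text under content.text
    return [item["content"]["text"] for item in response["retrievalResults"]]


class StubRetrieveClient:
    """Stand-in client so the sketch runs without AWS credentials."""

    def __init__(self, texts):
        self.texts = texts
        self.request = None

    def retrieve(self, **kwargs):
        self.request = kwargs  # record the request for inspection
        return {"retrievalResults": [{"content": {"text": t}} for t in self.texts]}


stub = StubRetrieveClient(["first chunk", "second chunk"])
chunks = retrieve_chunks(stub, "KB_ID_PLACEHOLDER", "What were total operating expenses?")
print(chunks)
```

In a real deployment you would pass a boto3 client instead of the stub; the returned chunk texts are what the application forwards to the LLM together with the question.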

In the following sections, we walk through the steps to create an S3 bucket and knowledge base, deploy the Streamlit application with AWS CloudFormation, and test the solution.

Prerequisites

You should have the following prerequisites:

  • An AWS account with necessary permissions
  • Access to launch AWS CloudFormation
  • Access to the Anthropic Claude 3 Sonnet and Amazon Titan Text Embeddings v2 models on Amazon Bedrock
  • The CloudFormation template downloaded to your local computer

Create an S3 bucket

Complete the following steps to create your S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter a unique bucket name that follows the S3 bucket naming rules.
  4. Choose the AWS Region where you want to create the bucket. We recommend choosing a Region that is geographically close to you.
  5. Leave the other settings at their default values and choose Create bucket.

Create a knowledge base

Complete the following steps to create a knowledge base with default settings:

  1. On the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
  2. Choose Create knowledge base.
  3. In the Provide knowledge base details section, provide the following information:
  4. In the Choose data source section, select the radio button for Amazon S3 and choose Next.
  5. In the Configure data source section, provide the following information:
    • For S3 URI, enter the S3 path for the bucket you created.
    • For chunking and parsing configurations, select the radio button for Custom.
    • For Chunking strategy, choose Fixed-size chunking.
    • For Max tokens, enter 250.
    • For Overlap percentage between chunks, enter 30.
    • Leave everything else as default and choose Next.

  6. In the Select embeddings model and configure vector store section, provide the following information:
    • For Embeddings model, choose Titan Text Embeddings v2.
    • Under Vector database, select Quick create a new vector store.
    • Leave everything else as default and choose Next.
  7. Review the knowledge base settings and choose Create knowledge base.
  8. Amazon Bedrock provisions the necessary resources and sets up the knowledge base for you. This process may take a few minutes to complete. Note the knowledge base ID.
  9. Choose the data source name and note the data source ID.
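The console chunking settings above (fixed-size chunking, 250 max tokens, 30 percent overlap) have an API equivalent. The following sketch builds the dictionary that could be passed as `vectorIngestionConfiguration` to the `create_data_source` call of the boto3 `bedrock-agent` client. The helper name and the range checks are assumptions for illustration, and the overlap-in-tokens figure is only an approximation of how the service applies the percentage.

```python
def build_fixed_size_chunking(max_tokens=250, overlap_percentage=30):
    """Return a vectorIngestionConfiguration dict for fixed-size chunking.

    Hypothetical helper; the range checks are simple sanity checks,
    not a statement of the service's exact limits.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    if not 0 < overlap_percentage < 100:
        raise ValueError("overlap_percentage must be between 1 and 99")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }


ingestion_config = build_fixed_size_chunking()
# Approximate number of tokens shared by neighboring chunks
approx_overlap_tokens = 250 * 30 // 100
print(ingestion_config)
print(approx_overlap_tokens)  # 75
```

With these values, each chunk holds at most 250 tokens and shares roughly 75 of them with its neighbor, which helps keep the rows of a table from being split apart without any shared context.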

Create the Streamlit application

After the knowledge base is set up using the preceding nine steps, complete the following steps to create the Streamlit application using the CloudFormation template:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. Select With new resources (standard).
  4. For the template source, choose Upload a template file.
  5. Choose Choose file and upload the template you downloaded earlier.
  6. Enter a name for your stack.
  7. Configure the following parameters:
    • KnowledgeBase Configuration
      1. For KnowledgeBasedID, enter the knowledge base ID that you saved earlier.
      2. For DatasourceID, enter the data source ID that you saved earlier.
    • S3Bucket Configuration
      1. For RAGDocumentInput, enter the name of the bucket you created.
    • S3Bucket Configuration
      1. For SubnetId, choose your public subnet.
      2. For VpcId, choose the VPC ID in which you want to deploy the Streamlit application.
      3. For YourPublicIP, enter the public IP address from where you access the Streamlit application.
    • S3Bucket Configuration
      1. For InstanceType and LatestLinuxAMI, you can use the default values.
  8. Review the stack details and select the checkbox in the Capabilities section:
    • I acknowledge that AWS CloudFormation might create IAM resources
  9. Choose Create stack to initiate the stack creation process. This will take a few minutes to complete.

When the stack is complete, you can refer to the stack’s Outputs tab for the Streamlit application URL.
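If you prefer to script the deployment instead of using the console, the same parameters can be supplied through the boto3 CloudFormation client. The sketch below only prepares and parses the data structures involved: `build_stack_parameters` and `outputs_to_dict` are hypothetical helpers, every value is a placeholder, and the name of the output that holds the application URL is not stated in this post, so the example simply returns all outputs.

```python
def build_stack_parameters(values):
    """Convert a plain dict into the Parameters list that create_stack expects."""
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


def outputs_to_dict(describe_stacks_response):
    """Flatten the Outputs of the first stack in a describe_stacks response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Parameter names come from the template described above; values are placeholders.
parameters = build_stack_parameters(
    {
        "KnowledgeBasedID": "KB_ID_PLACEHOLDER",
        "DatasourceID": "DATA_SOURCE_ID_PLACEHOLDER",
        "RAGDocumentInput": "BUCKET_NAME_PLACEHOLDER",
    }
)
print(parameters)

# With real credentials, the calls might look like this (not executed here):
#   cfn = boto3.client("cloudformation")
#   cfn.create_stack(StackName=..., TemplateBody=..., Parameters=parameters,
#                    Capabilities=["CAPABILITY_IAM"])
#   outputs = outputs_to_dict(cfn.describe_stacks(StackName=...))
```

The `CAPABILITY_IAM` value corresponds to the acknowledgment checkbox in the console steps.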

Now that we have deployed the Streamlit application, let’s see how users can interact with it and ask questions.

Test the solution

We can divide the UI experience into two phases: document ingestion and document querying.

The document ingestion workflow consists of the following steps:

  1. Users use the Streamlit application to upload documents. For testing, we can use Amazon earnings for the last 16 quarters. The application then uploads the documents to an S3 bucket.
  2. The upload event invokes a Lambda function. This function starts the ingestion job to sync the data to the knowledge base. Using the specified chunking strategy, the knowledge base converts the documents in the S3 bucket to vector embeddings, which are stored in the default Amazon OpenSearch Serverless vector database. This process takes about 15 minutes to complete. You can check the sync history of the data source, which contains details about when the data source was synced. Wait for the sync to complete before you start asking questions in the Streamlit application.
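The Lambda function in step 2 could look roughly like the following. This is a sketch under stated assumptions rather than the function shipped with the template: the environment variable names `KNOWLEDGE_BASE_ID` and `DATA_SOURCE_ID` are hypothetical, and the optional `bedrock_agent` argument exists only so the handler can be exercised with a stub instead of a real boto3 client.

```python
import os
from urllib.parse import unquote_plus


def lambda_handler(event, context, bedrock_agent=None):
    """Start a knowledge base ingestion job when objects land in the S3 bucket."""
    if bedrock_agent is None:
        import boto3  # imported lazily so the sketch can run without AWS access

        bedrock_agent = boto3.client("bedrock-agent")

    # Object keys in S3 event notifications are URL-encoded
    uploaded_keys = [
        unquote_plus(record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]

    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KNOWLEDGE_BASE_ID"],  # assumed variable name
        dataSourceId=os.environ["DATA_SOURCE_ID"],  # assumed variable name
    )
    job = response["ingestionJob"]
    return {
        "uploadedKeys": uploaded_keys,
        "ingestionJobId": job["ingestionJobId"],
        "status": job["status"],
    }
```

The returned job ID can be used to poll the job's status, which is one way to tell when the sync has finished.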

The document querying workflow consists of the following steps:

  1. Users interact with a Streamlit application running on Amazon Elastic Compute Cloud (Amazon EC2), where they enter questions (queries) about the documents and receive responses.
  2. The application passes the queries to the knowledge base.
  3. The knowledge base returns the most relevant text. In this case, because we’re using hybrid search, it returns keyword-based as well as semantic search-based results. In the hybrid search, we maximize the number of chunks returned (to 100).
  4. The application invokes the specified LLM in Amazon Bedrock and passes the retrieved results as context, along with the original query, to the LLM.
  5. The LLM processes the prompts and context, and returns responses.
  6. The Streamlit application displays the results to the user.
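Steps 4 and 5 can be sketched with the Amazon Bedrock Converse API. The prompt wording, the function names, and the inference settings below are assumptions for illustration, and the model ID is assumed to be the Claude 3 Sonnet identifier; `runtime_client` is expected to behave like `boto3.client("bedrock-runtime")`.

```python
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # assumed Claude 3 Sonnet model ID


def build_prompt(question, chunks):
    """Combine the retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def generate_answer(runtime_client, question, chunks, model_id=MODEL_ID):
    """Send the question and retrieved context to the LLM and return its text reply."""
    response = runtime_client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": build_prompt(question, chunks)}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]


print(build_prompt("What changed?", ["chunk A", "chunk B"]))
```

Numbering the chunks in the prompt is a design choice that makes it easier to trace which retrieved passage an answer relies on; with up to 100 chunks, you may also need to watch the model's context limit.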

To test this setup, we deployed this solution in an AWS account and analyzed Amazon earnings for the last 16 quarters. The results from a semantic search with default chunking vs. a hybrid search with fixed size chunking and maximum results are displayed in the following table.

| Question | Semantic Search with Default Chunking | Hybrid Search with Fixed Size Chunking |
| --- | --- | --- |
| What is the change in total operating expenses for technology and infrastructure from 2020 to 2023? | Based on the information provided in the search results, I could not find the exact change in total operating expenses for technology and infrastructure from 2020 to 2023. The search results provide some financial data, but do not break down the operating expenses into specific categories like technology and infrastructure. Without that level of detail, I cannot determine the exact change in those expenses over that time period. | Based on the financial information provided, Amazon’s operating expenses for technology and infrastructure increased from $42.7 billion in 2020 to $85.6 billion in 2023, an increase of $42.9 billion. |
| What are the international operating expenses in 2020, 2021 and 2022? | Based on the search results, I could not find any information about Amazon’s international operating expenses for the years 2020, 2021, and 2022. The financial information provided in the search results focuses on Amazon’s overall operating expenses, cash flows, and other financial metrics, but does not break down the expenses by international operations. | The international operating expenses for Amazon were: 2020: $103,695 million; 2021: $128,711 million; 2022: $125,753 million |

As seen in the results, the hybrid search option provides correct answers to all the questions.
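The figures quoted in the hybrid search answers can be sanity-checked with a few lines of arithmetic. The values below are copied from the answers in the preceding table; the check confirms only that the stated increase is internally consistent, not that the figures match the earnings documents.

```python
from decimal import Decimal

# Values quoted in the hybrid search answers above
tech_infra_billion = {2020: Decimal("42.7"), 2023: Decimal("85.6")}
international_opex_million = {2020: 103_695, 2021: 128_711, 2022: 125_753}

# Change in technology and infrastructure expenses from 2020 to 2023
increase = tech_infra_billion[2023] - tech_infra_billion[2020]
print(f"Technology and infrastructure increase: ${increase} billion")  # $42.9 billion

# Year-over-year change in international operating expenses
years = sorted(international_opex_million)
yoy_change = {
    later: international_opex_million[later] - international_opex_million[earlier]
    for earlier, later in zip(years, years[1:])
}
print(yoy_change)  # {2021: 25016, 2022: -2958}
```

`Decimal` is used for the billion-dollar figures so that the subtraction prints exactly 42.9 rather than a binary floating-point approximation.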

Clean up

To prevent incurring costs after deploying and testing this solution, delete the S3 bucket, knowledge base, and CloudFormation stack.

Conclusion

In this post, we discussed how Amazon Bedrock Knowledge Bases provides a powerful solution that enables numerical analysis on documents. You can deploy this solution in an AWS account and use it to analyze different types of documents. As we continue to push the boundaries of generative AI, solutions like this will play a pivotal role in bridging the gap between unstructured data and actionable insights, enabling organizations to unlock the full potential of their data assets.

To further explore the advanced RAG capabilities of Amazon Bedrock Knowledge Bases, refer to the following resources:


About the Authors

Sanjeev Pulapaka is a Principal Solutions Architect and the Single Threaded Leader for AI/ML in the US federal civilian team at AWS. He advises customers on AI/ML-related solutions that advance their mission. Sanjeev has extensive experience in leading, architecting, and implementing high-impact technology solutions that address diverse business needs in multiple sectors, including commercial, federal, and state and local governments. He has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.

Muhammad Qazafi is a Solutions Architect based in the US. He assists customers in designing, developing, and implementing secure, scalable, and innovative solutions on AWS. His objective is to help customers achieve measurable business outcomes through the effective utilization of AWS services. With over 15 years of experience, Muhammad brings a wealth of knowledge and expertise across a diverse range of industries. This extensive experience enables him to understand the unique challenges faced by different businesses and help customers create solutions on AWS.

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, California. In this role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.


Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart


Today, we are excited to announce the availability of Llama 3.2 models in Amazon SageMaker JumpStart. Llama 3.2 offers multi-modal vision and lightweight models representing Meta’s latest advancement in large language models (LLMs), providing enhanced capabilities and broader applicability across various use cases. With a focus on responsible innovation and system-level safety, these new models demonstrate state-of-the-art performance on a wide range of industry benchmarks and introduce features that help you build a new generation of AI experiences. SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.

In this post, we show how you can discover and deploy the Llama 3.2 11B Vision model using SageMaker JumpStart. We also share the supported instance types and context for all the Llama 3.2 models available in SageMaker JumpStart. Although not highlighted in this blog, you can also use the lightweight models along with fine-tuning using SageMaker JumpStart.

Llama 3.2 models are available in SageMaker JumpStart initially in the US East (Ohio) AWS Region. Please note that Meta has restrictions on your usage of the multi-modal models if you are located in the European Union. See Meta’s community license agreement for more details.

Llama 3.2 overview

Llama 3.2 represents Meta’s latest advancement in LLMs. Llama 3.2 models are offered in various sizes, from lightweight text-only models to small and medium-sized multi-modal models. The larger Llama 3.2 models come in two parameter sizes (11B and 90B) with a 128,000 context length, and are capable of sophisticated reasoning tasks including multi-modal support for high-resolution images. The lightweight text-only models come in two parameter sizes (1B and 3B) with a 128,000 context length, and are suitable for edge devices. Additionally, there is a new safeguard model, Llama Guard 3 11B Vision, which is designed to support responsible innovation and system-level safety.

Llama 3.2 is the first Llama model to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. With a focus on responsible innovation and system-level safety, Llama 3.2 models help you build and deploy cutting-edge generative AI models to ignite new innovations like image reasoning and are also more accessible for on-edge applications. The new models are also designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker Inference instances, including AWS Trainium and AWS Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This enforces data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process.

Prerequisites

To try out the Llama 3.2 models in SageMaker JumpStart, you need the following prerequisites:

Discover Llama 3.2 models in SageMaker JumpStart

SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive IDE that offers a unified, web-based interface for performing all aspects of the ML development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment on SageMaker Inference.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI/ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.

Deploy Llama 3.2 multi-modality models for inference using SageMaker JumpStart

On the SageMaker JumpStart landing page, you can discover all public pre-trained models offered by SageMaker. You can choose the Meta model provider tab to discover all the Meta models available in SageMaker.

If you’re using SageMaker Studio Classic and don’t see the Llama 3.2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Classic Apps.

You can choose the model card to view details about the model, such as its license, the data used to train it, and how to use it. You can also find two buttons, Deploy and Open Notebook, which help you use the model.

When you choose either button, a pop-up window will show the End-User License Agreement (EULA) and acceptable use policy for you to accept.

Upon acceptance, you can proceed to the next step to use the model.

Deploy Llama 3.2 11B Vision model for inference using the Python SDK

When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker.

You can deploy a Llama 3.2 11B Vision model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

accept_eula = True  # set to True only after you have reviewed and accepted the EULA
model = JumpStartModel(model_id="meta-vlm-llama-3-2-11b-vision")
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How are you doing today"},
        {"role": "assistant", "content": "Good, what can i help you with today?"},
        {"role": "user", "content": "Give me 5 steps to become better at tennis?"}
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512,
    "logprobs": False
}
response = predictor.predict(payload)
response_message = response['choices'][0]['message']['content']
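Indexing directly into the response raises an unhelpful `IndexError` or `KeyError` if the endpoint returns an unexpected body. A small, hypothetical helper such as the following makes that failure explicit. It assumes only the response fields used in the preceding snippet, and the sample response is made up for illustration.

```python
def extract_reply(response):
    """Return the assistant text from a chat-style response dictionary."""
    choices = response.get("choices", [])
    if not choices:
        raise ValueError("The response contains no choices")
    return choices[0]["message"]["content"]


# Made-up response with the same shape as the one indexed above
sample_response = {"choices": [{"message": {"content": "Example reply"}}]}
print(extract_reply(sample_response))  # Example reply
```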

Recommended instances and benchmark

The following table lists all the Llama 3.2 models available in SageMaker JumpStart along with the model_id, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

| Model Name | Model ID | Default instance type | Supported instance types |
| --- | --- | --- | --- |
| Llama-3.2-1B | meta-textgeneration-llama-3-2-1b, meta-textgenerationneuron-llama-3-2-1b | ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length) | All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge |
| Llama-3.2-1B-Instruct | meta-textgeneration-llama-3-2-1b-instruct, meta-textgenerationneuron-llama-3-2-1b-instruct | ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length) | All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge |
| Llama-3.2-3B | meta-textgeneration-llama-3-2-3b, meta-textgenerationneuron-llama-3-2-3b | ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length) | All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge |
| Llama-3.2-3B-Instruct | meta-textgeneration-llama-3-2-3b-instruct, meta-textgenerationneuron-llama-3-2-3b-instruct | ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length) | All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge |
| Llama-3.2-11B-Vision | meta-vlm-llama-3-2-11b-vision | ml.p4d.24xlarge (125K context length) | p4d.24xlarge, p4de.24xlarge, p5.48xlarge |
| Llama-3.2-11B-Vision-Instruct | meta-vlm-llama-3-2-11b-vision-instruct | ml.p4d.24xlarge (125K context length) | p4d.24xlarge, p4de.24xlarge, p5.48xlarge |
| Llama-3.2-90B-Vision | meta-vlm-llama-3-2-90b-vision | ml.p5.24xlarge (125K context length) | p4d.24xlarge, p4de.24xlarge, p5.48xlarge |
| Llama-3.2-90B-Vision-Instruct | meta-vlm-llama-3-2-90b-vision-instruct | ml.p5.24xlarge (125K context length) | p4d.24xlarge, p4de.24xlarge, p5.48xlarge |
| Llama-Guard-3-11B-Vision | meta-vlm-llama-guard-3-11b-vision | ml.p4d.24xlarge | p4d.24xlarge, p4de.24xlarge, p5.48xlarge |
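To use the model IDs from this table in code, one option is a small lookup combined with a deployment helper that accepts an instance type override. The following is a sketch, not an official utility: the dictionary and function names are hypothetical, and treating the `textgenerationneuron` IDs as the variants for AWS Trainium and AWS Inferentia instances is an assumption based on their names.

```python
# Model IDs copied from the table: name -> (GPU model ID, Neuron model ID or None)
LLAMA_3_2_MODEL_IDS = {
    "Llama-3.2-1B": ("meta-textgeneration-llama-3-2-1b", "meta-textgenerationneuron-llama-3-2-1b"),
    "Llama-3.2-1B-Instruct": ("meta-textgeneration-llama-3-2-1b-instruct", "meta-textgenerationneuron-llama-3-2-1b-instruct"),
    "Llama-3.2-3B": ("meta-textgeneration-llama-3-2-3b", "meta-textgenerationneuron-llama-3-2-3b"),
    "Llama-3.2-3B-Instruct": ("meta-textgeneration-llama-3-2-3b-instruct", "meta-textgenerationneuron-llama-3-2-3b-instruct"),
    "Llama-3.2-11B-Vision": ("meta-vlm-llama-3-2-11b-vision", None),
    "Llama-3.2-11B-Vision-Instruct": ("meta-vlm-llama-3-2-11b-vision-instruct", None),
    "Llama-3.2-90B-Vision": ("meta-vlm-llama-3-2-90b-vision", None),
    "Llama-3.2-90B-Vision-Instruct": ("meta-vlm-llama-3-2-90b-vision-instruct", None),
    "Llama-Guard-3-11B-Vision": ("meta-vlm-llama-guard-3-11b-vision", None),
}


def resolve_model_id(name, neuron=False):
    """Look up the JumpStart model ID for a model name from the table."""
    gpu_id, neuron_id = LLAMA_3_2_MODEL_IDS[name]
    if neuron:
        if neuron_id is None:
            raise ValueError(f"{name} has no Neuron model ID in the table")
        return neuron_id
    return gpu_id


def deploy_llama(name, instance_type=None, accept_eula=False, neuron=False):
    """Deploy a model, optionally overriding the default instance type."""
    if not accept_eula:
        raise ValueError("Pass accept_eula=True after reviewing the model's EULA")
    from sagemaker.jumpstart.model import JumpStartModel  # requires the SageMaker Python SDK

    kwargs = {"model_id": resolve_model_id(name, neuron=neuron)}
    if instance_type is not None:
        kwargs["instance_type"] = instance_type  # for example, a supported type from the table
    return JumpStartModel(**kwargs).deploy(accept_eula=True)


print(resolve_model_id("Llama-3.2-11B-Vision"))
```

Calling `deploy_llama` creates a billable endpoint, so it is left out of the runnable part of the example.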

Llama 3.2 models have been evaluated on over 150 benchmark datasets, demonstrating competitive performance with leading FMs.

Inference and example prompts for Llama-3.2 11B Vision

You can use Llama 3.2 11B and 90B models for text and image or vision reasoning use cases. You can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more. The input payload to the endpoint looks like the following code examples.

Text-only input

The following is an example of text-only input:

{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How are you doing today"},
        {"role": "assistant", "content": "Good, what can i help you with today?"},
        {"role": "user", "content": "Give me 5 steps to become better at tennis?"}
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512,
    "logprobs": False
}

This produces the following response:

Here are 5 steps to help you improve your tennis game:

**Step 1: Develop a Consistent Serve**

A strong serve is the foundation of a good tennis game. Practice your serve regularly to develop a consistent and powerful shot. Focus on proper footwork, grip, and body positioning. Experiment with different serves, such as topspin and slice, to keep your opponent guessing.

**Step 2: Improve Your Footwork and Agility**

Good footwork is essential for getting into position to hit shots and covering the court. Practice shuffling, sliding, and turning to improve your agility and reaction time. Incorporate exercises like ladder drills and cone drills into your training to enhance your foot speed and agility.

**Step 3: Focus on Your Volley and Overhead Shots**

The volley and overhead shots are critical for winning points at the net. Practice hitting volleys and overheads with both forehand and backhand grips. Focus on keeping your racquet close to the ball and using your wrist and arm to generate power.

**Step 4: Work on Your Return of Serve**

The return of serve is a crucial shot in tennis, as it sets up the point for your opponent. Practice returning serves with different types of spin, such as topspin and slice. Focus on getting your racquet head to the ball early and using your legs to generate power.

**Step 5: Analyze Your Game and Practice with a Purpose**

To improve your game, you need to identify areas for improvement and practice with a purpose. Record your matches and analyze your game to identify weaknesses and areas for improvement. Create a practice plan that targets specific areas, such as your forehand or backhand, and focus on making progress in those areas.

Remember, improvement takes time and practice. Stay committed, and with consistent effort, you'll see improvement in your tennis game!

Single-image input

You can set up vision-based reasoning tasks with Llama 3.2 models with SageMaker JumpStart as follows:

import requests
import base64

def url_to_base64(image_url):
    # Download the image
    response = requests.get(image_url)
    if response.status_code != 200:
        return None

    # Encode the image content to base64
    image_base64 = base64.b64encode(response.content).decode('utf-8')
    return image_base64

Let’s load an image from the open source MATH-Vision dataset:

url = "https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg"
image_data = url_to_base64(url)

We can structure the message object with our base64 image data:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Which of these figures differs from the rest four?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"""data:image/jpeg;base64,{image_data}"""
          }
        }
      ]
    }
  ],
  "temperature": 0.4,
  "top_p": 0.95,
  "max_tokens": 512,
  "logprobs": False
}

This produces the following response:

The figure labeled E is the one that differs from the rest.

Multi-image input

The following code is an example of multi-image input:

image_data1 = url_to_base64("https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2023/04/27/DBBLOG-2448-SingleAZInstance.png")
image_data2 = url_to_base64("https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2023/04/27/DBBLOG-2448-MultiAZInstance.png")
payload = {
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "There are two images. Describe each one and why they are related"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"""data:image/jpeg;base64,{image_data1}"""
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"""data:image/jpeg;base64,{image_data2}"""
          }
        }
      ]
    }
  ],
  "temperature": 0.6,
  "top_p": 0.9,
  "max_tokens": 512,
  "logprobs": False
}

This produces the following response:

The two images depict different architectures for a distributed database system. The main difference between them is the location of the database instances and how they are connected.

**Image 1: Centralized Architecture**

In this image, there is a single database instance (DB1) that serves as the central repository for all data. This instance is connected to multiple application servers (App1, App2, App3) through a network. The application servers interact with the database instance to read and write data.

**Image 2: Distributed Architecture**

In this image, there are multiple database instances (DB1, DB2, DB3) that are distributed across different locations. Each database instance is connected to a local application server (App1, App2, App3) through a network. The application servers interact with their local database instances to read and write data.

The key differences between the two images are:

* **Scalability**: The distributed architecture (Image 2) allows for easier scalability, as new database instances can be added or removed as needed, without affecting the overall system.
* **Availability**: The distributed architecture provides higher availability, as if one database instance goes down, the other instances can still serve requests.
* **Performance**: The centralized architecture (Image 1) may experience performance issues if the central database instance becomes a bottleneck, while the distributed architecture can distribute the load across multiple instances.
* **Data consistency**: The centralized architecture may have easier data consistency management, as all data is stored in a single location, while the distributed architecture requires more complex mechanisms to ensure data consistency across multiple instances.

In summary, the centralized architecture is suitable for small to medium-sized applications with low traffic, while the distributed architecture is more suitable for large-scale applications with high traffic and scalability requirements.
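The single-image and multi-image payloads shown earlier share one structure, so a small builder can produce either. The following is a sketch with hypothetical helper names. It takes raw image bytes together with a MIME type so that the data URI can match the actual file format; whether the endpoint requires the MIME type to match is not stated in this post.

```python
import base64


def image_part(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes as an image_url content part with a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def build_vision_payload(prompt, images, temperature=0.6, top_p=0.9, max_tokens=512):
    """Build a chat payload with one text part followed by any number of images.

    `images` is a list of (image_bytes, mime_type) tuples.
    """
    content = [{"type": "text", "text": prompt}]
    content.extend(image_part(data, mime) for data, mime in images)
    return {
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "logprobs": False,
    }


# Placeholder bytes stand in for real image data
payload = build_vision_payload(
    "There are two images. Describe each one and why they are related",
    [(b"abc", "image/png"), (b"abc", "image/png")],
)
print(len(payload["messages"][0]["content"]))  # 3
```

The resulting dictionary can be passed to `predictor.predict` in the same way as the hand-written payloads.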

Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, to use the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the Llama 3.2 endpoint that you deployed and choose it.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 3.2 models today. For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.


About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS
Armando Diaz is a Solutions Architect at AWS
Sharon Yu is a Software Development Engineer at AWS
Siddharth Venkatesan is a Software Development Engineer at AWS
Tony Lian is a Software Engineer at AWS
Evan Kravitz is a Software Development Engineer at AWS
Jonathan Guinegagne is a Senior Software Engineer at AWS
Tyler Osterberg is a Software Engineer at AWS
Sindhu Vahini Somasundaram is a Software Development Engineer at AWS
Hemant Singh is an Applied Scientist at AWS
Xin Huang is a Senior Applied Scientist at AWS
Adriana Simmons is a Senior Product Marketing Manager at AWS
June Won is a Senior Product Manager at AWS
Karl Albertsen is the Head of ML Algorithm and JumpStart at AWS


Vision use cases with Llama 3.2 11B and 90B models from Meta


Today, we are excited to announce the availability of Llama 3.2 in Amazon SageMaker JumpStart and Amazon Bedrock. The Llama 3.2 models are a collection of state-of-the-art pre-trained and instruct fine-tuned generative AI models that come in various sizes, from lightweight text-only 1B and 3B parameter models suitable for edge devices to small and medium-sized 11B and 90B parameter models capable of sophisticated reasoning tasks, including multimodal support for high-resolution images. SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, like Meta, through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this post, we demonstrate how you can use Llama 3.2 11B and 90B models for a variety of vision-based use cases. This is the first time Meta’s Llama models have been released with vision capabilities. These new capabilities expand the usability of Llama models from their traditional text-only applications. The vision-based use cases that we discuss in this post include document visual question answering, extracting structured entity information from images, and image captioning.

Overview of Llama 3.2 11B and 90B Vision models

The Llama 3.2 collection of multimodal and multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in a variety of sizes. The 11B and 90B models are multimodal—they support text in/text out, and text+image in/text out.

Llama 3.2 11B and 90B are the first Llama models to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. The new models are designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications. All Llama 3.2 models support a 128,000-token context length, maintaining the expanded token capacity introduced in Llama 3.1. Additionally, the models offer improved multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.2 models are available today for inferencing in SageMaker JumpStart and Amazon Bedrock. With SageMaker JumpStart, you can access Llama 3.2 models initially in the US East (Ohio) AWS Region, which supports the required instance types. Meta’s Llama 3.2 90B and 11B models are also available in Amazon Bedrock in the US West (Oregon) Region, and in the US East (Ohio, N. Virginia) Regions via cross-region inference. Llama 3.2 1B and 3B models are available in the US West (Oregon) and Europe (Frankfurt) Regions, and in the US East (Ohio, N. Virginia) and Europe (Ireland, Paris) Regions via cross-region inference, with expanded regional availability planned for the future.

Solution overview

In the following sections, we walk through how to configure Llama 3.2 vision models in Amazon Bedrock and Amazon SageMaker JumpStart for vision-based reasoning. We also demonstrate use cases for document question answering, entity extraction, and caption generation.

For the examples shown in this post, we use the Llama 3.2 90B model unless otherwise noted. The fashion images are from the Fashion Product Images Dataset. Caption generation images are from Human Preference Synthetic Dataset. The interior design and real estate images are from the Interior design dataset.

Prerequisites

The following prerequisites are needed to implement the steps outlined in this post:

For information about how to set up Llama 3.2 model access for Amazon Bedrock, see the launch post. For details on creating model endpoints in SageMaker JumpStart, refer to the launch post.

Configure Llama 3.2 for vision-based reasoning in Amazon Bedrock

To set up vision-based reasoning tasks with Llama 3.2 models in Amazon Bedrock, use the following code snippet:

import os
import boto3
import json
import base64
from botocore.config import Config

# Initialize the Bedrock runtime client in the chosen Region
config = Config(
    region_name=os.getenv("BEDROCK_REGION", "us-west-2"),
)
bedrock_runtime = boto3.client('bedrock-runtime', config=config)
# Model ID for Llama 3.2 90B (no leading or trailing whitespace)
MODEL_ID = "us.meta.llama3-2-90b-instruct-v1:0"

Amazon Bedrock supports the messages object as part of the Converse API. With the Converse API, you don’t have to convert the image into base64 (compared to SageMaker JumpStart).

You can read the image with the following code:

# Read the image
image_path = "<your_file_path>"  # Replace with the actual path to your image
try:
    # Open the image file and read its contents
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # The Converse API accepts raw bytes, so no base64 encoding is needed
    image_data = image_bytes
except FileNotFoundError:
    print(f"Image file not found at {image_path}")
    image_data = None

Use the following code to create a messages object:

# Construct the messages for the model input
# prompt holds the question to ask about the image
messages = [
    {
        "role": "user",
        "content": [
            {
                "text": prompt
            },
            {
                "image": {
                    "format": "<your_file_format>",
                    "source": {
                        "bytes": image_data
                    }
                }
            }
        ]
    }
]
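The `format` placeholder in the messages object has to be filled in per image. The following is a minimal sketch of one way to derive it from the file extension; the helper names `converse_image_format` and `build_image_message` are hypothetical, and the set of accepted formats reflects the image formats the Converse API documents (png, jpeg, gif, webp), which you should confirm against the current API reference.

```python
from pathlib import Path

# Image formats accepted in Converse API image blocks (verify against the API reference).
SUPPORTED_FORMATS = {"png", "jpeg", "gif", "webp"}


def converse_image_format(image_path: str) -> str:
    """Map a file extension to a Converse API image format string."""
    ext = Path(image_path).suffix.lower().lstrip(".")
    if ext == "jpg":  # the API expects "jpeg", not "jpg"
        ext = "jpeg"
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported image format: {ext or '(none)'}")
    return ext


def build_image_message(prompt: str, image_bytes: bytes, image_path: str) -> list:
    """Build a single-turn messages list holding a text prompt and one image."""
    return [
        {
            "role": "user",
            "content": [
                {"text": prompt},
                {
                    "image": {
                        "format": converse_image_format(image_path),
                        "source": {"bytes": image_bytes},
                    }
                },
            ],
        }
    ]
```

The returned list can be passed directly as the `messages` argument of `converse`.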

Invoke the Amazon Bedrock Converse API as follows:

try:
    # Invoke the Amazon Bedrock Converse API
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,  # MODEL_ID defined at the beginning
        messages=messages,  # messages is already a list, so pass it as is
        inferenceConfig={
            "maxTokens": 4096,
            "temperature": 0,
            "topP": 0.1
        },
    )

    # Read the response
    print(response['output']['message']['content'][0]['text'])

except Exception as e:
    print(f"An error occurred while invoking the model: {str(e)}")
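Indexing `content[0]` assumes the reply holds exactly one text block. As a slightly more defensive sketch, the helpers below join every text block in a Converse response and summarize the stop reason and token usage. The function names are hypothetical, and `sample_response` is a hand-written stand-in shaped like a Converse response, not real model output.

```python
def extract_converse_text(response: dict) -> str:
    """Concatenate all text blocks in a Converse API response."""
    blocks = response.get("output", {}).get("message", {}).get("content", [])
    return "".join(block["text"] for block in blocks if "text" in block)


def describe_usage(response: dict) -> str:
    """Summarize the stop reason and token counts reported with the response."""
    usage = response.get("usage", {})
    return (
        f"stopReason={response.get('stopReason')}, "
        f"inputTokens={usage.get('inputTokens')}, "
        f"outputTokens={usage.get('outputTokens')}"
    )


# Hand-written stand-in for a response, used only to exercise the helpers.
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [{"text": "The kangaroo is inside 3 circles."}],
        }
    },
    "stopReason": "end_turn",
    "usage": {"inputTokens": 20, "outputTokens": 9, "totalTokens": 29},
}

print(extract_converse_text(sample_response))
print(describe_usage(sample_response))
```

A `stopReason` of `max_tokens` is a useful signal that the answer was cut off and `maxTokens` may need to be raised.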

Configure Llama 3.2 for vision-based reasoning in SageMaker

You can set up vision-based reasoning tasks with Llama 3.2 vision models on a SageMaker endpoint with the following code snippet (refer to the Llama 3.2 in SageMaker JumpStart blog post to set up the inference endpoint):

import boto3
import json
import base64

# Initialize the SageMaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')
endpoint_name = '<model-endpoint>'  # Replace with your actual endpoint name

SageMaker JumpStart deployment can also take in a Messages API style messages object as the input (similar to the Amazon Bedrock Converse API). First, the image needs to be read into a base64 format before sending it through the messages object.

Read the image with the following code:

# Read and encode the image
image_path = "<your_file_path>"  # Replace with the actual path to your image
try:
    # Open the image file and read its contents
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # Encode the image bytes to base64
    image_data = base64.b64encode(image_bytes).decode('utf-8')
    image_media_type = 'image/jpeg'  # Adjust if using a different image format
except FileNotFoundError:
    print(f"Image file not found at {image_path}")
    image_data = None
    image_media_type = None

Create a messages object with the following code:

# Create a data URL for the image, using the media type set when reading the file
my_url = f"data:{image_media_type};base64,{image_data}"

# Construct the messages for the model input
messages = [    
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": prompt
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": my_url
                }
            }
        ]
    }
]

In the preceding code, prompt is the question we ask the model about the image.
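If you work with more than one image format, the media type in the data URL has to match the file. The following sketch uses Python's standard `mimetypes` module to guess it from the file name; the helper name `image_to_data_url` and the JPEG fallback are assumptions made for illustration.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str, default_type: str = "image/jpeg") -> str:
    """Read an image file and return it as a base64 data URL."""
    media_type, _ = mimetypes.guess_type(image_path)
    # Fall back to the default when the extension is unknown or not an image.
    if media_type is None or not media_type.startswith("image/"):
        media_type = default_type
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{media_type};base64,{encoded}"
```

The returned string can be used as the `url` value of the `image_url` entry in the messages object.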

After you create the messages object, you can send that as payload to the SageMaker endpoint:

# Assumed minimal payload: the messages list under the "messages" key
# (add generation parameters supported by your endpoint as needed)
payload = {"messages": messages}

try:
    # Invoke the SageMaker endpoint
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )

    # Read the response body
    response_body = response['Body'].read()

    if response_body:
        try:
            # Parse the JSON response
            result = json.loads(response_body.decode('utf-8'))
            # Print the model's response
            print(result['choices'][0]['message']['content'])
        except json.JSONDecodeError as json_err:
            print(f"Failed to parse JSON: {json_err}")
            # response_body is raw bytes here, so print it undecoded as JSON
            print(f"Raw response: {response_body.decode('utf-8', errors='replace')}")
    else:
        print("The response body is empty")

except Exception as e:
    print(f"An error occurred while invoking the endpoint: {str(e)}")

Document question answering

In this section, we present example use cases for document question answering. We focus on two use cases: answering questions about financial results slides that contain complex graphical information, and solving math-based problems presented in an image.

Financial results slides Q&A

Llama 3.2 vision models can reason about and answer questions based on images that contain complicated information, such as plots, graphs, and tables. Let’s use the model as a financial analyst assistant that analyzes slides with different types of financial information to answer our questions. We use the Amazon Q2 2024 Financial Results Conference Calls Slides document to ask Llama 3.2 some questions based on this slide deck.

For our first prompt, we ask “How is the sales trending over the last twelve months?” with the following slide.

We get the following response:
The sales are trending up 12% year over year. The bar graph shows that the sales have increased from $538,046 in Q2 2023 to $604,334 in Q1 2024, and further to $604,334 in Q2 2024. This indicates a steady growth in sales over the last twelve months.

Next, we ask “Can you please help me understand this slide? How are the operating and free cash flow trending the last 12 month period? What is the significant contributor to the free cash flow trend?” We use the following slide.

We get the following response:
The slide presents a table with two main sections: "Operating Cash Flow" and "Free Cash Flow Less Equipment Finance Leases and Principal Repayments of All Other Finance Leases and Financing Obligations." The table is divided into five columns, each representing a quarter from Q2 2023 to Q2 2024.

**Operating Cash Flow Trend:**

*   Q2 2023: $61,841

*   Q3 2023: $71,654

*   Q4 2023: $84,946

*   Q1 2024: $99,147

*   Q2 2024: $107,952

The operating cash flow has consistently increased over the five quarters, indicating a positive trend.

**Free Cash Flow Trend:**

*   Q2 2023: $6,734

*   Q3 2023: $20,241

*   Q4 2023: $35,549

*   Q1 2024: $48,778

*   Q2 2024: $51,448

The free cash flow has also shown an upward trend, with a significant increase from Q2 2023 to Q2 2024.

**Significant Contributor to Free Cash Flow Trend:**

The significant contributor to the free cash flow trend is the "Principal repayments of all other finance leases" and "Principal repayments of financing obligations." These two items have consistently increased over the five quarters, contributing to the overall growth in free cash flow.

In summary, both operating cash flow and free cash flow have demonstrated a positive trend over the last 12-month period. The significant contributor to the free cash flow trend is the increase in principal repayments of all other finance leases and financing obligations.

Visual math question answering

Another interesting problem for visual question answering is solving visual math and logic puzzles. Llama 3.2 vision models are highly capable of solving and answering visual math questions. We show a few examples here.

For our first prompt, we use the following image and ask “The kangaroo is inside how many circles?”

We get the following response:
The kangaroo is inside 3 circles.

Next, we use the following image and ask “Which of these figures differ from the other four?”

We get the following response:
The figure labeled E is the only one with a different shape for the skirt. The other figures have triangular skirts, while figure E has a rectangular skirt.

Entity extraction

Extracting information about products is an important use case for the retail industry. On platforms supporting third-party sellers, product images are the most readily available resource for gathering information about a product. Accurately extracting that information is therefore essential to the product’s success. In this section, we show a couple of examples of clothing product images and how you can use Llama 3.2 to extract product information that can support a listing in an ecommerce application.

We first use the following images and the prompt “Can you please extract all information about the product in the image? Think very carefully in a step-by-step manner to construct a json file to organize all the information about the product. Please consider formulating the json file so that the extracted information can be stored in a database for further use in a retail ecommerce application. Return the json file as the output.”

For the pink dress, we get the following output:
Based on the image, here is a JSON representation of the product information:

```
{
"product_id": "unknown",
"name": "Pink Floral Dress",
"description": "A sleeveless dress with a pink floral pattern and a pink trim at the bottom",
"category": "Dresses",
"sub_category": "Casual Dresses",
"color": "Pink",
"size": "unknown",
"price": "unknown",
"brand": "unknown",
"material": "unknown",
"features": [
"Sleeveless",
"Floral pattern",
"Pink trim at the bottom"
],
"images": [
{
"url": "image_url",
"alt_text": "Pink Floral Dress"
}
]
}
```

Please note that some information such as product_id, size, price, brand, and material are unknown as they are not visible in the image. Also, the image_url is a placeholder and should be replaced with the actual URL of the image.

For the blue cap, we get the following output:

Based on the image, here is a JSON representation of the product information:

```json
{
"product_id": "NIKE-CAP-001",
"name": "Nike Cap",
"description": "A blue baseball cap with a white Nike logo",
"category": "Accessories",
"subcategory": "Hats",
"brand": "Nike",
"color": "Blue",
"size": "One Size",
"material": "Cotton",
"price": 29.99,
"image_url": "https://example.com/nike-cap.jpg",
"features": [
"Adjustable strap in the back",
"Embroidered Nike logo on the front"
]
}
```

This JSON object includes the following fields:

* `product_id`: a unique identifier for the product
* `name`: the name of the product
* `description`: a brief description of the product
* `category` and `subcategory`: the categories and subcategories that the product belongs to
* `brand`: the brand of the product
* `color`: the color of the product
* `size`: the size of the product
* `material`: the material that the product is made of
* `price`: the price of the product
* `image_url`: the URL of the product image
* `features`: an array of features that describe the product

Note that this is just one possible way to represent the product information in JSON. Depending on the specific requirements of your application, you may need to add or remove fields, or use different data types for certain fields.
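Because the model wraps the JSON in explanatory prose and a fenced block, the response needs a parsing step before it can be stored in a database. The following is a minimal sketch of that step; the helper name `extract_json` is hypothetical, and it assumes the response contains a single JSON object, either inside a fenced block or inline.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to keep this example's own fence intact
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(response_text: str) -> dict:
    """Pull the first JSON object out of a model response and parse it."""
    match = FENCED_BLOCK.search(response_text)
    if match:
        candidate = match.group(1)
    else:
        # No fenced block: fall back to the outermost pair of braces.
        start, end = response_text.find("{"), response_text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found in the response")
        candidate = response_text[start:end + 1]
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    return json.loads(candidate)
```

Note that the two outputs above use different key names (`sub_category` versus `subcategory`), so downstream code should either normalize keys or specify the expected schema in the prompt.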

Caption generation

Another important and useful capability of vision models is to generate captions automatically for images. In this section, we showcase caption generation with a simple caption, a more creative and funnier caption, and a marketing style caption.

Simple captioning

For simple captioning, we use the following image and the prompt “Please provide a caption for this image. Be creative but respectful.”

We get the following response:

The image shows a young woman sitting in a dentist's chair, with a female dentist examining her teeth. The woman is wearing a bib to protect her clothes from any water or other substances that may be used during the examination. The dentist is wearing a white lab coat and a mask, and she is holding a dental mirror and a probe. The background of the image is a dental office, with various dental tools and equipment visible.

The overall atmosphere of the image is one of professionalism and care. The dentist appears to be focused on her work, and the woman seems to be relaxed and comfortable. The image suggests that the woman is receiving a routine dental checkup or cleaning, and that the dentist is taking good care of her.

Creative captioning

For our second example, we want a more creative and funny tone. We use the same image and the prompt “Please provide a short caption for this image. Be creative and funny but respectful. This is meant to be shared on a social network application.”

We get the following response:

"Brace yourself for a sparkling smile, because this dentist is about to make your teeth shine brighter than a disco ball!"

Listing generation

To generate a real estate listing, we use the following image and the prompt “You are a real estate marketing expert. Please create a listing description with the attached image of a property.”

We get the following output:

**Welcome to Your Dream Home**

This stunning property boasts an open-plan living area that seamlessly blends style and functionality. The spacious living room, dining area, and kitchen are perfect for entertaining, with plenty of natural light pouring in through the large windows. The sleek, modern kitchen features high-end appliances and ample counter space, making meal prep a breeze.

The living room is a cozy retreat, with a comfortable sofa and chairs arranged around a stylish coffee table. The adjacent dining area is perfect for hosting dinner parties, with a large table that seats eight.

But the real showstopper is the stunning view of the water, which can be enjoyed from the comfort of your own home. Imagine sipping your morning coffee or enjoying a glass of wine in the evening while taking in the breathtaking scenery.

This property is the perfect blend of form and function, offering the ultimate in comfort and style. Don't miss out on this incredible opportunity to make it your own.

**Key Features:**

* Open-plan living area with kitchen, dining area, and living room
* Modern kitchen with high-end appliances and ample counter space
* Comfortable living room with sofa and chairs
* Large dining table that seats eight
* Stunning view of the water
* Perfect for entertaining

**Don't miss out on this incredible opportunity. Contact us today to schedule a viewing.**
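The three captioning styles differ only in the prompt, so they can share one request builder. The sketch below returns keyword arguments for the Amazon Bedrock Converse API; the function name `build_caption_request`, the `maxTokens` value, and the per-style temperatures are illustrative assumptions rather than tuned settings from this post.

```python
# Prompts taken from the examples in this section.
CAPTION_PROMPTS = {
    "simple": "Please provide a caption for this image. Be creative but respectful.",
    "creative": (
        "Please provide a short caption for this image. Be creative and funny "
        "but respectful. This is meant to be shared on a social network application."
    ),
    "listing": (
        "You are a real estate marketing expert. Please create a listing "
        "description with the attached image of a property."
    ),
}

# Assumed sampling temperatures per style; adjust for your use case.
TEMPERATURES = {"simple": 0.0, "creative": 0.7, "listing": 0.5}


def build_caption_request(model_id: str, style: str, image_bytes: bytes,
                          image_format: str = "jpeg") -> dict:
    """Build keyword arguments for a Converse API call for one caption style."""
    if style not in CAPTION_PROMPTS:
        raise ValueError(f"Unknown caption style: {style}")
    return {
        "modelId": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"text": CAPTION_PROMPTS[style]},
                    {"image": {"format": image_format,
                               "source": {"bytes": image_bytes}}},
                ],
            }
        ],
        "inferenceConfig": {"maxTokens": 512, "temperature": TEMPERATURES[style]},
    }
```

With a Bedrock runtime client, a call would then look like `bedrock_runtime.converse(**build_caption_request(MODEL_ID, "creative", image_bytes))`.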

Conclusion

In this post, we demonstrated the diverse vision capabilities of the Llama 3.2 11B and 90B models from Meta. Llama 3.2 vision models enable you to solve multiple use cases, including document understanding, math and logic puzzle solving, entity extraction, and caption generation. These capabilities can drive productivity in a number of enterprise use cases, including ecommerce (retail), marketing, and much more.

To learn more about Llama 3.2 features and capabilities in Amazon Bedrock, refer to the launch post, product page, and documentation. To learn more about using Llama 3.2 in SageMaker JumpStart, see the launch post, and for more information about using foundation models in SageMaker JumpStart, check out the product page and documentation.

We can’t wait to see what you build with the Llama 3.2 models on AWS!


About the Authors

Dr. Natarajan Chennimalai Kumar is a Principal Solutions Architect in the 3rd Party Model Provider team at AWS, working closely with the Llama partner engineering team at Meta to enable AWS customers to use Llama models. He holds a PhD from the University of Illinois at Urbana-Champaign. He is based in the Bay Area in California. Outside of work, he enjoys watching shows with his kids, playing tennis, and traveling with his family.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the outdoors with his wife.

Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyperscale on AWS. As a member of the 3rd Party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a Global Lead for the Meta – AWS Partnership and technical strategy. Based in Seattle, WA, Marco enjoys writing, reading, exercising, and building applications in his free time.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Read More

Research Focus: Week of September 23, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons

Time-series forecasting is a technique used to predict future values based on previously observed data points over time. It has extensive applications for traffic flow, renewable energy, retail, finance, and climate, among other uses. For these applications, it is crucial to provide forecasts across different prediction horizons, addressing both short- and long-term planning needs. Many decision-making processes also require not only point forecasts to quantify planning efficiency but also robust distributional estimations to manage uncertainty effectively. 

Delivering precise point and distributional forecasts across a spectrum of prediction horizons is a significant challenge. Prior research on developing deep learning models for time-series forecasting has often concentrated on isolated aspects, such as long-term point forecasting or short-term probabilistic estimations. This may result in skewed methodological choices and hinder the adaptability of these models to uncharted scenarios. While there is a rising trend in developing universal forecasting models, a thorough understanding of their advantages and drawbacks is still lacking.  

In a recent paper: ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons, researchers from Microsoft and external collaborators present a platform to evaluate these fundamental forecasting needs and to conduct a rigorous comparative analysis of related recent studies. They examine the latest models for universal time-series forecasting and discover that their analyses of methodological strengths and weaknesses are also applicable to these universal models. They then outline the limitations inherent in current research and underscore several avenues for future exploration. 


SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Information retrieval (IR) involves identifying and retrieving recorded data that is relevant to an information need. Large-scale test collections play a crucial role in IR research. However, existing IR research studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments – a time-intensive and expensive process. Recent studies have shown the strong capability of large language models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost.

In a recent paper: SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval, researchers from Microsoft and external colleagues address the lack of a large-scale ad hoc document retrieval dataset. They extend the TREC Deep Learning Track test collection with additional synthetic labels generated by language models, enabling researchers to test and evaluate their search systems at a large scale. The test collection includes more than 1,900 test queries from previous tracks. The researchers compare system evaluation with past human labels and show that their synthetically created large-scale test collection can lead to highly correlated system rankings.


Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling

LLMs are used for a wide variety of tasks and scenarios, such as chat, question answering, code generation, summarization and reasoning. These tasks exhibit variations in their input and output characteristics. Requests for different tasks with distinct input and output characteristics are often served concurrently at a single model instance, which can lead to spikes in end-to-end latency, time to generate the first token, and time between tokens (in the case of a streaming request). Understanding the interplay between requests of different characteristics is important for optimizing the end-to-end performance during LLM inference.

In a recent preprint, Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling, researchers from Microsoft propose a heuristic-guided reinforcement learning-based intelligent router for data-driven and workload-aware scheduling. This router leverages a trainable response-length predictor, and a novel formulation for estimating the impact of mixing different workloads to schedule queries across LLM instances and achieve over 11% lower end-to-end latency than existing approaches.


INTERNSHIP OPPORTUNITY

Apply now: Microsoft Research Undergrad Internship Program – Summer 2025

The Microsoft Research Undergrad Internship Program offers 12-week internships in Redmond, Washington; New York City; or Cambridge, Massachusetts, for rising college juniors and seniors who are passionate about technology and champion diversity and inclusion.

Come work alongside world-class researchers on state-of-the-art projects. Participants will collaborate with an extended network of visiting faculty, postdoctoral researchers, data and applied scientists, engineers, designers, and doctoral students to make important contributions to new and ongoing research. On-the-job learning will be augmented with mentoring, community building, and networking opportunities. Candidates from groups currently underrepresented in engineering and computer science are strongly encouraged to apply.

Applications will be accepted until October 21, 2024. Apply now!

The post Research Focus: Week of September 23, 2024 appeared first on Microsoft Research.

Read More

Decoding How AI Can Accelerate Data Science Workflows

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for RTX workstation and PC users.

Across industries, AI is driving innovation and enabling efficiencies — but to unlock its full potential, the technology must be trained on vast amounts of high-quality data.

Data scientists play a key role in preparing this data, especially in domain-specific fields where specialized, often proprietary data is essential to enhancing AI capabilities.

To help data scientists with increasing workload demands, NVIDIA announced that RAPIDS cuDF, a library that allows users to more easily work with data, accelerates the pandas software library with zero code changes. Pandas is a flexible, powerful and popular data analysis and manipulation library for the Python programming language. With cuDF, data scientists can now use their preferred code base without compromising on data processing speed.

NVIDIA RTX AI hardware and technologies can also deliver data processing speedups. They include powerful GPUs that deliver the computational performance necessary to quickly and efficiently accelerate AI at every level — from data science workflows to model training and customization on PCs and workstations.

The Data Science Bottleneck

The most common data format is tabular data, which is organized in rows and columns. Smaller datasets can be managed with spreadsheet tools like Excel. However, datasets and modeling pipelines with tens of millions of rows typically rely on dataframe libraries in programming languages like Python.

Python is a popular choice for data analysis, primarily because of the pandas library, which features an easy-to-use application programming interface (API). However, as dataset sizes grow, pandas struggles with processing speed and efficiency in CPU-only systems. The library also notoriously struggles with text-heavy datasets, which is an important data type for large language models.

When data requirements outgrow pandas’ capabilities, data scientists are faced with a dilemma: endure slow processing timelines or take the complex and costly step of switching to more efficient but less user-friendly tools.

Accelerating Preprocessing Pipelines With RAPIDS cuDF 

RAPIDS cuDF speeds the popular pandas library up to 100x on RTX-powered AI PCs and workstations.

With RAPIDS cuDF, data scientists can use their preferred code base without sacrificing processing speed.

RAPIDS is an open-source suite of GPU-accelerated Python libraries designed to improve data science and analytics pipelines. cuDF is a GPU DataFrame library that provides a pandas-like API for loading, filtering and manipulating data.

Using cuDF’s “pandas accelerator mode,” data scientists can run their existing pandas code on GPUs to take advantage of powerful parallel processing, with the assurance that the code will switch to CPUs when necessary. This interoperability delivers advanced, reliable performance.

The latest release of cuDF supports larger datasets and billions of rows of tabular text data. This allows data scientists to use pandas code to preprocess data for generative AI use cases.

Accelerating Data Science on NVIDIA RTX-Powered AI Workstations and PCs

According to a recent study, 57% of data scientists use local resources such as PCs, desktops or workstations for data science.

Data scientists can achieve significant speedups starting with the NVIDIA GeForce RTX 4090 GPU. As datasets grow and processing becomes more memory-intensive, they can use cuDF to deliver up to 100x better performance with NVIDIA RTX 6000 Ada Generation GPUs in workstations, compared with traditional CPU-based solutions.

A chart shows that cuDF.pandas takes single-digit seconds, compared with multiple minutes for traditional pandas, to run the same operation.
Two common data science operations — “join” and “groupby” — are on the y-axis, while the x-axis shows the time it took to run each operation.

Data scientists can easily get started with RAPIDS cuDF on NVIDIA AI Workbench. This free developer environment manager powered by containers enables data scientists and developers to create, collaborate and migrate AI and data science workloads across GPU systems. Users can get started with several example projects available on the NVIDIA GitHub repository, such as the cuDF AI Workbench project.

cuDF is also available by default on HP AI Studio, a centralized data science platform designed to help AI developers seamlessly replicate their development environment from workstations to the cloud. This allows them to set up, develop and collaborate on projects without managing multiple environments.

The benefits of cuDF on RTX-powered AI PCs and workstations extend beyond raw performance speedups. It also:

  • Saves time and money with fixed-cost local development on powerful GPUs that replicates seamlessly to on-premises servers or cloud instances.
  • Enables faster data processing for quicker iterations, allowing data scientists to experiment, refine and derive insights from datasets at interactive speeds.
  • Delivers more impactful data processing for better model outcomes further down the pipeline.

Learn more about RAPIDS cuDF.

A New Era of Data Science

As AI and data science continue to evolve, the ability to rapidly process and analyze massive datasets will become a key differentiator to enable breakthroughs across industries. Whether for developing sophisticated machine learning models, conducting complex statistical analyses or exploring generative AI, RAPIDS cuDF provides the foundation for next-generation data processing.

NVIDIA is expanding that foundation by adding support for the most popular dataframe tools, including Polars, one of the fastest-growing Python libraries, which significantly accelerates data processing compared with other CPU-only tools out of the box.

Polars announced this month the open beta of the Polars GPU Engine, powered by RAPIDS cuDF. Polars users can now boost the performance of the already lightning-fast dataframe library by up to 13x.

Endless Possibilities for Tomorrow’s Engineers With RTX AI

NVIDIA GPUs — whether running in university data centers, GeForce RTX laptops or NVIDIA RTX workstations — are accelerating studies. Students in data science fields and beyond are enhancing their learning experience and gaining hands-on experience with hardware used widely in real-world applications.

Learn more about how NVIDIA RTX PCs and workstations help students level up their studies with AI-powered tools.

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.


CUDA-Free Inference for LLMs


PyTorch Native Architecture Optimization: torchao

By Team PyTorch

We’re happy to officially launch torchao, a PyTorch native library that makes models faster and smaller by leveraging low bit dtypes, quantization and sparsity. torchao is an accessible toolkit of techniques written (mostly) in easy to read PyTorch code spanning both inference and training. This blog will help you pick which techniques matter for your workloads.

We benchmarked our techniques on popular GenAI models like Llama 3 and diffusion models and saw minimal drops in accuracy. Unless otherwise noted, the baselines are bf16 run on an A100 80GB GPU.

Our topline metrics for Llama 3 are:

For inference

  • 97% speedup for Llama 3 8B using autoquant with int4 weight only quantization and hqq
  • 73% peak VRAM reduction for Llama 3.1 8B at 128K context length with a quantized KV cache

For training

  • 50% speedup for Llama 3 70B pretraining using float8 training on H100
  • 30% peak VRAM reduction for Llama 3 8B using 4 bit quantized optimizers.

Our topline metrics for diffusion model inference are:

  • 53% speedup using float8 dynamic quantization inference with float8 row-wise scaling on flux1.dev on H100
  • 50% reduction in model VRAM for CogVideoX using int8 dynamic quantization

Below we’ll walk through some of the techniques available in torchao you can apply to your models for inference and training.

Inference

Our inference quantization algorithms work over arbitrary PyTorch models that contain nn.Linear layers. Weight-only and dynamic activation quantization for various dtypes and sparse layouts can be chosen using our top-level quantize_ API:

from torchao.quantization import (
quantize_,
int4_weight_only,
)
quantize_(model, int4_weight_only())
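
To make concrete what weight-only quantization stores, here is a minimal pure-Python sketch of asymmetric 4-bit quantization for one group of weights. It is an illustration of the general idea only, not torchao's kernel or packing format; the helper names `quantize_group_int4` and `dequantize_group_int4` and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_group_int4(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of float weights to 4-bit codes in [0, 15].

    Returns the codes plus the (scale, minimum) needed to dequantize.
    Hypothetical helper for illustration; not a torchao function.
    """
    w_min, w_max = min(weights), max(weights)
    # 4 bits give 16 levels, so the range is divided into 15 steps.
    # Fall back to 1.0 when every weight in the group is identical.
    scale = (w_max - w_min) / 15 or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group_int4(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale + w_min for c in codes]


weights = [-0.6, -0.2, 0.0, 0.15, 0.3, 0.9]
codes, scale, w_min = quantize_group_int4(weights)
restored = dequantize_group_int4(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half a quantization step, which is why smaller groups (each with its own scale) generally preserve accuracy better at the cost of storing more scales.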

Sometimes quantizing a layer can make it slower because of overhead so if you’d rather we just pick how to quantize each layer in a model for you then you can instead run

model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

The quantize_ API has a few different options depending on whether your model is compute bound or memory bound.

from torchao.quantization import (
    # Memory bound models
    int4_weight_only,
    int8_weight_only,

    # Compute bound models
    int8_dynamic_activation_int8_semi_sparse_weight,
    int8_dynamic_activation_int8_weight,

    # Device capability 8.9+
    float8_weight_only,
    float8_dynamic_activation_float8_weight,
)
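
One rough way to judge whether a linear layer is memory bound or compute bound is a roofline-style estimate: compare the layer's arithmetic intensity (FLOPs per byte moved) with the hardware's ratio of peak compute to memory bandwidth. The sketch below is a back-of-the-envelope aid, not part of torchao. The A100 figures (about 312 TFLOP/s dense bf16 and about 2.0 TB/s of bandwidth), the 4096x4096 layer size and the function names are assumptions chosen for illustration.

```python
def linear_arithmetic_intensity(
    batch_tokens: int,
    in_features: int,
    out_features: int,
    weight_bytes: float = 2.0,  # bf16 weights
    act_bytes: float = 2.0,     # bf16 activations
) -> float:
    """FLOPs per byte moved for y = x @ W^T, with x of shape (batch_tokens, in_features)."""
    flops = 2 * batch_tokens * in_features * out_features
    bytes_moved = (
        in_features * out_features * weight_bytes   # read the weight matrix
        + batch_tokens * in_features * act_bytes    # read the input activations
        + batch_tokens * out_features * act_bytes   # write the outputs
    )
    return flops / bytes_moved


# Assumed hardware figures for an A100 80GB (illustrative, not measured).
PEAK_FLOPS = 312e12
PEAK_BANDWIDTH = 2.0e12
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs per byte needed to saturate compute


def regime(batch_tokens: int, in_features: int = 4096, out_features: int = 4096) -> str:
    """Classify a linear layer as memory bound or compute bound under the assumptions above."""
    intensity = linear_arithmetic_intensity(batch_tokens, in_features, out_features)
    return "compute bound" if intensity >= RIDGE_POINT else "memory bound"


decode = regime(batch_tokens=1)      # single-token decoding
prefill = regime(batch_tokens=4096)  # long prompt or large batch
print("decode:", decode)
print("prefill:", prefill)
```

Under these assumptions, single-token decoding is dominated by reading weights, which is where weight-only quantization helps, while large-batch or prefill workloads are dominated by matrix math, where dynamic activation quantization is the better fit.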

We also have extensive benchmarks on diffusion models in collaboration with the HuggingFace diffusers team in diffusers-torchao, where we demonstrated a 53.88% speedup on Flux.1-Dev and a 27.33% speedup on CogVideoX-5b.

Our APIs are composable. For example, we’ve composed sparsity and quantization to bring a 5% speedup for ViT-H inference.

We can also quantize weights to int4 and the KV cache to int8 to support Llama 3.1 8B at the full 128K context length running in under 18.9GB of VRAM.
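
The arithmetic behind the KV cache saving can be sketched in a few lines. The model shape below (32 layers, 8 key/value heads, head dimension 128) is taken from the publicly released Llama 3.1 8B configuration and should be treated as an assumption here. The estimate ignores quantization scales, activations and framework overhead, so it is not meant to reproduce the 18.9GB figure exactly.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: float,
) -> float:
    """Size of the KV cache for one sequence.

    Each layer stores two tensors (keys and values),
    each of shape seq_len x n_kv_heads x head_dim.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed Llama 3.1 8B shape (from the public model config).
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)
CONTEXT = 128 * 1024  # 128K tokens

bf16_cache_gib = kv_cache_bytes(CONTEXT, bytes_per_value=2, **LLAMA_31_8B) / GIB
int8_cache_gib = kv_cache_bytes(CONTEXT, bytes_per_value=1, **LLAMA_31_8B) / GIB

# Roughly 8 billion parameters at 4 bits each, ignoring group scales.
int4_weights_gib = 8e9 * 0.5 / GIB

print(f"bf16 KV cache: {bf16_cache_gib:.1f} GiB")
print(f"int8 KV cache: {int8_cache_gib:.1f} GiB")
print(f"int4 weights:  {int4_weights_gib:.1f} GiB")
```

At full context length the bf16 cache alone is several times larger than the int4 weights, which is why quantizing the cache matters more than quantizing the weights for long-context inference.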

QAT

Post-training quantization, especially at less than 4 bits, can suffer from serious accuracy degradation. Using Quantization Aware Training (QAT), we’ve managed to recover up to 96% of the accuracy degradation on hellaswag. We’ve integrated this as an end-to-end recipe in torchtune with a minimal tutorial.

Training

Low precision compute and communications

torchao provides easy-to-use e2e workflows for reducing the precision of training compute and distributed communications, starting with float8 for `torch.nn.Linear` layers. Here is a one-liner to convert the compute gemms of your training run to float8:

from torchao.float8 import convert_to_float8_training
convert_to_float8_training(model)
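
Float8 has a very narrow dynamic range, so values are typically rescaled before the cast. The pure-Python sketch below shows only that scaling step for per-tensor dynamic scaling: pick a scale from the tensor's largest magnitude so that it maps onto the float8 maximum. It assumes the e4m3 format, whose largest finite value is 448, and it does not emulate float8 rounding. The function names are hypothetical, and the exact scaling recipe torchao uses may differ.

```python
from typing import List

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def dynamic_scale(values: List[float]) -> float:
    """Scale that maps the largest magnitude in `values` onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0


def scale_and_clamp(values: List[float], scale: float) -> List[float]:
    """Apply the scale and saturate to the representable float8 range.

    A real implementation would cast to float8 after this step and keep
    the scale so the matmul result can be rescaled afterwards.
    """
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]


activations = [0.002, -0.015, 0.03, -0.0007]
scale = dynamic_scale(activations)
scaled = scale_and_clamp(activations, scale)

print("scale:", scale)
print("scaled values:", scaled)
```

Because the scale is recomputed from the current tensor on every step, small activations or gradients use the full float8 range instead of underflowing to zero.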

For an e2e example of how to speed up Llama 3 70B pretraining by up to 1.5x with float8, see our README, and torchtitan’s blog and float8 recipe.

Performance and accuracy of float8 pretraining of Llama 3 70B vs. bfloat16


(source: https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359)

We are expanding our training workflows to more dtypes and layouts:

  1. NF4 QLoRA in torchtune
  2. Prototype int8 training support
  3. Accelerated sparse 2:4 training
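
For readers unfamiliar with the 2:4 pattern in the last item: in every contiguous group of four weights, at most two are nonzero, which is the layout that sparse tensor cores on recent NVIDIA GPUs can accelerate. A minimal magnitude-based pruning sketch in pure Python follows. The helper `prune_2_4` is hypothetical and only illustrates the pattern; it is not how torchao selects or stores sparse weights.

```python
from typing import List


def prune_2_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude entries in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_4(row)
print(sparse_row)
```

Half of the values in each group are zero by construction, so the hardware can skip them and store only the two kept values plus small position metadata.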

Low bit Optimizers

Inspired by Bits and Bytes, we’ve also added prototype support for 8-bit and 4-bit optimizers as a drop-in replacement for AdamW.

from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit
optim = AdamW8bit(model.parameters())
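
A quick estimate shows why the optimizer state is worth quantizing. AdamW keeps two moment estimates per parameter, so the state size scales with twice the parameter count. The sketch below assumes roughly 8 billion parameters and ignores the per-block scales that quantized optimizers also store, so the numbers are approximate upper-level figures rather than measured VRAM.

```python
def adamw_state_bytes(n_params: int, bits_per_state: int) -> float:
    """Bytes used by AdamW's two moment estimates at a given precision."""
    return n_params * 2 * bits_per_state / 8


N_PARAMS = 8_000_000_000  # assumed size of an 8B-parameter model

state_gb = {
    bits: adamw_state_bytes(N_PARAMS, bits) / 1e9
    for bits in (32, 8, 4)
}

for bits, gigabytes in state_gb.items():
    print(f"{bits:>2}-bit optimizer state: {gigabytes:.0f} GB")
```

The weights, gradients and activations are unchanged, so the reduction in peak VRAM for a full training run is smaller than the reduction in optimizer state alone.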

Integrations

We’ve been actively working on making sure torchao works well in some of the most important projects in open source.

  1. Huggingface transformers as an inference backend
  2. In diffusers-torchao as a reference implementation for accelerating diffusion models
  3. In HQQ for fast 4 bit inference
  4. In torchtune for PyTorch native QLoRA and QAT recipes
  5. In torchchat for post training quantization
  6. In SGLang for int4 and int8 post training quantization


Conclusion

If you’re interested in making your models faster and smaller for training or inference, we hope you’ll find torchao useful and easy to integrate.

pip install torchao

There are a lot of things we’re excited about next, ranging from going lower than 4 bit, performant kernels for high-throughput inference, and expanding to more layers, scaling types or granularities, to MX hardware support and supporting more hardware backends. If any of the above sounds exciting, you can follow our progress at: https://github.com/pytorch/ao

If you’re interested in working on torchao, we’ve created a contributors guide, and if you have any questions we hang out on the #torchao channel on discord.gg/cudamode

Acknowledgements

We are fortunate to stand on the shoulders of giants and collaborate with some of the best people in open source. Thank you!

  1. Bits and Bytes for pioneering work in low bit optimizers and QLoRA
  2. Answer.ai for their engineering work to get FSDP and QLoRA composing
  3. Mobius Labs for the lovely back and forths on quantization algorithms and low bit kernels
  4. HuggingFace transformers for their help in battle testing and integrating our work
  5. HuggingFace diffusers for our collaboration on extensive benchmarks and best practices
  6. torch.compile so we could write our algorithms in pure PyTorch
  7. CUDA MODE for most of our early contributors
