Accelerate your financial statement analysis with Amazon Bedrock and generative AI

The financial and banking industry can significantly enhance investment research by integrating generative AI into daily tasks like financial statement analysis. By taking advantage of advanced natural language processing (NLP) capabilities and data analysis techniques, you can streamline common tasks like these in the financial industry:

  • Automating data extraction – The manual data extraction process to analyze financial statements can be time-consuming and prone to human errors. Generative AI models can automate finding and extracting financial data from documents like 10-Ks, balance sheets, and income statements. Foundation models (FMs) are trained to identify and extract relevant information like expenses, revenue, and liabilities.
  • Trend analysis and forecasting – Identifying trends and forecasting requires domain expertise and advanced mathematics, which limits the ability of individuals to run ad hoc reporting and creates dependencies on a small subset of employees within an organization. Generative AI applications can analyze financial data, identify trends and patterns, and forecast future financial performance, all without manual intervention from an analyst. Removing the manual analysis step and allowing the generative AI model to build a report analyzing trends in the financial statement can increase the organization’s agility to make quick market decisions.
  • Financial reporting statements – Writing detailed financial analysis reports manually can be time-consuming and resource intensive. Dedicated resources to generate financial statements can create bottlenecks within the organization, requiring specialized roles to handle the translation of financial data into a consumable narrative. FMs can summarize financial statements, highlighting key metrics found through trend analysis and providing insights. An automated report writing process not only provides consistency and speed, but minimizes resource constraints in the financial reporting process.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using AWS tools without having to manage infrastructure.

In this post, we demonstrate how to deploy a generative AI application that can accelerate your financial statement analysis on AWS.

Solution overview

Building a generative AI application with Amazon Bedrock to analyze financial statements involves a series of steps, from setting up the environment to deploying the model and integrating it into your application.

The following diagram illustrates an example solution architecture using AWS services.

  

The workflow consists of the following steps:

  1. The user interfaces with a web or mobile application, where they upload financial documents.
  2. Amazon API Gateway manages and routes the incoming request from the UI.
  3. An AWS Lambda function is invoked when new documents are added to the Amazon Simple Storage Service (Amazon S3) bucket (see the event notification sketch after this list).
  4. Amazon Bedrock analyzes the documents stored in Amazon S3. The analysis results are returned to the S3 bucket through a Lambda function and stored there.
  5. Amazon DynamoDB provides a fast, scalable way to store and retrieve metadata and analysis results to display to users.
  6. Amazon Simple Notification Service (Amazon SNS) sends notifications about the status of document processing to the application user.
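
As an illustration of how step 3 can be wired up, the following sketch (with placeholder bucket and function names, and not part of the deployed stack described above) configures an Amazon S3 event notification so that newly uploaded documents invoke the Lambda function:

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# Placeholder names; substitute the bucket and function from your deployment
BUCKET = "my-financial-documents-bucket"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:analyze-statement"

# Allow Amazon S3 to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{BUCKET}",
)

# Invoke the function whenever a new document lands under the uploads/ prefix
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": FUNCTION_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "uploads/"}]}},
            }
        ]
    },
)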

In the following sections, we discuss the key considerations in each step to build and deploy a generative AI application.

Prepare the data

Gather the financial statements you want to analyze. These can be balance sheets, income statements, cash flow statements, and so on. Make sure the data is clean and in a consistent format. You might need to preprocess the data to remove noise and standardize the format. Preprocessing the data will transform the raw data into a state that can be efficiently used for model training. This is often necessary due to messiness and inconsistencies in real-world data. The outcome is to have consistent data for the model to ingest. The two most common types of data preprocessing are normalization and standardization.

Normalization rescales the numerical columns of a dataset to a common scale, typically 0–1, so that no single feature dominates because of its units or magnitude. When dealing with a significant amount of data, normalizing the dataset can improve the performance of a machine learning model in environments where the feature distribution is unclear.

Standardization rescales the values of a dataset so that they have the characteristics of a standard normal distribution: a mean of 0 and a standard deviation of 1. This makes the data simpler to process, analyze, and compare across features, and easier to store consistently in a database. Standardization is beneficial when the feature distribution is consistent (approximately Gaussian) and values aren’t constrained within a particular range.
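
As a brief illustration of the two techniques, the following sketch uses hypothetical figures rather than data from an actual statement:

import pandas as pd

# Hypothetical financial figures extracted from statements
df = pd.DataFrame({
    "revenue": [120_000, 340_000, 95_000, 410_000],
    "expenses": [80_000, 210_000, 70_000, 260_000],
})

# Normalization (min-max scaling): rescales each column to the 0-1 range
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): rescales each column to mean 0 and standard deviation 1
standardized = (df - df.mean()) / df.std()

print(normalized.round(3))
print(standardized.round(3))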

Choose your model

Amazon Bedrock gives you the power of choice by providing a flexible and scalable environment that allows you to access and use multiple FMs from leading AI model providers. This flexibility enables you to select the most appropriate models for your specific use cases, whether you’re working on tasks like NLP, text generation, image generation, or other AI-driven applications.

Deploy the model

If you don’t already have access to Amazon Bedrock FMs, you’ll need to request access through the Amazon Bedrock console. Then you can use the Amazon Bedrock console to deploy the chosen model. Configure the deployment settings according to your application’s requirements.

Develop the backend application

Create a backend service to interact with the deployed model. This service will handle requests from the frontend, send data to the model, and process the model’s responses. You can use Lambda, API Gateway, or other preferred REST API endpoints.

Use the Amazon Bedrock API to send financial statements to the model and receive the analysis results.

The following is an example of the backend code.
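
This sketch assumes a Lambda handler that receives the S3 bucket and key of an uploaded statement, retrieves the document text, and invokes an Anthropic Claude model through the Amazon Bedrock Runtime API; the event fields, prompt, and model ID are placeholders to adapt to your deployment.

import json

import boto3

s3 = boto3.client("s3")
bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder model ID; use a model you have enabled in Amazon Bedrock
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def lambda_handler(event, context):
    # Assumes the event carries the bucket and key of the uploaded statement
    bucket = event["bucket"]
    key = event["key"]

    # Read the financial statement text from Amazon S3
    statement_text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Ask the model to summarize key metrics and trends
    request_body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": "Analyze this financial statement and summarize revenue, "
                       "expenses, liabilities, and notable trends:\n\n" + statement_text,
        }],
    })

    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=request_body,
        accept="application/json",
        contentType="application/json",
    )
    analysis = json.loads(response["body"].read())["content"][0]["text"]

    return {"statusCode": 200, "body": json.dumps({"analysis": analysis})}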

Develop the frontend UI

Create a frontend interface for users to upload financial statements and view analysis results. This can be a web or mobile application. Make sure the frontend can send financial statement data to the backend service and display the analysis results.
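
One common pattern (an assumption here, not prescribed by the architecture above) is for the backend to hand the client a presigned Amazon S3 URL so the web or mobile application can upload the document directly to the bucket that starts the pipeline. A minimal sketch:

import boto3

s3 = boto3.client("s3")

def get_upload_url(bucket: str, key: str, expires_in: int = 300) -> str:
    """Return a presigned URL the frontend can use to PUT a financial statement
    directly into the S3 bucket that starts the analysis pipeline."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": "application/pdf"},
        ExpiresIn=expires_in,
    )

# Placeholder names; the frontend then issues an HTTP PUT of the file to this URL
print(get_upload_url("my-financial-documents-bucket", "uploads/10k-2024.pdf"))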

Conclusion

In this post, we discussed the benefits of building a generative AI application powered by Amazon Bedrock to accelerate the analysis of financial documents. Stakeholders will be able to use AWS services to deploy and manage LLMs that help improve the efficiency of pulling insights from common documents like 10-Ks, balance sheets, and income statements.

For more information on working with generative AI on AWS, visit the AWS Skill Builder generative AI training modules.

For instructions on building frontend applications and full-stack applications powered by Amazon Bedrock, refer to Front-End Web & Mobile on AWS and Create a Fullstack, Sample Web App powered by Amazon Bedrock.


About the Author

Jason D’Alba is an AWS Solutions Architect leader focused on enterprise applications, helping customers architect highly available and scalable data and AI solutions.

Multilingual content processing using Amazon Bedrock and Amazon A2I

The market size for multilingual content extraction and the gathering of relevant insights from unstructured documents (such as images, forms, and receipts) for information processing is rapidly increasing. The global intelligent document processing (IDP) market size was valued at $1,285 million in 2022 and is projected to reach $7,874 million by 2028 (source).

Let’s consider that you’re a multinational company that receives invoices, contracts, or other documents from various regions worldwide, in languages such as Arabic, Chinese, Russian, or Hindi. These languages might not be supported out of the box by existing document extraction software.

Anthropic’s Claude models, deployed on Amazon Bedrock, can help overcome these language limitations. These large language models (LLMs) are trained on a vast amount of data from various domains and languages. They possess remarkable capabilities in understanding and generating human-like text in multiple languages. Handling complex and sensitive documents requires accuracy, consistency, and compliance, often necessitating human oversight. Amazon Augmented AI (Amazon A2I) simplifies the creation of workflows for human review, managing the heavy lifting associated with developing these systems or overseeing a large reviewer workforce. By combining Amazon A2I and Anthropic’s Claude on Amazon Bedrock, you can build a robust multilingual document processing pipeline with improved accuracy and quality of extracted information.

To demonstrate this multilingual and validated content extraction solution, we will use generative AI on Amazon Bedrock, serverless orchestration managed by AWS Step Functions, and augmented human intelligence powered by Amazon A2I.

Solution overview

This post outlines a custom multilingual document extraction and content assessment framework using a combination of Anthropic’s Claude 3 on Amazon Bedrock and Amazon A2I to incorporate human-in-the-loop capabilities. The key steps of the framework are as follows:

  • Store documents of different languages
  • Invoke a processing flow that extracts data from the document according to a given schema
  • Pass extracted content to human reviewers to validate the information
  • Convert validated content into an Excel format and store in a storage layer for use

This framework can be further expanded by parsing the content to a knowledge base, indexing the information extracted from the documents, and creating a knowledge discovery tool (Q&A assistant) to allow users to query information and extract relevant insights.

Document processing stages

Our reference solution uses a highly resilient pipeline, as shown in the following diagram, to coordinate the various document processing stages.

The document processing stages are:

  1. Acquisition – The first stage of the pipeline acquires input documents from Amazon Simple Storage Service (Amazon S3). In this stage, we store initial document information in an Amazon DynamoDB table after receiving an Amazon S3 event notification. We use this table to track the progression of this document across the entire pipeline.
  2. Extraction – A document schema definition is used to formulate the prompt and documents are embedded into the prompt and sent to Amazon Bedrock for extraction. Results are stored as JSON in a folder in Amazon S3.
  3. Custom business rules – Custom business rules are applied to the reshaped output containing information about tables in the document. Custom rules might include table format detection (such as detecting that a table contains invoice transactions) or column validation (such as verifying that a product code column only contains valid codes).
  4. Reshaping – JSON extracted in the previous step is reshaped in the format supported by Amazon A2I and prepared for augmentation.
  5. Augmentation – Human annotators use Amazon A2I to review the document and augment it with any information that was missed.
  6. Cataloging – Documents that pass human review are cataloged into an Excel workbook so your business teams can consume them.

A custom UI built with ReactJS is provided to human reviewers to intuitively and efficiently review and correct issues in the documents.

Extraction with a multi-modal language model

The architecture uses a multi-modal LLM to extract data from various multilingual documents. We specifically used the Rhubarb Python framework to extract JSON schema-based data from the documents. Rhubarb is a lightweight Python framework built from the ground up to enable document understanding tasks using multi-modal LLMs. It uses Amazon Bedrock through the Boto3 API to access Anthropic’s Claude 3 multi-modal language models, but makes it straightforward to use file formats that are otherwise not supported by Anthropic’s Claude models. As of writing, Anthropic’s Claude 3 models only support image formats (JPEG, PNG, and GIF). This means that when dealing with documents in PDF or TIF format, the document must be converted to a compatible image format. This process is taken care of by the Rhubarb framework internally, making our code simpler.

Additionally, Rhubarb comes with built-in system prompts that ground the model responses to be in a defined format using the JSON schema. A predefined JSON schema can be provided to the Rhubarb API, which makes sure the LLM generates data in that specific format. Internally, Rhubarb also does re-prompting and introspection to rephrase the user prompt in order to increase the chances of successful data extraction by the model. We used the following JSON schema for the purposes of extracting data from our documents:

{
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The unique identifier for the invoice"
        },
        "issue_date": {
            "type": "string",
            "description": "The date the invoice was issued"
        },
        "due_date": {
            "type": "string",
            "description": "The date the payment for the invoice is due"
        },
        "issuer": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "The name of the company or entity issuing the invoice"
                },
                "address": {
                    "type": "string",
                    "description": "The address of the issuing company or entity"
                },
                "identifier": {
                    "type": "string",
                    "description": "The identifier of the issuing company or entity"
                }
            },
            "required": [
                "name",
                "address",
                "identifier"
            ]
        },
        "recipient": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "The name of the company or entity receiving the invoice"
                },
                "address": {
                    "type": "string",
                    "description": "The address of the receiving company or entity"
                },
                "identifier": {
                    "type": "string",
                    "description": "The identifier of the receiving company or entity"
                }
            },
            "required": [
                "name",
                "address",
                "identifier"
            ]
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The identifier for the product or service"
                    },
                    "description": {
                        "type": "string",
                        "description": "A description of the product or service"
                    },
                    "quantity": {
                        "type": "number",
                        "description": "The quantity of the product or service"
                    },
                    "unit_price": {
                        "type": "number",
                        "description": "The price per unit of the product or service"
                    },
                    "discount": {
                        "type": "number",
                        "description": "The discount applied to the unit price"
                    },
                    "discounted_price": {
                        "type": "number",
                        "description": "The price per unit after discount"
                    },
                    "tax_rate": {
                        "type": "number",
                        "description": "The tax rate applied to the unit price"
                    },
                    "total_price": {
                        "type": "number",
                        "description": "The total price for the line item (quantity * unit_price)"
                    }
                },
                "required": [
                    "product_id",
                    "description",
                    "quantity",
                    "unit_price",
                    "discount",
                    "discounted_price",
                    "tax_rate",
                    "total_price"
                ]
            }
        },
        "totals": {
            "type": "object",
            "properties": {
                "subtotal": {
                    "type": "number",
                    "description": "The total of all line item prices before taxes and fees"
                },
                "discount": {
                    "type": "number",
                    "description": "The total discount applied"
                },
                "tax": {
                    "type": "number",
                    "description": "The amount of tax applied to the subtotal"
                },
                "total": {
                    "type": "number",
                    "description": "The total amount due for the invoice after taxes and fees"
                }
            },
            "required": [
                "subtotal",
                "discount",
                "tax",
                "total"
            ]
        }
    },
    "required": [
        "invoice_number",
        "issue_date",
        "due_date",
        "issuer",
        "recipient",
        "line_items",
        "totals"
    ]
}

Rhubarb supports a number of other features; for example, document classification, summarization, page-wise extraction, Q&A, streaming chat and summaries, named entity recognition, and more. Visit the Rhubarb documentation to learn more about using it for various document understanding tasks.
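
As an illustration, the following sketch shows how a document and the preceding JSON schema could be passed to Rhubarb for schema-based extraction. The DocAnalysis class and output_schema parameter follow the Rhubarb documentation at the time of writing, and the file path and prompt are placeholders, so treat this as an approximation rather than the exact code used in the pipeline.

import json

import boto3
from rhubarb import DocAnalysis

# Placeholder path; the schema is the JSON schema shown above
with open("invoice_schema.json") as f:
    schema = json.load(f)

session = boto3.Session()

# Rhubarb converts PDF/TIF pages to images internally and calls Anthropic's
# Claude 3 on Amazon Bedrock to extract data that conforms to the schema
da = DocAnalysis(file_path="./documents/croatianinvoice.pdf", boto3_session=session)
response = da.run(
    message="Extract the invoice fields defined by the provided schema.",
    output_schema=schema,
)
print(response)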

Prerequisites

This solution uses Amazon SageMaker labeling workforces to manage workers and distribute tasks. As a prerequisite, create a private workforce. For instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page. Create two worker teams, called primary and quality, and assign yourself to both teams.

After you add yourself to the teams and confirm your email, note the worker portal URL. To find the URL, open the AWS Management Console for SageMaker and choose Ground Truth and then Labeling workforces in the navigation pane. On the Private tab, you can find the URL for the labeling portal. This URL is also automatically emailed to the work team members as they are onboarded.

Next, install the AWS Cloud Development Kit (AWS CDK) toolkit with the following code:

npm install -g aws-cdk

Disclaimer: When installing global packages like the AWS CDK using npm, some systems, especially macOS and Linux, might require elevated permissions. If you encounter a permissions error when running npm install -g aws-cdk, you can adjust the global npm directory to avoid using sudo by following the instructions in this documentation.

Lastly, install Docker based on your operating system. Refer to the Docker documentation for installation instructions specific to your platform.

Deploy the application to the AWS Cloud

This reference solution is available on GitHub, and you can deploy it with the AWS CDK. For instructions on deploying the cloud application, see the README file in the GitHub repo.

Deploying this application to your AWS account will create various S3 buckets for document storage, AWS Lambda functions for integration with AWS machine learning (ML) services and business logic, AWS Identity and Access Management (IAM) policies, an Amazon Simple Queue Service (Amazon SQS) queue, a data processing pipeline using a Step Functions state machine, and an Amazon A2I based human review workflow.

Complete the following steps:

  1. Clone the GitHub repo.

To clone the repository, you can use either the HTTPS or SSH method depending on your environment and authentication setup:

Using HTTPS:

git clone https://github.com/aws-samples/multilingual-content-processing-with-amazon-bedrock.git

This option is generally accessible for most users who have their Git configuration set up for HTTPS.

Using SSH:

git clone git@github.com:aws-samples/multilingual-content-processing-with-amazon-bedrock.git

Make sure you have your SSH keys properly configured and added to your GitHub account to use this method.

  2. Navigate to the root directory of the repository.
cd multilingual-content-processing-with-amazon-bedrock
  3. Create a virtual environment.
python3 -m venv .venv
  4. Enter the virtual environment.
source .venv/bin/activate
  5. Install dependencies in the virtual environment.
pip install -r requirements.txt
  6. Bootstrap the AWS CDK (you only need to do this one time per account setup).
cdk bootstrap
  7. Edit the cdk.json file to add the name of the work team you created earlier. Make sure to match the work team name in the same AWS Region and account.
edit cdk.json
  8. Deploy the application.
cdk deploy --all

After you run cdk deploy --all, the AWS CloudFormation template provisions the necessary AWS resources.

Test the document processing pipeline

When the application is up and running, you’re ready to upload documents for processing and review. For this post, we use the following sample document for testing the pipeline. You can use the AWS Command Line Interface (AWS CLI) to upload the document, which will automatically invoke the pipeline.

  1. Upload the document schema.
aws s3 cp ./data/invoice_schema.json s3://mcp-store-document-<ACCOUNT-ID>/schema/
  2. Upload the documents.
aws s3 cp ./data/croatianinvoice.pdf s3://mcp-store-document-<ACCOUNT-ID>/acquire/
  3. The status of the document processing is tracked in a DynamoDB table. You can check the status on the DynamoDB console or by using the following query.
aws dynamodb query \
    --table-name mcp-table-pipeline \
    --key-condition-expression "DocumentID = :documentID" \
    --expression-attribute-values '{":documentID":{"S":"croatianinvoice.pdf"}}' \
    --output text

When the document reaches the Augment#Running stage, the extraction and business rule applications are complete, indicating that the document is ready for human review.

  4. Navigate to the portal URL that you retrieved earlier and log in to view all tasks pending human review.
  5. Choose Start working to examine the submitted document.

The interface will display the original document on the left and the extracted content on the right.

  6. When you complete your review and annotations, choose Submit.

The results will be stored as an Excel file in the mcp-store-document-<ACCOUNT-ID> S3 bucket in the /catalog folder.

The /catalog folder in your S3 bucket might take a few minutes to be created after you submit the job. If you don’t see the folder immediately, wait a few minutes and refresh your S3 bucket. This delay is normal because the folder is generated when the job is complete and the results are saved.

By following these steps, you can efficiently process, review, and store documents using a fully automated AWS Cloud-based pipeline.

Clean up

To avoid ongoing charges, clean up the entire AWS CDK environment by using the cdk destroy command. Additionally, it’s recommended to manually inspect the Lambda functions, Amazon S3 resources, and Step Functions workflow to confirm that they are properly stopped and deleted. This step is essential to avoid incurring any additional costs associated with running the AWS CDK application.

Furthermore, delete the output data that the Step Functions orchestration workflow created in the S3 buckets, and then delete the S3 buckets themselves. You must delete the data in the S3 buckets before you can delete the buckets.

Conclusion

In this post, we demonstrated an end-to-end approach for multilingual document ingestion and content extraction, using Amazon Bedrock and Amazon A2I to incorporate human-in-the-loop capabilities. This comprehensive solution enables organizations to efficiently process documents in multiple languages and extract relevant insights, while benefiting from the combined power of AWS AI/ML services and human validation.

Don’t let language barriers or validation challenges hold you back. Try this solution to take your content and insights to the next level and unlock the full potential of your data, and reach out to your AWS contact if you need further assistance. We encourage you to experiment with editing the prompts and model versions to generate outputs that align more closely with your requirements.

For further information about Amazon Bedrock, check out the Amazon Bedrock workshop. To learn more about Step Functions, see Building machine learning workflows with Amazon SageMaker Processing jobs and AWS Step Functions.


About the Authors

Marin Mestrovic is a Partner Solutions Architect at Amazon Web Services, specializing in supporting partner solutions. In his role, he collaborates with leading Global System Integrators (GSIs) and independent software vendors (ISVs) to help design and build cost-efficient, scalable, industry-specific solutions. With his expertise in AWS capabilities, Marin empowers partners to develop innovative solutions that drive business growth for their clients.

Shikhar Kwatra is a Sr. Partner Solutions Architect at Amazon Web Services, working with leading Global System Integrators. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports GSI partners in building strategic industry solutions on AWS.

Dilin Joy is a Senior Partner Solutions Architect at Amazon Web Services. In his role, he works with leading independent software vendors (ISVs) and Global System Integrators (GSIs) to provide architectural guidance and support in building strategic industry solutions on the AWS platform. His expertise and collaborative approach help these partners develop innovative cloud-based solutions that drive business success for their clients.

Anjan Biswas is a Senior AI Services Solutions Architect who focuses on computer vision, NLP, and generative AI. Anjan is part of the worldwide AI services specialist team and works with customers to help them understand and develop solutions to business problems with AWS AI Services and generative AI.

Build a reverse image search engine with Amazon Titan Multimodal Embeddings in Amazon Bedrock and AWS managed services

In ecommerce, visual search technology revolutionizes how customers find products by enabling them to search for products using images instead of text. Shoppers often have a clear visual idea of what they want but struggle to describe it in words, leading to inefficient and broad text-based search results. For example, searching for a specific red leather handbag with a gold chain using text alone can be cumbersome and imprecise, often yielding results that don’t directly match the user’s intent. By using images, visual search can directly match physical attributes, providing better results quickly and enhancing the overall shopping experience.

A reverse image search engine enables users to upload an image to find related information instead of using text-based queries. It works by analyzing the visual content to find similar images in its database. Companies such as Amazon use this technology to allow users to use a photo or other image to search for similar products on their ecommerce websites. Other companies use it to identify objects, faces, and landmarks to discover the original source of an image. Beyond ecommerce, reverse image search engines are invaluable to law enforcement for identifying illegal items for sale and identifying suspects, to publishers for validating visual content authenticity, for healthcare professionals by assisting in medical image analysis, and tackling challenges such as misinformation, copyright infringement, and counterfeit products.

In the context of generative AI, significant progress has been made in developing multimodal embedding models that can embed various data modalities—such as text, image, video, and audio data—into a shared vector space. By mapping image pixels to vector embeddings, these models can analyze and compare visual attributes such as color, shape, and size, enabling users to find similar images with specific attributes, leading to more precise and relevant search results.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. The Amazon Bedrock single API access, regardless of the models you choose, gives you the flexibility to use different FMs and upgrade to the latest model versions with minimal code changes.

Exclusive to Amazon Bedrock, the Amazon Titan family of models incorporates 25 years of experience innovating with AI and machine learning at Amazon. Amazon Titan FMs provide customers with a breadth of high-performing image, multimodal, and text model choices, through a fully managed API. With Amazon Titan Multimodal Embeddings, you can power more accurate and contextually relevant multimodal search, recommendation, and personalization experiences for users.

In this post, you will learn how to extract key objects from image queries using Amazon Rekognition and build a reverse image search engine using Amazon Titan Multimodal Embeddings from Amazon Bedrock in combination with Amazon OpenSearch Serverless.

Solution overview

The solution outlines how to build a reverse image search engine to retrieve similar images based on input image queries. This post demonstrates a guide for using Amazon Titan Multimodal Embeddings to embed images, store these embeddings in an OpenSearch Serverless vector index, and use Amazon Rekognition to extract key objects from images for querying the index.

The following diagram illustrates the solution architecture:

The steps of the solution include:

  1. Upload data to Amazon S3: Store the product images in Amazon Simple Storage Service (Amazon S3).
  2. Generate embeddings: Use Amazon Titan Multimodal Embeddings to generate embeddings for the stored images.
  3. Store embeddings: Ingest the generated embeddings into an OpenSearch Serverless vector index, which serves as the vector database for the solution.
  4. Image analysis: Use Amazon Rekognition to analyze the product images and extract labels and bounding boxes for these images. These extracted objects will then be saved as separate images, which can be used for the query.
  5. Convert search query to an embedding: Convert the user’s image search query into an embedding using Amazon Titan Multimodal Embeddings.
  6. Run similarity search: Perform a similarity search on the vector database to find product images that closely match the search query embedding.
  7. Display results: Display the top K similar results to the user.

Prerequisites

To implement the proposed solution, make sure that you have the following:

  • Model access to Amazon Titan Multimodal Embeddings in Amazon Bedrock. If you haven’t already enabled it, request model access on the Amazon Bedrock console.
  • An Amazon SageMaker Studio domain. If you haven’t set up a SageMaker Studio domain, see this Amazon SageMaker blog post for instructions on setting up SageMaker Studio for individual users.
  • An Amazon OpenSearch Serverless collection. You can create a vector search collection by following the steps in Create a collection with public network access and data access granted to the Amazon SageMaker Notebook execution role principal.
  • The GitHub repo cloned to the Amazon SageMaker Studio instance. To clone the repo onto your SageMaker Studio instance, choose the Git icon on the left sidebar and enter https://github.com/aws-samples/reverse-image-search-engine.git
  • After it has cloned, you can navigate to the reverse-image-search-engine.ipynb notebook file and run the cells. This post highlights the important code segments; however, the full code can be found in the notebook.
  • The necessary permissions attached to the Amazon SageMaker notebook execution role to grant read and write access to the Amazon OpenSearch Serverless collection. For more information on managing credentials securely, see the AWS Boto3 documentation. Make sure that full access is granted to the SageMaker execution role by applying the following IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "aoss:*",
            "Resource": "*"
        }
    ]
}

Upload the dataset to Amazon S3

In this solution, we will use the Shoe Dataset from Kaggle.com, which contains a collection of approximately 1,800 shoe images. The dataset is primarily used for image classification use cases and contains images of shoes from six main categories—boots, sneakers, flip flops, loafers, sandals, and soccer shoes—with 249 JPEG images for each shoe type. For this tutorial, you will concentrate on the loafers folder found in the training category folder.

To upload the dataset

  1. Download the dataset: Go to the Shoe Dataset page on Kaggle.com and download the dataset file (350.79MB) that contains the images.
  2. Extract the specific folder: Extract the downloaded file and navigate to the loafers category within the training folder.
  3. Create an Amazon S3 bucket: Sign in to the Amazon S3 console, choose Create bucket, and follow the prompts to create a new S3 bucket.
  4. Upload images to the Amazon S3 bucket using the AWS CLI: Open your terminal or command prompt and run the following command to upload the images from the loafers folder to the S3 bucket:
    aws s3 cp </path/to/local/folder> s3://<your-bucket-name>/ --recursive

Replace </path/to/local/folder> with the path to the loafers category folder from the training folder on your local machine. Replace <your-bucket-name> with the name of your S3 bucket. For example:
aws s3 cp /Users/username/Documents/training/loafers s3://footwear-dataset/ --recursive

  5. Confirm the upload: Go back to the S3 console, open your bucket, and verify that the images have been successfully uploaded to the bucket.

Create image embeddings

Vector embeddings represent information—such as text or images—as a list of numbers, with each number capturing specific features. For example, in a sentence, some numbers might represent the presence of certain words or topics, while in an image or video, they might represent colors, shapes, or patterns. This numerical representation, or vector, is placed in a multidimensional space called the embedding space, where distances between vectors indicate similarities between the represented information. The closer vectors are to one another in this space, the more similar the information they represent is. The following figure is an example of an image and part of its associated vector.

Example of image embedding

To convert images to vectors, you can use Amazon Titan Multimodal Embeddings to generate image embeddings, which can be accessed through Amazon Bedrock. The model will generate vector embeddings with 1,024 dimensions by default; however, you can choose a smaller dimension size to optimize for speed and performance.
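
If you choose a smaller size, the model input accepts an optional embeddingConfig block; 256, 384, and 1,024 are the output dimensions documented for Titan Multimodal Embeddings G1. The following standalone sketch (separate from the notebook code that follows) illustrates the option; verify the parameter name against the current Amazon Bedrock model documentation for your model version.

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_image(image_path: str, output_length: int = 384) -> list:
    """Generate a Titan Multimodal embedding with a reduced dimension size."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = json.dumps({
        "inputImage": image_b64,
        "embeddingConfig": {"outputEmbeddingLength": output_length},
    })
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]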

To create image embeddings:

  1. The following code segment shows how to create a function that will be used to generate embeddings for the dataset of shoe images stored in the S3 bucket.
    # Import required libraries
    import io
    import base64
    import json
    import boto3
    import pandas as pd
    from PIL import Image
    
    # Constants, change to your S3 bucket name and selected AWS region
    BUCKET_NAME = "<YOUR_AMAZON_S3_BUCKET_NAME>"
    BEDROCK_MODEL_ID = "amazon.titan-embed-image-v1"
    REGION = "<YOUR_SELECTED_AWS_REGION>"
    # Define max width and height for resizing to accommodate Bedrock limits
    MAX_WIDTH = 1024  
    MAX_HEIGHT = 1024  
    
    # Initialize AWS clients
    s3 = boto3.client('s3')
    bedrock_client = boto3.client(
        "bedrock-runtime", 
        REGION, 
        endpoint_url=f"https://bedrock-runtime.{REGION}.amazonaws.com"
    )
    
    # Function to resize image
    def resize_image(image_data):
        image = Image.open(io.BytesIO(image_data))
    
        # Resize image while maintaining aspect ratio
        image.thumbnail((MAX_WIDTH, MAX_HEIGHT))
    
        # Save resized image to bytes buffer
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        buffer.seek(0)
    
        return buffer.read()
    
    # Function to create embedding from input image
    def create_image_embedding(image):
        image_input = {}
    
        if image is not None:
            image_input["inputImage"] = image
        else:
            raise ValueError("Image input is required")
    
        image_body = json.dumps(image_input)
    
        # Invoke Amazon Bedrock with encoded image body
        bedrock_response = bedrock_client.invoke_model(
            body=image_body,
            modelId=BEDROCK_MODEL_ID,
            accept="application/json",
            contentType="application/json"
        )
    
        # Retrieve body in JSON response
        final_response = json.loads(bedrock_response.get("body").read())
    
        embedding_error = final_response.get("message")
    
        if embedding_error is not None:
            print (f"Error creating embeddings: {embedding_error}")
    
        # Return embedding value
        return final_response.get("embedding")

  2. Because you will be performing a search for similar images stored in the S3 bucket, you will also have to store the image file name as metadata for its embedding. Also, because the model expects a base64 encoded image as input, you will have to create an encoded version of the image for the embedding function. You can use the following code to fulfill both requirements.
    # Retrieve images stored in S3 bucket 
    response = s3.list_objects_v2(Bucket=BUCKET_NAME)
    contents = response.get('Contents', [])
    
    # Define arrays to hold embeddings and image file key names
    image_embeddings = []
    image_file_names = []
    
    # Loop through S3 bucket to encode each image, generate its embedding, and append to array
    for obj in contents:
        image_data = s3.get_object(Bucket=BUCKET_NAME, Key=obj['Key'])['Body'].read()
    
        # Resize the image to meet model requirements
        resized_image = resize_image(image_data)
    
        # Create base64 encoded image for Titan Multimodal Embeddings model input
        base64_encoded_image = base64.b64encode(resized_image).decode('utf-8')
    
        # Generate the embedding for the resized image
        image_embedding = create_image_embedding(image=base64_encoded_image)
        image_embeddings.append(image_embedding)
        image_file_names.append(obj["Key"])

  3. After generating embeddings for each image stored in the S3 bucket, the resulting embedding list can be obtained by running the following code
    # Add and list embeddings with associated image file key to dataframe object
    final_embeddings_dataset = pd.DataFrame({'image_key': image_file_names, 'image_embedding': image_embeddings})
    final_embeddings_dataset.head()

image_key image_embedding
image1.jpeg [0.00961759, 0.0016261627, -0.0024508594, -0.0…
image10.jpeg [0.008917685, -0.0013863152, -0.014576114, 0.0…
image100.jpeg [0.006402869, 0.012893448, -0.0053941975, -0.0…
image101.jpg [0.06542923, 0.021960363, -0.030726435, -0.000…
image102.jpeg [0.0134112835, -0.010299515, -0.0044046864, -0…

Upload embeddings to Amazon OpenSearch Serverless

Now that you have created embeddings for your images, you need to store these vectors so they can be searched and retrieved efficiently. To do so, you can use a vector database.

A vector database is a type of database designed to store and retrieve vector embeddings. Each data point in the database is associated with a vector that encapsulates its attributes or features. This makes it particularly useful for tasks such as similarity search, where the goal is to find objects that are the most similar to a given query object. To search against the database, you can use a vector search, which is performed using the k-nearest neighbors (k-NN) algorithm. When you perform a search, the algorithm computes a similarity score between the query vector and the vectors of stored objects using methods such as cosine similarity or Euclidean distance. This enables the database to retrieve the closest objects that are most similar to the query object in terms of their features or attributes. Vector databases often use specialized vector search engines, such as nmslib or faiss, which are optimized for efficient storage, retrieval, and similarity calculation of vectors.
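
As a toy illustration of these two measures (unrelated to the dataset used later):

import numpy as np

# Toy vectors standing in for two image embeddings
a = np.array([0.10, 0.80, 0.30])
b = np.array([0.15, 0.75, 0.35])

# Cosine similarity: closer to 1 means the vectors point in a more similar direction
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: smaller means the vectors are closer in the embedding space
euclidean_distance = np.linalg.norm(a - b)

print(f"cosine similarity: {cosine_similarity:.4f}")
print(f"euclidean distance: {euclidean_distance:.4f}")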

In this post, you will use OpenSearch Serverless as the vector database for the image embeddings. OpenSearch Serverless is a serverless option for OpenSearch Service, a powerful storage option built for distributed search and analytics use cases. With Amazon OpenSearch Serverless, you don’t need to provision, configure, and tune the instance clusters that store and index your data.

To upload embeddings:

  1. If you have set up your Amazon OpenSearch Serverless collection, the next step is to create a vector index. In the Amazon OpenSearch Service console, choose Serverless Collections, then select your collection.
  2. Choose Create vector index.

Create vector index in OpenSearch Collection

  3. Next, create a vector field by entering a name, defining an engine, and adding the dimensions and search configurations.
    1. Vector field name: Enter a name, such as vector.
    2. Engine: Select nmslib.
    3. Dimensions: Enter 1024.
    4. Distance metric: Select Euclidean.
    5. Choose Confirm.

  4. To tag each embedding with the image file name, you must also add a mapping field under Metadata management.
    1. Mapping field: Enter image_file.
    2. Data type: Select String.
    3. Filterable: Select True.
    4. Choose Create to create the index.

Review and confirm vector index creation

  5. Now that the vector index has been created, you can ingest the embeddings. To do so, run the following code segment to connect to your Amazon OpenSearch Serverless collection.
# Import required libraries to connect to Amazon OpenSearch Serverless connection
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Initialize endpoint name constant
HOST = "<YOUR_HOST_ENDPOINT_NAME>" # For example, abcdefghi.us-east-1.aoss.amazonaws.com (without https://)

# Initialize and authenticate with the OpenSearch client
credentials = boto3.Session().get_credentials()
auth = AWS4Auth(credentials.access_key, credentials.secret_key, REGION, 'aoss', session_token=credentials.token)
client = OpenSearch(
    hosts=[{'host': HOST, 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    pool_maxsize=300
)

  6. After connecting, you can ingest your embeddings and the associated image key for each vector as shown in the following code.
# Import required library to iterate through dataset
import tqdm.notebook as tq

INDEX_NAME = "<YOUR_VECTOR_INDEX_NAME>"
VECTOR_NAME = "<YOUR_VECTOR_FIELD_NAME>"
VECTOR_MAPPING = "<YOUR_MAPPING_FIELD_NAME>"

# Ingest embeddings into vector index with associate vector and text mapping fields
for idx, record in tq.tqdm(final_embeddings_dataset.iterrows(), total=len(final_embeddings_dataset)):
    body = {
        VECTOR_NAME: record['image_embedding'],
        VECTOR_MAPPING: record['image_key']
    }
    response = client.index(index=INDEX_NAME, body=body)

Use Amazon Rekognition to extract key objects

Now that the embeddings have been created, use Amazon Rekognition to extract objects of interest from your search query. Amazon Rekognition analyzes images to identify objects, people, text, and scenes by detecting labels and generating bounding boxes. In this use case, Amazon Rekognition will be used to detect shoe labels in query images.

To view the bounding boxes around your respective images, run the following code. If you want to apply this to your own sample images, make sure to specify the labels you want to identify. Upon completion of the bounding box and label generation, the extracted objects will be saved in your local directory in the SageMaker Notebook environment.

# Import required libraries to draw bounding box on image
from PIL import Image, ImageDraw, ImageFont

# Function to draw bounding boxes and extract labeled objects
def process_image(image_path, boxes, labels):
    # Load the image
    image = Image.open(image_path)
    
    # Convert RGBA to RGB if necessary
    if image.mode == 'RGBA':
        image = image.convert('RGB')
    
    draw = ImageDraw.Draw(image)
    
    # Font for the label
    try:
        font = ImageFont.truetype("arial.ttf", 15)
    except IOError:
        font = ImageFont.load_default()

    # Counter for unique filenames
    crop_count = 1 
    
    # Draw bounding boxes around specific label of interest (ex. shoe) and extract labeled objects
    for box, label in zip(boxes, labels):
    
        # Change to the specific label you are looking to extract
        if label != "Shoe":
            continue
        
        # Box coordinates
        left = int(image.width * box['Left'])
        top = int(image.height * box['Top'])
        right = left + int(image.width * box['Width'])
        bottom = top + int(image.height * box['Height'])
            
        # Crop the image to the bounding box
        cropped_image = image.crop((left, top, right, bottom))
    
        # Draw label on the cropped image
        cropped_draw = ImageDraw.Draw(cropped_image)
    
        # File name for the output
        file_name = f"extract_{crop_count}.jpg"
        # Save extracted object image locally
        cropped_image.save(file_name)
        print(f"Saved extracted object image: {file_name}")
        crop_count += 1
    
    # Save or display the image with bounding boxes
    image.show()
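
For completeness, the following illustrative sketch (with a placeholder query image path) shows one way the boxes and labels passed to process_image could be produced with the Amazon Rekognition DetectLabels API:

import boto3

rekognition = boto3.client("rekognition")

# Placeholder path to the query image you want to extract objects from
query_image_path = "query_image.jpg"

with open(query_image_path, "rb") as f:
    rek_response = rekognition.detect_labels(Image={"Bytes": f.read()})

# Collect bounding boxes and label names for every detected object instance
boxes, labels = [], []
for detected_label in rek_response["Labels"]:
    for instance in detected_label.get("Instances", []):
        boxes.append(instance["BoundingBox"])
        labels.append(detected_label["Name"])

# Crop the objects of interest using the function defined above
process_image(query_image_path, boxes, labels)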

The following image shows the outputted image with the respective labels within the bounding boxes:

Embed object image

Now that the object of interest within the image has been extracted, you need to generate an embedding for it so that it can be searched against the stored vectors in the Amazon OpenSearch Serverless index. To do so, find the best extracted image in the local directory created when the images were downloaded. Ensure the image is unobstructed, high-quality, and effectively encapsulates the features that you’re searching for. After you have identified the best image, paste its file name as shown in the following code.

# Open the extracted object image file in binary mode
# Paste your extracted image from the local download directory in the notebook below
with open("<YOUR_LOCAL_EXTRACTED_IMAGE (ex. extract_1.jpg)>", "rb") as image_file:
    base64_encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

# Embed the extracted object image
object_embedding = create_image_embedding(image=base64_encoded_image)

# Print the first few numbers of the embedding followed by ...
print(f"Image embedding: {object_embedding[:5]} ...")

Perform a reverse image search

With the embedding of the extracted object, you can now perform a search against the Amazon OpenSearch Serverless vector index to retrieve the closest matching images using the k-NN algorithm. When you created your vector index earlier, you defined vector similarity to be calculated using the Euclidean distance metric with the nmslib engine. With this configuration, you can define the number of results to retrieve from the index and invoke the OpenSearch client with a search request as shown in the following code.

# Define number of images to search and retrieve
K_SEARCHES = 3

# Define search configuration body for k-NN; the query key must match the
# vector field name defined when the index was created
body = {
        "size": K_SEARCHES,
        "_source": {
            "exclude": [VECTOR_NAME],
        },
        "query": {
            "knn": {
                VECTOR_NAME: {
                    "vector": object_embedding,
                    "k": K_SEARCHES,
                }
            }
        },
        "fields": [VECTOR_MAPPING],
    }

# Invoke OpenSearch to search through index with K-NN configurations
knn_response = client.search(index=INDEX_NAME, body=body)
result = []
scores_tracked = set()  # Set to keep track of already retrieved images and their scores

# Loop through response to print the closest matching results
for hit in knn_response["hits"]["hits"]:
    id_ = hit["_id"]
    score = hit["_score"]
    item_id_ = hit["_source"][VECTOR_MAPPING]

    # Check if score has already been tracked, if not, add it to final result
    if score not in scores_tracked:
        final_item = [item_id_, score]
        result.append(final_item)
        scores_tracked.add(score)  # Log score as tracked already

# Print Top K closest matches
print(f"Top {K_SEARCHES} closest embeddings and associated scores: {result}")

Because the preceding search retrieves the file names that are associated with the closest matching vectors, the next step is to fetch each specific image to display the results. This can be accomplished by downloading the specific image from the S3 bucket to a local directory in the notebook, then displaying each one sequentially. Note that if your images are stored within a subdirectory in the bucket, you might need to add the appropriate prefix to the bucket path as shown in the following code.

import os

# Function to display image
def display_image(image_path):
    image = Image.open(image_path)
    image.show()
    
# List of image file names from the K-NN search
image_files = result

# Create a local directory to store downloaded images
download_dir = 'RESULTS'

# Create directory if not exists
os.makedirs(download_dir, exist_ok=True)

# Download and display each image that matches image query
for file_name in image_files:
    print("File Name: " + file_name[0])
    print("Score: " + str(file_name[1]))
    local_path = os.path.join(download_dir, file_name[0])
    # Ensure to add in the necessary prefix before the file name if files are in subdirectories in the bucket
    # ex. s3.download_file(BUCKET_NAME, "training/loafers/"+file_name[0], local_path)
    s3.download_file(BUCKET_NAME, file_name[0], local_path)
    # Open downloaded image and display it
    display_image(local_path)
    print()

The following images show the results for the closest matching products in the S3 bucket related to the extracted object image query:

First match:
File Name: image17.jpeg
Score: 0.64478767
Image of first match from search

Second match:
File Name: image209.jpeg
Score: 0.64304984
Image of second match from search

Third match:
File Name: image175.jpeg
Score: 0.63810235
Image of third match from search

Clean up

To avoid incurring future charges, delete the resources used in this solution.

  1. Delete the Amazon OpenSearch Collection vector index.
  2. Delete the Amazon OpenSearch Serverless collection.
  3. Delete the Amazon SageMaker resources.
  4. Empty and delete the Amazon S3 bucket.

Conclusion

By combining the power of Amazon Rekognition for object detection and extraction, Amazon Titan Multimodal Embeddings for generating vector representations, and Amazon OpenSearch Serverless for efficient vector indexing and search capabilities, you successfully created a robust reverse image search engine. This solution enhances product recommendations by providing precise and relevant results based on visual queries, thereby significantly improving the user experience for ecommerce solutions.

For more information, see the following resources:


About the Authors

Nathan Pogue is a Solutions Architect on the Canadian Public Sector Healthcare and Life Sciences team at AWS. Based in Toronto, he focuses on empowering his customers to expand their understanding of AWS and utilize the cloud for innovative use cases. He is particularly passionate about AI/ML and enjoys building proof-of-concept solutions for his customers.

Waleed Malik is a Solutions Architect with the Canadian Public Sector EdTech team at AWS. He holds six AWS certifications, including the Machine Learning Specialty Certification. Waleed is passionate about helping customers deepen their knowledge of AWS by translating their business challenges into technical solutions.

Toward modular models: Collaborative AI development enables model accountability and continuous learning

Today, development of generalizable AI models requires access to sufficient data and compute resources, which may create challenges for some researchers. Democratizing access to technology across the research community can advance the development of generalizable AI models. By applying the core software development concept of modularity to AI, we can build models that are powerful, efficient, adaptable, and transparent. 

Until recently, AI models were primarily built using monolithic architecture. Though powerful, these models can be challenging to customize and edit compared to modular models with easily interpretable functional components. Today, developers employ modularity to make services more reliable, faster to refine, and easier for multiple users to contribute to simultaneously. One promising research direction that supports this involves shifting AI development towards a modular approach, which could enhance flexibility and improve scalability.

One such approach is to use numerous fine-tuned models designed for specific tasks, known as expert models, and coordinate them to solve broader tasks (see Towards Modular LLMs by Building and Reusing a Library of LoRAs and Learning to Route Among Specialized Experts for Zero-Shot Generalization). These expert models can be developed in a decentralized way. Similar to the benefits of using a microservice architecture, this modular AI approach can be more flexible, cheaper to develop, and more compliant with relevant privacy and legal policies. However, while substantial research has been done on training optimization, coordination methods remain largely unexplored.

Our team is exploring the potential of modular models by focusing on two themes: i) optimizing the training of expert models and ii) refining how expert models coordinate to form a collaborative model. One method for coordinating expert models is to adaptively select the most relevant independently developed expert models for specific tasks or queries. This approach, called MoErging, is similar to Mixture-of-Experts (MoE) approaches but differs in that the routing mechanism is learned after the individual experts are trained. As an initial step, we contributed to creating a taxonomy for organizing recent MoErging methods with the goal of helping establish a shared language for the research community and facilitating easier and fairer comparisons between different methods. 

Assessing existing MoErging methods

Most MoErging methods were developed within the past year, so they don’t reference each other and are difficult to compare. To enable comparison of MoErging methods, we recently collaborated on a survey that establishes a taxonomy for comparing methods and organizes MoErging design choices into three steps: 

  • Expert design: Identifies and uses expert models trained asynchronously by distributed contributors. 
  • Routing design: Routes tasks to the appropriate expert models. 
  • Application design: Applies the merged models to specific tasks or domains. 

Each step is broken down into more detailed choices. For example, in expert design, expert training can be custom or standard, and training data can be private or shared. Custom training requires the MoErging method to prescribe a specific training procedure, while standard training does not. Similarly, shared data means the training data must be accessible to the router; otherwise, the training data is considered private. 
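To make these axes concrete, here is a minimal Python sketch that tags a hypothetical MoErging method along the expert-design axes above and one routing-design axis; the enum and field names are our own illustration and do not correspond to any code released with the survey.

```python
from dataclasses import dataclass
from enum import Enum

# Expert-design axes described in the survey.
class ExpertTraining(Enum):
    CUSTOM = "custom"      # the method prescribes its own expert training procedure
    STANDARD = "standard"  # any conventionally fine-tuned expert can be plugged in

class ExpertData(Enum):
    PRIVATE = "private"    # expert training data never needs to be shared for routing
    SHARED = "shared"      # routing requires access to the expert training data

# One routing-design axis, using the four categories discussed later in this post.
class RoutingStyle(Enum):
    CLASSIFIER = "classifier-based"
    EMBEDDING = "embedding-based"
    NONROUTER = "nonrouter"
    TASK_SPECIFIC = "task-specific"

@dataclass
class MoErgingMethod:
    name: str
    expert_training: ExpertTraining
    expert_data: ExpertData
    routing: RoutingStyle

# A hypothetical entry: standard LoRA experts, private data, embedding-based routing.
example = MoErgingMethod(
    name="example-method",
    expert_training=ExpertTraining.STANDARD,
    expert_data=ExpertData.PRIVATE,
    routing=RoutingStyle.EMBEDDING,
)
print(example.routing.value)  # "embedding-based"
```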

The benefits of modular models discussed below assume that training data doesn’t need to be shared. However, a review of current MoErging methods finds that some approaches do require sharing training data, making certain benefits no longer applicable. 

The survey evaluates 29 different MoErging methods using its taxonomy, which categorizes the design choices into two expert design choices, five routing design choices, and two application design options, shown in Figure 1.

Figure 1: Taxonomy of model MoErging design choices. References in the leaf nodes link to sections of specific papers that implement each choice. We omit references to methods where a particular choice is not applicable. 

One takeaway from the survey is that most MoErging methods can be grouped into four categories based on their routing design choices:

  1. Classifier-based routing: Methods that train the router as a classifier using expert datasets or unseen data. 
  2. Embedding-based routing: Methods that compute embeddings of expert training sets and compare them to a query embedding for routing. 
  3. Nonrouter methods: Methods that do not explicitly train a router but instead initialize the router in an unsupervised manner.  
  4. Task-specific routing: Methods that learn a task-specific routing distribution over the target dataset to improve performance on a specific task. 

While the differences within each category are minor, the differences across categories are significant because they determine the level of data access required for implementation. As a result, data access is a primary factor in determining which methods are applicable and feasible in various settings. 
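As one plausible instantiation of the embedding-based category above, the sketch below summarizes each expert’s training set by a centroid embedding and routes a query to the expert with the highest cosine similarity. The encoder is assumed to exist upstream, and the expert names and helper functions are placeholders we introduce for illustration, not components of any specific MoErging method.

```python
import numpy as np

def build_expert_index(expert_train_embeddings):
    """Summarize each expert's training set by its unit-norm centroid embedding.

    expert_train_embeddings: dict mapping expert name -> array of shape (n_i, d).
    """
    index = {}
    for name, embs in expert_train_embeddings.items():
        centroid = embs.mean(axis=0)
        index[name] = centroid / (np.linalg.norm(centroid) + 1e-12)
    return index

def route(query_embedding, index, top_k=1):
    """Return the top_k experts whose centroids are most cosine-similar to the query."""
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-12)
    scores = {name: float(q @ centroid) for name, centroid in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
experts = {
    "legal_expert": rng.normal(size=(100, 32)),
    "medical_expert": rng.normal(size=(100, 32)) + 0.5,
}
index = build_expert_index(experts)
query = rng.normal(size=32) + 0.5  # closer to the "medical" cluster
print(route(query, index))  # likely ['medical_expert']
```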

Our taxonomy also covers recent approaches to building agentic systems, which could be viewed as specific types of MoErging methods where experts are full language models and routing decisions are made on a step-by-step or example-by-example basis. The optimal level for MoErging may vary depending on the task and the computational resources available to each stakeholder. 

Potential benefits and use cases of modular models 

Modular models can unlock new benefits and use cases for AI, offering a promising approach to addressing challenges in current AI development. Substantial further research is still needed to validate this potential and assess its feasibility.  

Modular AI may: 

  • Allow privacy-conscious contributions.  Teams with sensitive or proprietary data, such as personally identifiable information (PII) and copyrighted content, can contribute expert models and benefit from larger projects without sharing their data. This capacity can make it easier to comply with data privacy and legal standards, which could be valuable for healthcare teams that would benefit from general model capabilities without combining their sensitive data with other training data. 
  • Drive model transparency and accountability.  Modular models allow specific expert models to be identified and, if necessary, removed or retrained. For example, if a module trained on PII, copyrighted, or biased data is identified, it can be removed more easily, eliminating the need for retraining and helping ensure compliance with privacy and ethical standards. 
  • Facilitate model extensibility and continual improvement. Modularity supports continual improvements, allowing new capabilities from expert models to be integrated as they become available. This approach is akin to making localized edits, allowing for continuous, cost-effective improvement. 
  • Lower the barrier to AI development for those with limited compute and data resources. Modular AI can reduce the need for extensive data and compute by creating a system where pretrained experts can be reused, benefiting academics, startups, and teams focused on niche use cases. For example, an AI agent tasked with booking flights on a specific website with limited training data could leverage general navigation and booking skills from other trained AI experts, enabling generalizable and broadly applicable skills without requiring domain-specific training data. We explore this process of transferring skills across tasks in our paper “Multi-Head Routing For Cross-Task Generalization.” 
  • Support personalization.  Modular models make it possible to equip AI agents with experts tailored to individual users or systems. For instance, AI designed to emulate five-time World Chess Champion Magnus Carlsen could enhance a player’s preparation to play a match against him. Experiments suggest that storing knowledge or user profiles in on-demand modules can match or surpass the performance of retrieval-augmented generation (RAG), potentially reducing latency and improving the user’s experience in custom AI applications. 

Current limitations and looking forward 

In this blog, we focused on a type of modular approach that involves training foundation models, which requires substantial compute power and large amounts of data. Despite the advantages of modularity, such as increased flexibility, efficiency, and adaptability, the development of foundation models remains resource-intensive, necessitating high-performance computing and robust datasets to support fine-tuning.  

Recent work has begun to address these challenges by distributing the pretraining process of foundation models. Looking ahead, a promising research direction focuses on exploring how to create a minimal dataset for training “empty foundation models” while shifting most of their capabilities to external pluggable modules. 

Modular methods are evolving rapidly, and we’re excited by their potential. Modularity has the capacity to democratize AI development, improve model accountability, and support efficient continuous learning. With the MoErging taxonomy, we aim to establish a shared language that fosters engagement within the research community. This research is in the early stages, and we welcome community collaboration. If you’re interested in working with us, please reach out to ModularModels@microsoft.com.

Acknowledgements

We would like to thank paper collaborators: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Nabil Omi, Siddhartha Sen, Anurag Sarkar, Jordan T. Ash, Oleksiy Ostapenko, and Laurent Charlin.



Research Focus: Week of November 11, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Look Ma, no markers: holistic performance capture without the hassle

Motion-capture technologies used in film and game production typically focus solely on face, body, or hand capture, requiring complex and expensive hardware and lots of manual intervention from skilled operators. While machine-learning-based approaches can overcome these challenges, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts.

In a recent paper: Look Ma, no markers: holistic performance capture without the hassle, researchers from Microsoft introduce a technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. This approach produces stable world-space results from arbitrary camera rigs while also supporting varied capture environments and clothing. The researchers achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. They evaluate their method on a number of body, face, and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets. 


Building AI Agents for Autonomous Clouds: Challenges and Design Principles

Using AI agents for operational resilience of cloud services, which currently require significant human effort and domain knowledge, is a high-impact application. Interest is growing in AI for IT Operations (AIOps), which aims to automate complex operational tasks like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents.  

In a recent paper: Building AI Agents for Autonomous Clouds: Challenges and Design Principles, researchers from Microsoft lay the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. The researchers also propose AIOpsLab, a prototype implementation leveraging an agent-cloud interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. The paper sets the stage for a modular and robust framework for building, evaluating, and improving agents for autonomous clouds. 



Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

AI-assisted programming offers great promise, but also raises concerns around the trustworthiness of AI-generated code. Proof-oriented languages like F* enable authoring programs backed by machine-checked proofs of correctness. Using AI to generate code and proofs in proof-oriented languages helps mitigate these concerns, while also making proof-oriented programming more accessible to people. 

In a recent preprint: Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming, researchers from Microsoft and external colleagues explore using AI to automate the construction of proof-oriented programs. The researchers curate a dataset of 940,000 lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. The dataset includes around 54,000 top-level F* definitions, each representing a type-directed program and proof synthesis problem. A program fragment checker queries F* to check the correctness of candidate solutions. With this dataset, the researchers explore using AI to synthesize programs and their proofs in F*, finding the performance of fine-tuned smaller language models to compare favorably with LLMs, at much lower computational cost.


One-to-many testing for code generation from (just) natural language

The mostly basic Python programs (MBPP) dataset is commonly used for evaluating natural language models on the task of code generation. Despite its popularity, the original MBPP has two major problems: it relies on providing test cases to generate the right signature and there is poor alignment between “what is asked” and “what is evaluated” using the test cases. 

To address these challenges, in their recent “One-to-many testing for code generation from (just) natural language” paper, researchers from Microsoft introduce the “mostly basic underspecified Python programs” or MBUPP dataset. This dataset adapts MBPP to emphasize the natural language aspect by allowing for some syntactic ambiguity (like not specifying the return type of a function) and evaluating generated code on multiple sets of assertions (like each set covering a different return type). Besides iteratively inspecting LLM results to extend the assertion sets, the researchers carefully remove poor alignment from the instructions (like a specific algorithm to use) and perform a majority vote over slightly paraphrased instructions to improve the quality of the dataset. The researchers compare popular open- and closed-weight models on the original MBPP and adapted MBUPP datasets to highlight the effect of paraphrasing and new test cases on code generation evaluation. The MBUPP dataset is publicly available to encourage its use in evaluating code generation models.
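To illustrate the one-to-many idea, the sketch below checks a generated function against several alternative assertion sets and accepts it if any one set passes in full. The acceptance rule and helper names are our simplification for illustration, not the paper’s exact evaluation harness.

```python
def passes_any(candidate_fn, assertion_sets):
    """Return True if candidate_fn satisfies at least one full set of assertions.

    Each assertion set is a list of (args, expected) pairs; sets can differ,
    e.g. one expecting a list return type and another expecting a tuple.
    """
    for assertions in assertion_sets:
        if all(candidate_fn(*args) == expected for args, expected in assertions):
            return True
    return False

# A generated candidate for "return the two largest values" (underspecified return type).
def two_largest(xs):
    return sorted(xs, reverse=True)[:2]

assertion_sets = [
    [(([1, 5, 3, 4],), [5, 4]), (([2, 2, 1],), [2, 2])],   # list-valued answers
    [(([1, 5, 3, 4],), (5, 4)), (([2, 2, 1],), (2, 2))],   # tuple-valued answers
]

print(passes_any(two_largest, assertion_sets))  # True: the list-valued set passes
```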




2025 Predictions: AI Finds a Reason to Tap Industry Data Lakes


Since the advent of the computer age, industries have been so awash in stored data that most of it never gets put to use.

This data is estimated to be in the neighborhood of 120 zettabytes — the equivalent of more than 100 billion terabytes, or more than 120x the number of grains of sand on every beach around the globe. Now, the world’s industries are putting that untamed data to work by building and customizing large language models (LLMs).

As 2025 approaches, industries such as healthcare, telecommunications, entertainment, energy, robotics, automotive and retail are using those models, combining them with their proprietary data and gearing up to create AI that can reason.

The NVIDIA experts below focus on some of the industries that deliver $88 trillion worth of goods and services globally each year. They predict that AI that can harness data at the edge and deliver near-instantaneous insights is coming to hospitals, factories, customer service centers, cars and mobile devices near you.

But first, let’s hear AI’s predictions for AI. When asked, “What will be the top trends in AI in 2025 for industries?” both Perplexity and ChatGPT 4.0 responded that agentic AI sits atop the list alongside edge AI, AI cybersecurity and AI-driven robots.

Agentic AI is a new category of generative AI that operates virtually autonomously. It can make complex decisions and take actions based on continuous learning and analysis of vast datasets. Agentic AI is adaptable, has defined goals and can correct itself, and can chat with other AI agents or reach out to a human for help.

Now, hear from NVIDIA experts on what to expect in the year ahead:

Kimberly Powell
Vice President of Healthcare

Human-robotic interaction: Robots will assist human clinicians in a variety of ways, from understanding and responding to human commands, to performing and assisting in complex surgeries.

It’s being made possible by digital twins, simulation and AI that train and test robotic systems in virtual environments to reduce risks associated with real-world trials. It also can train robots to react in virtually any scenario, enhancing their adaptability and performance across different clinical situations.

New virtual worlds for training robots to perform complex tasks will make autonomous surgical robots a reality. These surgical robots will perform complex surgical tasks with precision, reducing patient recovery times and decreasing the cognitive workload for surgeons.

Digital health agents: The dawn of agentic AI and multi-agent systems will address the existential challenges of workforce shortages and the rising cost of care.

Administrative health services will become digital humans taking notes for you or making your next appointment — introducing an era of services delivered by software and birthing a service-as-a-software industry.

Patient experience will be transformed with always-on, personalized care services while healthcare staff will collaborate with agents that help them reduce clerical work, retrieve and summarize patient histories, and recommend clinical trials and state-of-the-art treatments for their patients.

Drug discovery and design AI factories: Just as ChatGPT can generate an email or a poem without putting a pen to paper for trial and error, generative AI models in drug discovery can liberate scientific thinking and exploration.

Techbio and biopharma companies have begun combining models that generate, predict and optimize molecules to explore the near-infinite possible target drug combinations before going into time-consuming and expensive wet lab experiments.

The drug discovery and design AI factories will consume all wet lab data, refine AI models and redeploy those models — improving each experiment by learning from the previous one. These AI factories will shift the industry from a discovery process to a design and engineering one.

Rev Lebaredian
Vice President of Omniverse and Simulation Technology

Let’s get physical (AI, that is): Getting ready for AI models that can perceive, understand and interact with the physical world is one challenge enterprises will race to tackle.

While LLMs require reinforcement learning largely in the form of human feedback, physical AI needs to learn in a “world model” that mimics the laws of physics. Large-scale physically based simulations are allowing the world to realize the value of physical AI through robots by accelerating the training of physical AI models and enabling continuous training in robotic systems across every industry.

Cheaper by the dozen: In addition to their smarts (or lack thereof), one big factor that has slowed adoption of humanoid robots has been affordability. As agentic AI brings new intelligence to robots, though, volume will pick up and costs will come down sharply. The average cost of industrial robots is expected to drop to $10,800 in 2025, from $46,000 in 2010 and $27,000 in 2017. As these devices become significantly cheaper, they’ll become as commonplace across industries as mobile devices are.

Deepu Talla
Vice President of Robotics and Edge Computing

Redefining robots: When people think of robots today, they usually picture autonomous mobile robots (AMRs), manipulator arms or humanoids. But tomorrow’s robots are set to be autonomous systems that perceive, reason, plan and act, and then learn.

Soon we’ll be thinking of robots embodied everywhere from surgical rooms and data centers to warehouses and factories. Even traffic control systems or entire cities will be transformed from static, manually operated systems to autonomous, interactive systems embodied by physical AI.

The rise of small language models: To improve the functionality of robots operating at the edge, expect to see the rise of small language models that are energy-efficient and avoid latency issues associated with sending data to data centers. The shift to small language models in edge computing will improve inference in a range of industries, including automotive, retail and advanced robotics.

Kevin Levitt
Global Director of Financial Services

AI agents boost firm operations: AI-powered agents will be deeply integrated into the financial services ecosystem, improving customer experiences, driving productivity and reducing operational costs.

AI agents will take many forms based on each financial services firm’s needs. Human-like 3D avatars will take requests and interact directly with clients, while text-based chatbots will summarize thousands of pages of data and documents in seconds to deliver accurate, tailored insights to employees across all business functions.

AI factories become table stakes: AI use cases in the industry are exploding. This includes improving identity verification for anti-money laundering and know-your-customer regulations, reducing false positives for transaction fraud and generating new trading strategies to improve market returns. AI also is automating document management, reducing funding cycles to help consumers and businesses on their financial journeys.

To capitalize on opportunities like these, financial institutions will build AI factories that use full-stack accelerated computing to maximize performance and utilization, powering AI-enabled applications that serve hundreds, if not thousands, of use cases and helping set themselves apart from the competition.

AI-assisted data governance: Due to the sensitive nature of financial data and stringent regulatory requirements, governance will be a priority for firms as they use data to create reliable and legal AI applications, including for fraud detection, predictions and forecasting, real-time calculations and customer service.

Firms will use AI models to assist in the structure, control, orchestration, processing and utilization of financial data, making the process of complying with regulations and safeguarding customer privacy smoother and less labor intensive. AI will be the key to making sense of and deriving actionable insights from the industry’s stockpile of underutilized, unstructured data.

Richard Kerris
Vice President of Media and Entertainment

Let AI entertain you: AI will continue to revolutionize entertainment with hyperpersonalized content on every screen, from TV shows to live sports. Using generative AI and advanced vision-language models, platforms will offer immersive experiences tailored to individual tastes, interests and moods. Imagine teaser images and sizzle reels crafted to capture the essence of a new show or live event and create an instant personal connection.

In live sports, AI will enhance accessibility and cultural relevance, providing language dubbing, tailored commentary and local adaptations. AI will also elevate binge-watching by adjusting pacing, quality and engagement options in real time to keep fans captivated. This new level of interaction will transform streaming from a passive experience into an engaging journey that brings people closer to the action and each other.

AI-driven platforms will also foster meaningful connections with audiences by tailoring recommendations, trailers and content to individual preferences. AI’s hyperpersonalization will allow viewers to discover hidden gems, reconnect with old favorites and feel seen. For the industry, AI will drive growth and innovation, introducing new business models and enabling global content strategies that celebrate unique viewer preferences, making entertainment feel boundless, engaging and personally crafted.

Ronnie Vasishta
Senior Vice President of Telecoms

The AI connection: Telecommunications providers will begin to deliver generative AI applications and 5G connectivity over the same network. AI radio access network (AI-RAN) will enable telecom operators to transform traditional single-purpose base stations from cost centers into revenue-producing assets capable of providing AI inference services to devices, while more efficiently delivering the best network performance.

AI agents to the rescue: The telecommunications industry will be among the first to dial into agentic AI to perform key business functions. Telco operators will use AI agents for a wide variety of tasks, from suggesting money-saving plans to customers and troubleshooting network connectivity, to answering billing questions and processing payments.

More efficient, higher-performing networks: AI also will be used at the wireless network layer to enhance efficiency, deliver site-specific learning and reduce power consumption. Using AI as an intelligent performance improvement tool, operators will be able to continuously observe network traffic, predict congestion patterns and make adjustments before failures happen, allowing for optimal network performance.

Answering the call on sovereign AI: Nations will increasingly turn to telcos — which have proven experience managing complex, distributed technology networks — to achieve their sovereign AI objectives. The trend will spread quickly across Europe and Asia, where telcos in Switzerland, Japan, Indonesia and Norway are already partnering with national leaders to build AI factories that can use proprietary, local data to help researchers, startups, businesses and government agencies create AI applications and services.

Xinzhou Wu
Vice President of Automotive

Pedal to generative AI metal: Autonomous vehicles will become more performant as developers tap into advancements in generative AI. For example, harnessing foundation models, such as vision language models, provides an opportunity to use internet-scale knowledge to solve one of the hardest problems in the autonomous vehicle (AV) field, namely that of efficiently and safely reasoning through rare corner cases.

Simulation unlocks success: More broadly, new AI-based tools will enable breakthroughs in how AV development is carried out. For example, advances in generative simulation will enable the scalable creation of complex scenarios aimed at stress-testing vehicles for safety purposes. Aside from allowing for testing unusual or dangerous conditions, simulation is also essential for generating synthetic data to enable end-to-end model training.

Three-computer approach: Effectively, new advances in AI will catalyze AV software development across the three key computers underpinning AV development — one for training the AI-based stack in the data center, another for simulation and validation, and a third in-vehicle computer to process real-time sensor data for safe driving. Together, these systems will enable continuous improvement of AV software for enhanced safety and performance of cars, trucks, robotaxis and beyond.

Marc Spieler
Senior Managing Director of Global Energy Industry

Welcoming the smart grid: Do you know when your home’s electricity use peaks each day? You soon will, as utilities around the world embrace smart meters that use AI to manage their grid networks broadly, from big power plants and substations all the way into the home.

As the smart grid takes shape, smart meters that combine software, sensors and accelerated computing, once deemed too expensive to install in millions of homes, will alert utilities when backyard trees brush up against power lines or signal when to offer big rebates to buy back the excess power stored through rooftop solar installations.

Powering up: Delivering the optimal power stack has always been mission-critical for the energy industry. In the era of generative AI, utilities will address this issue in ways that reduce environmental impact.

Expect in 2025 to see a broader embrace of nuclear power as one clean-energy path the industry will take. Demand for natural gas also will grow as it replaces coal and other forms of energy. These resurgent forms of energy are being helped by the increased use of accelerated computing, simulation technology, AI and 3D visualization, which helps optimize design, pipeline flows and storage. We’ll see the same happening at oil and gas companies, which are looking to reduce the impact of energy exploration and production.

Azita Martin
Vice President of Retail, Consumer-Packaged Goods and Quick-Service Restaurants 

Software-defined retail: Supercenters and grocery stores will become software-defined, each running computer vision and sophisticated AI algorithms at the edge. The transition will accelerate checkout, optimize merchandising and reduce shrink — the industry term for a product being lost or stolen.

Each store will be connected to a headquarters AI network, using collective data to become a perpetual learning machine. Software-defined stores that continually learn from their own data will transform the shopping experience.

Intelligent supply chain: Intelligent supply chains created using digital twins, generative AI, machine learning and AI-based solvers will drive billions of dollars in labor productivity and operational efficiencies. Digital twin simulations of stores and distribution centers will optimize layouts to increase in-store sales and accelerate throughput in distribution centers.

Agentic robots working alongside associates will load and unload trucks, stock shelves and pack customer orders. Also, last-mile delivery will be enhanced with AI-based routing optimization solvers, allowing products to reach customers faster while reducing vehicle fuel costs.


Peak Training: Blackwell Delivers Next-Level MLPerf Training Performance


Generative AI applications that use text, computer code, protein chains, summaries, video and even 3D graphics require data-center-scale accelerated computing to efficiently train the large language models (LLMs) that power them.

In MLPerf Training 4.1 industry benchmarks, the NVIDIA Blackwell platform delivered impressive results on workloads across all tests — and up to 2.2x more performance per GPU on LLM benchmarks, including Llama 2 70B fine-tuning and GPT-3 175B pretraining.

In addition, NVIDIA’s submissions on the NVIDIA Hopper platform continued to hold at-scale records on all benchmarks, including a submission with 11,616 Hopper GPUs on the GPT-3 175B benchmark.

Leaps and Bounds With Blackwell

The first Blackwell training submission to the MLCommons Consortium — which creates standardized, unbiased and rigorously peer-reviewed testing for industry participants — highlights how the architecture is advancing generative AI training performance.

For instance, the architecture includes new kernels that make more efficient use of Tensor Cores. Kernels are optimized, purpose-built math operations like matrix-multiplies that are at the heart of many deep learning algorithms.

Blackwell’s higher per-GPU compute throughput and significantly larger and faster high-bandwidth memory allow it to run the GPT-3 175B benchmark on fewer GPUs while achieving excellent per-GPU performance.

Taking advantage of larger, higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were able to run the GPT-3 LLM benchmark without compromising per-GPU performance. The same benchmark run using Hopper needed 256 GPUs.

The Blackwell training results follow an earlier submission to MLPerf Inference 4.1, where Blackwell delivered up to 4x more LLM inference performance versus the Hopper generation. Taking advantage of the Blackwell architecture’s FP4 precision, along with the NVIDIA QUASAR Quantization System, the submission revealed powerful performance while meeting the benchmark’s accuracy requirements.

Relentless Optimization

NVIDIA platforms undergo continuous software development, racking up performance and feature improvements in training and inference for a wide variety of frameworks, models and applications.

In this round of MLPerf training submissions, Hopper delivered a 1.3x improvement on GPT-3 175B per-GPU training performance since the introduction of the benchmark.

NVIDIA also submitted large-scale results on the GPT-3 175B benchmark using 11,616 Hopper GPUs connected with NVIDIA NVLink and NVSwitch high-bandwidth GPU-to-GPU communication and NVIDIA Quantum-2 InfiniBand networking.

NVIDIA Hopper GPUs have more than tripled scale and performance on the GPT-3 175B benchmark since last year. In addition, on the Llama 2 70B LoRA fine-tuning benchmark, NVIDIA increased performance by 26% using the same number of Hopper GPUs, reflecting continued software enhancements.

NVIDIA’s ongoing work on optimizing its accelerated computing platforms enables continued improvements in MLPerf test results — driving performance up in containerized software, bringing more powerful computing to partners and customers on existing platforms and delivering more return on their platform investment.

Partnering Up

NVIDIA partners, including system makers and cloud service providers like ASUSTek, Azure, Cisco, Dell, Fujitsu, Giga Computing, Lambda Labs, Lenovo, Oracle Cloud, Quanta Cloud Technology and Supermicro, also submitted impressive results to MLPerf in this latest round.

A founding member of MLCommons, NVIDIA sees the role of industry-standard benchmarks and benchmarking best practices in AI computing as vital. With access to peer-reviewed, streamlined comparisons of AI and HPC platforms, companies can keep pace with the latest AI computing innovations and access crucial data that can help guide important platform investment decisions.

Learn more about the latest MLPerf results on the NVIDIA Technical Blog


‘Every Industry, Every Company, Every Country Must Produce a New Industrial Revolution,’ Says NVIDIA CEO Jensen Huang at AI Summit Japan


The next technology revolution is here, and Japan is poised to be a major part of it.

At NVIDIA’s AI Summit Japan on Wednesday, NVIDIA founder and CEO Jensen Huang and SoftBank Chairman and CEO Masayoshi Son shared a sweeping vision for Japan’s role in the AI revolution.

Speaking in Tokyo, Huang underscored that AI infrastructure is essential to drive global transformation.

In his talk, he emphasized two types of AI: digital and physical. Digital is represented by AI agents, while physical AI is represented by robotics.

He said Japan is poised to create both types, leveraging its unique language, culture and data.

“Every industry, every company, every country must produce a new industrial revolution,” Huang said, pointing to AI as the catalyst for this shift.

Huang emphasized Japan’s unique position to lead in this AI-driven economy, praising the country’s history of innovation and engineering excellence as well as its technological and cultural panache.

“I can’t imagine a better country to lead the robotics AI revolution than Japan,” Huang said. “You have created some of the world’s best robots. These are the robots we grew up with, the robots we’ve loved our whole lives.”

Huang highlighted the potential of agentic AI—advanced digital agents capable of understanding, reasoning, planning, and taking action—to transform productivity across industries.

He noted that these agents can tackle complex, multi-step tasks, effectively doing “50% of the work for 100% of the people,” turbocharging human productivity.

By turning data into actionable insights, agentic AI offers companies powerful tools to enhance operations without replacing human roles.

SoftBank and NVIDIA to Build Japan’s Largest AI Supercomputer

Among the summit’s major announcements was NVIDIA’s collaboration with SoftBank to build Japan’s most powerful AI supercomputer.

NVIDIA CEO Jensen Huang showcases Blackwell, the company’s advanced AI supercomputing platform, at the AI Summit Japan in Tokyo.

Using the NVIDIA Blackwell platform, SoftBank’s DGX SuperPOD will deliver extensive computing power to drive sovereign AI initiatives, including large language models (LLMs) specifically designed for Japan.

“With your support, we are creating the largest AI data center here in Japan,” said Son, a visionary who, as Huang noted, has been a part of every major technology revolution of the past half-century.

“We should provide this platform to many of those researchers, the students, the startups, so that we can encourage … so that they have a better access [to] much more compute.”

Huang noted that the AI supercomputer project is just one part of the collaboration.

SoftBank also successfully piloted the world’s first combined AI and 5G network, known as AI-RAN (radio access network). The network enables AI and 5G workloads to run simultaneously, opening new revenue possibilities for telecom providers.

“Now with this intelligence network that we densely connect each other, [it will] become one big neural brain for the infrastructure intelligence to Japan,” Son said. “That will be amazing.”

Accelerated Computing and Japan’s AI Infrastructure

Huang emphasized the profound synergy between AI and robotics, highlighting how advancements in artificial intelligence have created new possibilities for robotics across industries.

He noted that as AI enables machines to learn, adapt and perform complex tasks autonomously, robotics is evolving beyond traditional programming.

Huang spoke to developers, researchers and AI industry leaders at this week’s NVIDIA AI Summit Japan.

“I hope that Japan will take advantage of the latest breakthroughs in artificial intelligence and combine that with your world-class expertise in mechatronics,” Huang said. “No country in the world has greater skills in mechatronics than Japan, and this is an extraordinary opportunity to seize.”

NVIDIA aims to develop a national AI infrastructure network through partnerships with Japanese cloud leaders such as GMO Internet Group and SAKURA internet.

Supported by the Japan Ministry of Economy, Trade and Industry, this infrastructure will support sectors like healthcare, automotive and robotics by providing advanced AI resources to companies and research institutions across Japan.

“This is the beginning of a new era… we can’t miss this time,” Huang added.

Read more about all of today’s announcements in the NVIDIA AI Summit Japan online press kit


Japan’s Market Innovators Bring Physical AI to Industries With NVIDIA AI and Omniverse


Robots transporting heavy metal at a Toyota plant. Yaskawa’s robots working alongside human coworkers in factories. To advance efforts like these virtually, Rikei Corporation develops digital twin tooling to assist planning.

And if that weren’t enough, diversified retail holdings company Seven & i Holdings is running digital twin simulations to enhance customer experiences.

Physical AI and industrial AI, powered by NVIDIA Omniverse, Isaac and Metropolis, are propelling Japan’s industrial giants into the future. Such pioneering moves in robotic manipulation, industrial inspection and digital twins for human assistance are on full display at NVIDIA AI Summit Japan this week.

The arrival of generative AI-driven robotics leaps couldn’t come at a better time. With its population in decline, Japan has a critical need for advanced robotics. A report in the Japan Times said the nation is expected to face a shortage of 11 million workers by 2040.

Industrial and physical AI systems are today being accelerated by a three-computer solution that enables robot AI model training, simulation and testing, and deployment.

Looking Into the Future With Toyota Robotics

Toyota is tapping into NVIDIA Omniverse for physics simulation for robot motion and gripping to improve its metal forging capabilities. That’s helping to reduce the time it takes to teach robots to transport forging materials.

Digital representation of robotic arm moving inside an assembly structure
Image courtesy of Toyota.

Toyota is verifying that it can accurately reproduce its robotic work handling and robot motion using NVIDIA PhysX with Omniverse. Omniverse enables modeling digital twins of factories and other environments that accurately duplicate the physical characteristics of objects and systems in the real world, which is foundational to building physical AI for driving next-generation autonomous systems.

Omniverse enables Toyota to model properties like mass, gravity and friction and to compare simulated results with physical tests, supporting its work on manipulation and robot motion.

It also allows Toyota to replicate the expertise of its senior employees with robotics for issues requiring a high degree of skills. And it increases safety and throughput since factory personnel are not required to work in the high temperatures and harsh environments associated with metal-forging production lines.

Driving Automation, Yaskawa Harnesses NVIDIA Isaac 

Yaskawa is a leading global robotics manufacturer that has shipped more than 600,000 robots and offers nearly 200 robot models, including industrial robots for the automotive industry, collaborative robots and dual-arm robots.

robotic arm moving items into storage bins.
Image courtesy of YASKAWA.

The Japanese robotics leader is expanding into new markets with its MOTOMAN NEXT adaptive robot, which is moving into task adaptation, versatility and flexibility. Driven by advanced robotics enabled by the NVIDIA Isaac and Omniverse platforms, Yaskawa’s adaptive robots are focused on delivering automation for the food, logistics, medical and agriculture industries.

Using NVIDIA Isaac Manipulator, a reference workflow of NVIDIA-accelerated libraries and AI models, Yaskawa is integrating AI into its industrial arm robots, giving them the ability to complete a wide range of industrial automation tasks.

Yaskawa is using FoundationPose for precise 6D pose estimation and tracking. These AI models enhance the adaptability and efficiency of Yaskawa’s robotic arms, and the motion control enables sim-to-real transition, making them versatile and effective at performing complex tasks across a wide range of industries.

Additionally, Yaskawa is embracing digital twin and robotics simulations powered by NVIDIA Isaac Sim, built on Omniverse, to accelerate the development and deployment of Yaskawa’s robotic solutions, saving time and resources.

Creating Customer Experiences at Seven & i Holdings With Omniverse, Metropolis

Seven & i Holdings is one of Japan’s largest diversified retail holdings companies. The retailer is running a proof of concept that uses digital simulation to understand customer behaviors at its retail outlets.

Seven & i Holdings is pushing its research activities by tapping into NVIDIA Omniverse and NVIDIA Metropolis to better understand operations across its retail stores. Using NVIDIA Metropolis, a set of developer tools for building vision AI applications, store operations are analyzed with computer vision models, helping improve efficiency and safety. A digital twin of this environment is developed in an Omniverse-based application, along with assets from Blender and animations from SideFX Houdini.

Digital retail store with a person walking down an aisle; simulated sensor captures are visualized above.
Image courtesy of Seven & i Holdings Co.

Combining digital twins with price recognition, object tracking and other AI-based computation enables it to generate useful behavioral insights about retail environments and customer interactions. Such information offers opportunities to dynamically generate and show personalized ads on digital signage displays targeted to customers.

The retailer plans to use Metropolis and the NVIDIA Merlin recommendation engine framework to create tailored suggestions to individual shoppers, responding to customer interests — based on data — like never before.

Virtually Revolutionizing, Rikei Corporation Launches Asset Library for Digital Twins

Rikei Corporation, a systems solutions provider, specializes in spatial computing and extended reality technology for the manufacturing sector.

The technology company has developed JAPAN USD Factory, which is a digital twin asset library specifically for the Japanese manufacturing industry. Developed on NVIDIA Omniverse, JAPAN USD Factory reproduces materials and equipment commonly used in manufacturing sites across Japan in a digital form so that Japanese manufacturers can more easily build digital twins of their factories and warehouses.

 Digital twin design of a manufacturing plant where a number of bins are stored on shelving.
Image courtesy of Rikei

Rikei Corporation aims to streamline various stages of design, simulation and operations for the manufacturing process with these digital assets to enhance productivity with digital twins.

Developed with OpenUSD, a universal 3D asset interchange, JAPAN USD Factory allows developers to access its asset libraries for things like pallets and racks, offering seamless integration across tools and workflows.

To learn more, watch the NVIDIA AI Summit Japan fireside chat with NVIDIA founder and CEO Jensen Huang.


Japan Develops Next-Generation Drug Design, Healthcare Robotics and Digital Health Platforms


To provide high-quality medical care to its population — around 30% of whom are 65 or older — Japan is pursuing sovereign AI initiatives supporting nearly every aspect of healthcare.

AI tools trained on country-specific data and local compute infrastructure are supercharging the abilities of Japan’s clinicians and researchers so they can care for patients, amid an expected shortage of nearly 500,000 healthcare workers by next year.

Breakthrough technology deployments by the country’s healthcare leaders — including in AI-accelerated drug discovery, genomic medicine, healthcare imaging and robotics — are highlighted at the NVIDIA AI Summit Japan, taking place in Tokyo through Nov. 13.

Powered by NVIDIA AI computing platforms like the Tokyo-1 NVIDIA DGX supercomputer, these applications were developed using domain-specific platforms such as NVIDIA BioNeMo for drug discovery, NVIDIA MONAI for medical imaging, NVIDIA Parabricks for genomics and NVIDIA Holoscan for healthcare robotics.

Drug Discovery AI Factories Deepen Understanding, Accuracy and Speed

NVIDIA is supporting Japan’s pharmaceutical market — one of the three largest in the world — with NVIDIA BioNeMo, an end-to-end platform that enables drug discovery researchers to develop and deploy AI models for generating biological intelligence from biomolecular data.

BioNeMo includes a customizable, modular programming framework and NVIDIA NIM microservices for optimized AI inference. New models include AlphaFold2, which predicts the 3D structure of a protein from its amino acid sequence; DiffDock, which predicts the 3D structure of a molecule interacting with a protein; and RFdiffusion, which designs novel protein structures likely to bind with a target molecule.

The platform also features BioNeMo Blueprints, a catalog of customizable reference AI workflows to help developers scale biomolecular AI models to enterprise-grade applications.

The NIM microservice for AlphaFold2 now integrates MMSeqs2-GPU, an evolutionary information retrieval tool that accelerates the traditional AlphaFold2 pipeline by 5x. Led by researchers at Seoul National University, Johannes Gutenberg University Mainz and NVIDIA, this integration enables protein structure prediction in 8 minutes instead of 40 minutes.

At AI Summit Japan, TetraScience, a company that engineers AI-native scientific datasets, announced a collaboration with NVIDIA to industrialize the production of scientific AI use cases to accelerate and improve workflows across the life sciences value chain.

For example, choosing an optimal cell line to produce biologic therapies such as vaccines and monoclonal antibodies is a critical but time-consuming step. TetraScience’s new Lead Clone Assistant uses BioNeMo tools, including the NVIDIA VISTA-2D foundation model for cell segmentation and the Geneformer model for gene expression analysis, to reduce lead clone selection to hours instead of weeks.

Tokyo-based Astellas Pharma uses BioNeMo biomolecular AI models such as ESM-1nv, ESM-2nv and DNABERT to accelerate biologics research. Its AI models are used to generate novel molecular structures, predict how those molecules will bind to target proteins and optimize them to more effectively bind to those target proteins.

Using the BioNeMo framework, Astellas has accelerated chemical molecule generation by more than 30x. The company plans to use BioNeMo NIM microservices to further advance its work.

Japan’s Pharma Companies and Research Institutions Advance Drug Research and Development

Astellas, Daiichi-Sankyo and Ono Pharmaceutical are leading Japanese pharma companies harnessing the Tokyo-1 system, an NVIDIA DGX AI supercomputer built in collaboration with Xeureka, a subsidiary of the Japanese business conglomerate Mitsui & Co, to build AI models for drug discovery. Xeureka is using Tokyo-1 to accelerate AI model development and molecular simulations.

Xeureka is also using NVIDIA H100 Tensor Core GPUs to explore the application of confidential computing to enhance the ability of pharmaceutical companies to collaborate on large AI model training while protecting proprietary datasets.

To further support disease and precision medicine research, genomics researchers across Japan have adopted the NVIDIA Parabricks software suite to accelerate secondary analysis of DNA and RNA data.

Among them is the University of Tokyo Human Genome Center, the main academic institution working on a government-led whole genome project focused on cancer research. The initiative will help researchers identify gene variants unique to Japan’s population and support the development of precision therapeutics.

The genome center is also exploring the use of Giraffe, a tool now available via Parabricks v4.4 that enables researchers to map genome sequences to a pangenome, a reference genome that represents diverse populations.

AI Scanners and Scopes Give Radiologists and Surgeons Real-Time Superpowers

Japan’s healthcare innovators are building AI-augmented systems to support radiologists and surgeons.

Fujifilm has developed an AI application in collaboration with NVIDIA to help surgeons perform surgery more efficiently.

This application uses an AI model developed using NVIDIA DGX systems to convert CT images into 3D simulations to support surgery.

Olympus recently collaborated with NVIDIA and telecommunications company NTT to demonstrate how cloud-connected endoscopes can efficiently run image processing and AI applications in real time. The endoscopes featured NVIDIA Jetson Orin modules for edge computing and connected to a cloud server using the NTT communication platform’s IOWN All-Photonics Network, which introduces photonics-based technology across the network to enable lower power consumption, greater capacity and lower latency.

NVIDIA is also supporting real-time AI-powered robotic systems for radiology and surgery in Japan with Holoscan, a sensor processing platform that streamlines AI model and application development for real-time insights. Holoscan includes a catalog of AI reference workflows for applications including endoscopy and ultrasound analysis.

A neurosurgeon at Showa University, a medical school with multiple campuses across Japan, has adopted Holoscan and the NVIDIA IGX platform for industrial-grade edge AI to develop a surgical microscopy application that takes video footage from surgical scopes and converts it into 3D imagery in real time using AI. With access to 3D reconstructions, surgeons can more easily locate tumors and key structures in the brain to improve the efficiency of procedures.

Japanese surgical AI companies including AI Medical Service (AIM), Anaut, iMed Technologies and Jmees are investigating the use of Holoscan to power applications that provide diagnostic support for endoscopists and surgeons. These applications could detect anatomical structures like organs in real time, with the potential to reduce injury risks, identify conditions such as gastrointestinal cancers and brain hemorrhages, and provide immediate insights to help doctors prepare for and conduct surgeries.

Scaling Healthcare With Digital Health Agents

Older adults have higher rates of chronic conditions and use healthcare services the most — so to keep up with its aging population, Japan-based companies are at the forefront of developing digital health systems to augment patient care.

Fujifilm has launched NURA, a group of health screening centers with AI-augmented medical examinations designed to help doctors test for cancer and chronic diseases with faster examinations and lower radiation doses for CT scans.

Developed using NVIDIA DGX systems, the tool incorporates large language models that create text summaries of medical images. The AI models run on NVIDIA RTX GPUs for inference. Fujifilm is also evaluating the use of MONAI, NeMo and NIM microservices.

To learn more about NVIDIA’s collaborations with Japan’s healthcare ecosystem, watch the NVIDIA AI Summit on-demand session by Kimberly Powell, the company’s vice president of healthcare.
