NVIDIA CEO Jensen Huang to Deliver Keynote Ahead of COMPUTEX 2024

Amid an AI revolution sweeping through trillion-dollar industries worldwide, NVIDIA founder and CEO Jensen Huang will deliver a keynote address ahead of COMPUTEX 2024, in Taipei, outlining what’s next for the AI ecosystem.

Slated for June 2 at the National Taiwan University Sports Center, the address kicks off before the COMPUTEX trade show scheduled to run from June 3-6 at the Taipei Nangang Exhibition Center.

The keynote will be livestreamed at 7 p.m. Taiwan time (4 a.m. PT) on Sunday, June 2, with a replay available at NVIDIA.com.

With over 1,500 exhibitors from 26 countries and an expected crowd of 50,000 attendees, COMPUTEX is one of the world’s premier technology events.

It has long showcased the vibrant technology ecosystem anchored by Taiwan and has become a launching pad for the cutting-edge systems required to scale AI globally.

As a leader in AI, NVIDIA continues to nurture and expand the AI ecosystem. Last year, Huang’s keynote and appearances in partner press conferences exemplified NVIDIA’s role in helping advance partners across the technology industry.

These partners will be out in force this year.

NVIDIA’s partners, including Acer, ASUS, Asrock Rack, Colorful, GIGABYTE, Ingrasys, Inno3D, Inventec, MSI, Palit, Pegatron, PNY, QCT, Supermicro, Wistron, Wiwynn and Zotac, will spotlight new products featuring NVIDIA technology.

In addition to the exhibition and demonstrations, Marc Hamilton, vice president of solutions architecture and engineering at NVIDIA, will take the stage at the TAITRA forum, a key segment of COMPUTEX dedicated to cutting-edge discussions in technology.

As part of the “Let’s Talk Generative AI” forum, Hamilton will present his talk, titled “Infra Build Train Go,” on June 5, from 10-10:30 a.m. at the 701 Conference Room, 7F, Taipei Nangang Exhibition Center Hall 2.

NVIDIA AI Summit

Following the keynote, the NVIDIA AI Summit on June 5 at the Grand Hilai Taipei will delve into the practical applications of AI in manufacturing, healthcare, research and more.

The summit will feature over 20 sessions from industry experts and innovators as well as training sessions for developers. Kimberly Powell, vice president of healthcare and life sciences at NVIDIA, will host a special address on how generative AI is advancing the healthcare technology industry.

Register for the AI Summit.

Generative Modeling with Phase Stochastic Bridges

This paper introduces a novel generative modeling framework grounded in phase space dynamics, taking inspiration from the principles underlying Critically Damped Langevin Dynamics (CLD). Leveraging insights from stochastic optimal control, we construct a favorable path measure in the phase space that proves highly advantageous for generative sampling. A distinctive feature of our approach is the early-stage data prediction capability within the context of propagating generating Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs) processes. This early prediction… (Apple Machine Learning Research)

Boost employee productivity with automated meeting summaries using Amazon Transcribe, Amazon SageMaker, and LLMs from Hugging Face

The prevalence of virtual business meetings in the corporate world, largely accelerated by the COVID-19 pandemic, is here to stay. According to a survey conducted by American Express in 2023, 41% of business meetings are expected to take place in a hybrid or virtual format by 2024. Attending multiple meetings daily and keeping track of all ongoing topics becomes increasingly difficult over time. This can have a negative impact in many ways, from delayed project timelines to loss of customer trust. Writing meeting summaries is the usual remedy to overcome this challenge, but it disturbs the focus required to listen to ongoing conversations.

A more efficient way to manage meeting summaries is to create them automatically at the end of a call through the use of generative artificial intelligence (AI) and speech-to-text technologies. This allows attendees to focus solely on the conversation, knowing that a transcript will be made available automatically at the end of the call.

This post presents a solution to automatically generate a meeting summary from a recorded virtual meeting (for example, using Amazon Chime) with several participants. The recording is transcribed to text using Amazon Transcribe and then processed using Amazon SageMaker Hugging Face containers to generate the meeting summary. The Hugging Face containers host a large language model (LLM) from the Hugging Face Hub.

If you prefer to generate post-call recording summaries with Amazon Bedrock rather than Amazon SageMaker, check out this Bedrock sample solution. For a generative AI-powered Live Meeting Assistant that not only creates post-call summaries but also provides live transcripts, translations, and contextual assistance based on your own company knowledge base, see our new LMA solution.

Solution overview

The entire infrastructure of the solution is provisioned using the AWS Cloud Development Kit (AWS CDK), which is an infrastructure as code (IaC) framework to programmatically define and deploy AWS resources. The framework provisions resources in a safe, repeatable manner, allowing for a significant acceleration of the development process.

Amazon Transcribe is a fully managed service that seamlessly runs automatic speech recognition (ASR) workloads in the cloud. The service allows for simple audio data ingestion, easy-to-read transcript creation, and accuracy improvement through custom vocabularies. Amazon Transcribe’s new ASR foundation model supports 100+ language variants. In this post, we use the speaker diarization feature, which enables Amazon Transcribe to differentiate between a maximum of 10 unique speakers and label a conversation accordingly.

Hugging Face is an open-source machine learning (ML) platform that provides tools and resources for the development of AI projects. Its key offering is the Hugging Face Hub, which hosts a vast collection of over 200,000 pre-trained models and 30,000 datasets. The AWS partnership with Hugging Face allows a seamless integration through SageMaker with a set of Deep Learning Containers (DLCs) for training and inference, and Hugging Face estimators and predictors for the SageMaker Python SDK.

Generative AI CDK Constructs, an open-source extension of AWS CDK, provides well-architected multi-service patterns to quickly and efficiently create repeatable infrastructure required for generative AI projects on AWS. For this post, we illustrate how it simplifies the deployment of foundation models (FMs) from Hugging Face or Amazon SageMaker JumpStart with SageMaker real-time inference, which provides persistent and fully managed endpoints to host ML models. They are designed for real-time, interactive, and low-latency workloads and provide auto scaling to manage load fluctuations. For all languages that are supported by Amazon Transcribe, you can find FMs from Hugging Face supporting summarization in the corresponding languages.

The following diagram depicts the automated meeting summarization workflow.

Architecture Diagram

The workflow consists of the following steps:

  1. The user uploads the meeting recording as an audio or video file to the project’s Amazon Simple Storage Service (Amazon S3) bucket, in the /recordings folder.
  2. Every time a new recording is uploaded to this folder, an AWS Lambda Transcribe function is invoked and initiates an Amazon Transcribe job that converts the meeting recording into text (a minimal sketch of this function follows the list). Transcripts are then stored in the project’s S3 bucket under /transcriptions/TranscribeOutput/.
  3. This triggers the Inference Lambda function, which preprocesses the transcript file into an adequate format for ML inference, stores it in the project’s S3 bucket under the prefix /summaries/InvokeInput/processed-TranscribeOutput/, and invokes a SageMaker endpoint. The endpoint hosts the Hugging Face model that summarizes the processed transcript. The summary is loaded into the S3 bucket under the prefix /summaries. Note that the prompt template used in this example includes a single instruction; however, for more sophisticated requirements, the template can be easily extended to tailor the solution to your own use case.
  4. This S3 event triggers the Notification Lambda function, which pushes the summary to an Amazon Simple Notification Service (Amazon SNS) topic.
  5. All subscribers of the SNS topic (such as meeting attendees) receive the summary in their email inbox.
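
The following is a minimal sketch of the Transcribe Lambda function in step 2, assuming boto3, an S3 ObjectCreated trigger, English-language audio, and the output prefix described above; the actual function in the repository may differ:

import urllib.parse
import uuid

import boto3

transcribe = boto3.client("transcribe")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event on the /recordings prefix.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    transcribe.start_transcription_job(
        TranscriptionJobName=f"meeting-{uuid.uuid4()}",
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        LanguageCode="en-US",  # assumption; adjust for other languages
        OutputBucketName=bucket,
        OutputKey="transcriptions/TranscribeOutput/",
        Settings={
            # Speaker diarization: label up to 10 distinct speakers.
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": 10,
        },
    )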

In this post, we deploy the Mistral 7B Instruct, an LLM available in the Hugging Face Model Hub, to a SageMaker endpoint to perform the summarization tasks. Mistral 7B Instruct is developed by Mistral AI. It is equipped with over 7 billion parameters, enabling it to process and generate text based on user instructions. It has been trained on a wide-ranging corpus of text data to understand various contexts and nuances of language. The model is designed to perform tasks such as answering questions, summarizing information, and creating content, among others, by following specific prompts given by users. Its effectiveness is measured through metrics like perplexity, accuracy, and F1 score, and it is fine-tuned to respond to instructions with relevant and coherent text outputs.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Deploy the solution

To deploy the solution in your own AWS account, refer to the GitHub repository to access the full source code of the AWS CDK project in Python:

git clone https://github.com/aws-samples/audio-conversation-summary-with-hugging-face-and-transcribe.git
cd audio-conversation-summary-with-hugging-face-and-transcribe/infrastructure
pip install -r requirements.txt

If you are deploying AWS CDK assets for the first time in your AWS account and the AWS Region you specified, you need to run the bootstrap command first. It sets up the baseline AWS resources and permissions required for AWS CDK to deploy AWS CloudFormation stacks in a given environment:

cdk bootstrap aws://<ACCOUNT_ID>/<AWS_REGION>

Finally, run the following command to deploy the solution. Specify the summary’s recipient mail address in the SubscriberEmailAddress parameter:

cdk deploy --parameters SubscriberEmailAddress="<SUBSCRIBER_MAIL_ADDRESS>"

Test the solution

We have provided a few sample meeting recordings in the data folder of the project repository. You can upload the test.mp4 recording into the project’s S3 bucket under the /recordings folder. The summary will be saved in Amazon S3 and sent to the subscriber. The end-to-end duration is approximately 2 minutes given an input of approximately 250 tokens.
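
For example, assuming default AWS credentials and using a placeholder for the bucket name created by the CDK stack, you can upload the sample recording with boto3:

import boto3

s3 = boto3.client("s3")
# Replace <PROJECT_BUCKET_NAME> with the bucket created by the CDK stack.
s3.upload_file("data/test.mp4", "<PROJECT_BUCKET_NAME>", "recordings/test.mp4")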

The following figure shows the input conversation and output summary.

Limitations

This solution has the following limitations:

  • The model provides high-accuracy completions for the English language. You can use other languages such as Spanish, French, or Portuguese, but the quality of the completions may degrade. You can find other Hugging Face models that are better suited for other languages.
  • The model used in this post is limited by a context length of approximately 8,000 tokens, which equates to approximately 6,000 words. If a larger context length is required, you can replace the model by referencing the new model ID in the respective AWS CDK construct.
  • Like other LLMs, Mistral 7B Instruct may hallucinate, generating content that strays from factual reality or includes fabricated information.
  • The format of the recordings must be either .mp4, .mp3, or .wav.

Clean up

To delete the deployed resources and stop incurring charges, run the following command:

cdk destroy

Alternatively, to use the AWS Management Console, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the stack called Text-summarization-Infrastructure-stack and choose Delete.

Conclusion

In this post, we proposed an architecture pattern to automatically transform your meeting recordings into insightful conversation summaries. This workflow showcases how the AWS Cloud and Hugging Face can help you accelerate your generative AI application development by orchestrating a combination of managed AI services such as Amazon Transcribe, and externally sourced ML models from the Hugging Face Hub, such as those from Mistral AI.

If you are eager to learn more about how conversation summaries can apply to a contact center environment, you can deploy this technique in our suite of solutions for Live Call Analytics and Post Call Analytics.

References

Mistral 7B release post, by Mistral AI

Our team

This post has been created by AWS Professional Services, a global team of experts that can help realize desired business outcomes when using the AWS Cloud. We work together with your team and your chosen member of the AWS Partner Network (APN) to implement your enterprise cloud computing initiatives. Our team provides assistance through a collection of offerings that help you achieve specific outcomes related to enterprise cloud adoption. We also deliver focused guidance through our global specialty practices, which cover a variety of solutions, technologies, and industries.


About the Authors

Gabriel Rodriguez Garcia is a Machine Learning engineer at AWS Professional Services in Zurich. In his current role, he has helped customers achieve their business goals on a variety of ML use cases, ranging from setting up MLOps inference pipelines to developing a fraud detection application. Whenever he is not working, he enjoys doing physical activities, listening to podcasts, or reading books.

Jahed Zaïdi is an AI & Machine Learning specialist at AWS Professional Services in Paris. He is a builder and trusted advisor to companies across industries, helping businesses innovate faster and on a larger scale with technologies ranging from generative AI to scalable ML platforms. Outside of work, you will find Jahed discovering new cities and cultures, and enjoying outdoor activities.

Mateusz Zaremba is a DevOps Architect at AWS Professional Services. Mateusz supports customers at the intersection of machine learning and DevOps specialization, helping them to bring value efficiently and securely. Beyond tech, he is an aerospace engineer and avid sailor.

Kemeng Zhang is currently working at AWS Professional Services in Zurich, Switzerland, with a specialization in AI/ML. She has been part of multiple NLP projects, from behavioral change in digital communication to fraud detection. Apart from that, she is interested in UX design and playing cards.

How Veritone uses Amazon Bedrock, Amazon Rekognition, Amazon Transcribe, and information retrieval to update their video search pipeline

This post is co-written with Tim Camara, Senior Product Manager at Veritone.

Veritone is an artificial intelligence (AI) company based in Irvine, California. Founded in 2014, Veritone empowers people with AI-powered software and solutions for various applications, including media processing, analytics, advertising, and more. It offers solutions for media transcription, facial recognition, content summarization, object detection, and other AI capabilities to solve the unique challenges professionals face across industries.

Veritone began its journey with its foundational AI operating system, aiWARE™, solving industry and brand-specific challenges by building applications on top of this powerful technology. Growing in the media and entertainment space, Veritone solves media management, broadcast content, and ad tracking issues. Alongside these applications, Veritone offers media services including AI-powered audio advertising and influencer marketing, content licensing and media monetization services, and professional services to build bespoke AI solutions.

With a decade of enterprise AI experience, Veritone supports the public sector, working with US federal government agencies, state and local government, law enforcement agencies, and legal organizations to automate and simplify evidence management, redaction, person-of-interest tracking, and eDiscovery. Veritone has also expanded into the talent acquisition space, serving HR teams worldwide with its powerful programmatic job advertising platform and distribution network.

Using generative AI and new multimodal foundation models (FMs) could be very strategic for Veritone and the businesses they serve, because it would significantly improve media indexing and retrieval based on contextual meaning—a critical first step to eventually generating new content. Building enhanced semantic search capabilities that analyze media contextually would lay the groundwork for creating AI-generated content, allowing customers to produce customized media more efficiently.

Veritone’s current media search and retrieval system relies on keyword matching of metadata generated from ML services, including information related to faces, sentiment, and objects. With recent advances in large language models (LLMs), Veritone has updated its platform with these powerful new AI capabilities. Looking ahead, Veritone wants to take advantage of new advanced FM techniques to improve the quality of media search results in its Digital Media Hub (DMH) and grow the number of users by achieving a better user experience.

In this post, we demonstrate how to use enhanced video search capabilities by enabling semantic retrieval of videos based on text queries. We match the most relevant videos to text-based search queries by incorporating new multimodal embedding models like Amazon Titan Multimodal Embeddings to encode all visual, visual-meta, and transcription data. The primary focus is building a robust text search that goes beyond traditional word-matching algorithms as well as an interface for comparing search algorithms. Additionally, we explore narrowing retrieval to specific shots within videos (a shot is a series of interrelated consecutive pictures taken contiguously by a single camera representing a continuous action in time and space). Overall, we aim to improve video search through cutting-edge semantic matching, providing an efficient way to find videos relevant to your rich textual queries.

Solution overview

We use the following AWS services to implement the solution:

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

The current architecture consists of three components:

  • Metadata generation – This component generates metadata from a video archive, processes it, and creates embeddings for search indexing. The videos from Amazon S3 are retrieved and converted to H264 vcodec format using the FFmpeg library. The processed videos are sent to AWS services like Amazon Rekognition, Amazon Transcribe, and Amazon Comprehend to generate metadata at shot level and video level. We use the Amazon Titan Text and Multimodal Embeddings models to embed the metadata and the video frames and index them in OpenSearch Service. We use AWS Step Functions to orchestrate the entire pipeline.
  • Search – A UI-based video search pipeline takes in the user query as input and retrieves relevant videos. The user query invokes a Lambda function. Based on the search method selected, you either perform a text- or keyword-based search or an embedding-based search. The search body is sent to OpenSearch Service to retrieve video results at the shot level, which is displayed to the user.
  • Evaluation – The UI enables you to perform qualitative evaluation against different search settings. You enter a query and, based on the search settings, video results are retrieved from OpenSearch. You can view the results and provide feedback by voting for the winning setting.

The following diagram illustrates the solution architecture.

The high-level takeaways from this work are the following:

  • Using an Amazon Rekognition API to detect shots and index them achieved better retrieval recall (at least a 50% improvement) than performing the same indexing at the video level
  • Incorporating the Amazon Titan Text Embeddings model to semantically retrieve the video results instead of using raw text generated by Amazon Rekognition and Amazon Transcribe boosted the recall performance by 52%
  • The Amazon Titan Multimodal Embeddings model showed high capability to encode visual information of video image frames and achieved the best performance when combined with text embeddings of Amazon Rekognition and Amazon Transcribe text metadata, improving on baseline metrics by up to three times
  • The A/B evaluation UI that we developed to test new search methods and features proved to be effective

Detailed quantitative analysis of these conclusions is discussed later in this post.

Metadata generation pipeline

The video metadata generation pipeline consists of processing video files using AWS services such as Amazon Transcribe, Amazon Rekognition, and Amazon Comprehend, as shown in the following diagram. The metadata is generated at the shot level for a video.

In this section, we discuss the details of each service and the workflow in more detail.

Amazon Transcribe

The transcription for the entire video is generated using the StartTranscriptionJob API. When the job is complete, you can obtain the raw transcript data using GetTranscriptionJob. The GetTranscriptionJob returns a TranscriptFileUri, which can be processed to get the speakers and transcripts based on a timestamp. The file formats supported by Amazon Transcribe are AMR, FLAC (recommended), M4A, MP3, MP4, Ogg, WebM, and WAV (recommended).
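
As a rough illustration, assuming boto3 and placeholder job and object names, starting a transcription job and retrieving the TranscriptFileUri might look like this:

import time

import boto3

transcribe = boto3.client("transcribe")

job_name = "video-transcription-job"  # placeholder name
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://<BUCKET>/videos/sample.mp4"},  # placeholder URI
    MediaFormat="mp4",
    LanguageCode="en-US",
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 10},
)

# Poll until the job finishes, then fetch the TranscriptFileUri.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(15)

transcript_uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]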

The raw transcripts are further processed and stored with timestamps.

Amazon Rekognition

Amazon Rekognition requires the video to be encoded using the H.264 codec and formatted to either MPEG-4 or MOV. We used FFmpeg to format the videos in Amazon S3 to the required vcodec. FFmpeg is a free and open-source software project in the form of a command line tool designed for processing video, audio, and other multimedia files and streams. Python provides a wrapper library around the tool called ffmpeg-python.
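
For example, a conversion to the H.264 codec in an MP4 container with ffmpeg-python might look like the following (file names are placeholders):

import ffmpeg

# Re-encode the source video with the H.264 codec into an MP4 container.
(
    ffmpeg
    .input("input_video.mov")
    .output("output_video.mp4", vcodec="libx264")
    .run(overwrite_output=True)
)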

The solution runs Amazon Rekognition APIs for label detection, text detection, celebrity detection, and face detection on videos. The metadata generated for each video by the APIs is processed and stored with timestamps. The videos are then segmented into individual shots. With Amazon Rekognition, you can detect the start, end, and duration of each shot as well as the total shot count for a content piece. The video shot detection job starts with the StartSegmentDetection API, which returns a jobId that can be used to monitor status with the GetSegmentDetection API. When the video segmentation status changes to Succeeded, for each shot, you parse the previously generated Amazon Rekognition API metadata using the shot’s timestamp. You then append this parsed metadata to the shot record. Similarly, the full transcript from Amazon Transcribe is segmented using the shot start and end timestamps to create shot-level transcripts.
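
A minimal sketch of these shot detection calls, assuming boto3 and placeholder bucket and object names:

import time

import boto3

rekognition = boto3.client("rekognition")

# Start shot detection on a video stored in Amazon S3.
response = rekognition.start_segment_detection(
    Video={"S3Object": {"Bucket": "<BUCKET>", "Name": "videos/sample.mp4"}},
    SegmentTypes=["SHOT"],
)
job_id = response["JobId"]

# Poll until the segmentation job succeeds, then read the detected shots.
while True:
    result = rekognition.get_segment_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

for segment in result.get("Segments", []):
    shot = segment["ShotSegment"]
    print(shot["Index"], segment["StartTimestampMillis"], segment["EndTimestampMillis"])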

Amazon Comprehend

The temporal transcripts are then processed by Amazon Comprehend to detect entities and sentiments using the DetectEntities, DetectSentiment, and DetectTargetedSentiment APIs.
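
As a rough illustration, assuming boto3 and a placeholder shot-level transcript, the three calls might look like the following:

import boto3

comprehend = boto3.client("comprehend")

# Placeholder shot-level transcript.
shot_transcript = "Speaker 0: Welcome back to the show. Speaker 1: Thanks, it's great to be here."

entities = comprehend.detect_entities(Text=shot_transcript, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=shot_transcript, LanguageCode="en")
targeted = comprehend.detect_targeted_sentiment(Text=shot_transcript, LanguageCode="en")

print(entities["Entities"])    # detected entities with types and confidence scores
print(sentiment["Sentiment"])  # overall sentiment label
print(targeted["Entities"])    # per-entity (targeted) sentiment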

Metadata processing

The shot-level metadata generated by the pipeline is processed to stage it for embedding generation. The goal of this processing is to aggregate useful information and remove null or less significant information that wouldn’t add value for embedding generation.

The processing algorithm is as follows:

rekognition_metadata
  - shot_metadata: extract StartFrameNumber and EndFrameNumber
  - celeb_metadata: extract celeb_metadata
  - label_metadata: extract unique labels
  - text_metadata: extract unique text labels if there are more than 3 words (the raw output is noisy, containing "-", "null", and other values)
  - face_analysis_metadata: extract unique list of AgeRange, Emotions, Gender
We combine all Rekognition text data into the `rek_text_metadata` string
transcribe_metadata
  - transcribe_metadata: check the word count of the conversation across all speakers;
    if it is more than 50 words, mark it for a summarization task with Amazon Bedrock
comprehend_metadata
  - comprehend_metadata: extract sentiment
  - comprehend_metadata: extract targeted sentiment scores for words with score > 0.9

Large transcript summarization

Large transcripts from the processed metadata are summarized using the Anthropic Claude 2 model. After summarizing the transcript, we extract the names of the key characters mentioned in the summary as well as the important keywords.
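
As a rough sketch, assuming boto3, Amazon Bedrock access to Anthropic Claude 2, and a placeholder transcript and prompt, the summarization call might look like the following:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

transcript = "<large shot- or video-level transcript>"  # placeholder
prompt = (
    "\n\nHuman: Summarize the following transcript, then list the key "
    f"characters mentioned and the most important keywords:\n{transcript}"
    "\n\nAssistant:"
)

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-v2",
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 512, "temperature": 0.1}),
)
summary = json.loads(response["body"].read())["completion"]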

Embeddings generation

In this section, we discuss the details for generating shot-level and video-level embeddings.

Shot-level embeddings

We generate two types of embeddings: text and multimodal. To understand which metadata and service contributes to the search performance and by how much, we create a varying set of embeddings for experimental analysis.

We implement the following with Amazon Titan Multimodal Embeddings:

  • Embed image:
    • TMM_shot_img_embs – We sample the middle frame from every shot and embed them. We assume the middle frame in the shot captures the semantic nuance in the entire shot. You can also experiment with embedding all the frames and averaging them.
    • TMM_rek_text_shot_emb – We sample the middle frame from every shot and embed it along with Amazon Rekognition text data.
    • TMM_transcribe_shot_emb – We sample the middle frame from every shot and embed it along with Amazon Transcribe text data.
  • Embed text (to compare if the text data is represented well with the LLM or multimodal model, we also embed them with Amazon Titan Multimodal):
    • TMM_rek_text_emb – We embed the Amazon Rekognition text as multimodal embeddings without the images.
    • TMM_transcribe_emb – We embed the Amazon Transcribe text as multimodal embeddings without the images.

We implement the following with the Amazon Titan Text Embeddings model (a minimal invocation sketch for both embedding models follows this list):

  • Embed text:
    • TT_rek_text_emb – We embed the Amazon Rekognition text as text embeddings
    • TT_transcribe_emb – We embed the Amazon Transcribe text as text embeddings
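
The following is a minimal sketch of these embedding calls, assuming boto3, Amazon Bedrock access to the Titan models, and placeholder frame and text inputs; it illustrates the API usage rather than the exact pipeline code:

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder inputs: the middle frame of a shot and its text metadata.
with open("shot_middle_frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")
shot_text = "<rek_text_metadata or Amazon Transcribe text for the shot>"

# Multimodal embedding of the frame plus text (e.g., TMM_rek_text_shot_emb).
mm_response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=json.dumps({"inputImage": frame_b64, "inputText": shot_text}),
)
mm_embedding = json.loads(mm_response["body"].read())["embedding"]

# Text-only embedding of the same metadata (e.g., TT_rek_text_emb).
txt_response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": shot_text}),
)
txt_embedding = json.loads(txt_response["body"].read())["embedding"]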

Video-level embeddings

If a video has only one shot (a small video capturing a single action), the embeddings will be the same as shot-level embeddings.

For videos that have more than one shot, we implement the following using the Amazon Titan Multimodal Embeddings Model:

  • Embed image:
    • TMM_shot_img_embs – We sample K images with replacement across all the shot-level metadata, generate embeddings, and average them
    • TMM_rek_text_shot_emb – We sample K images with replacement across all the shot-level metadata, embed it along with Amazon Rekognition text data, and average them.
    • TMM_transcribe_shot_emb – We sample K images with replacement across all the shot-level metadata, embed it along with Amazon Transcribe text data, and average them
  • Embed text:
    • TMM_rek_text_emb – We combine all the Amazon Rekognition text data and embed it as multimodal embeddings without the images
    • TMM_transcribe_emb – We combine all the Amazon Transcribe text data and embed it as multimodal embeddings without the images

We implement the following using the Amazon Titan Text Embeddings model:

  • Embed text:
    • TT_rek_text_emb – We combine all the Amazon Rekognition text data and embed it as text embeddings
    • TT_transcribe_emb – We combine all the Amazon Transcribe text data and embed it as text embeddings

Search pipeline

In this section, we discuss the components of the search pipeline.

Search index creation

We use an OpenSearch cluster (OpenSearch Service domain) with t3.medium.search to store and retrieve indexes for our experimentation with text, knn_vector, and Boolean fields indexed. We recommend exploring Amazon OpenSearch Serverless for production deployment for indexing and retrieval. OpenSearch Serverless can index billions of records and has expanded its auto scaling capabilities to efficiently handle tens of thousands of query transactions per minute.

The following screenshots are examples of the text, Boolean, and embedding fields that we created.
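
As a rough illustration only (the field names mirror those used later in this post; the Boolean field, vector dimension, and connection settings are assumptions), a k-NN index mapping could be created with opensearch-py as follows:

from opensearchpy import OpenSearch

# Endpoint and authentication are placeholders; configure them for your domain.
client = OpenSearch(hosts=[{"host": "<DOMAIN_ENDPOINT>", "port": 443}], use_ssl=True)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "transcribe_metadata": {"type": "text"},
            "rek_text_metadata": {"type": "text"},
            "is_shot": {"type": "boolean"},  # hypothetical Boolean field
            # The dimension must match the embedding model used
            # (for example, 1,024 for Titan Multimodal Embeddings by default).
            "TMM_rek_text_shot_emb": {"type": "knn_vector", "dimension": 1024},
        }
    },
}
client.indices.create(index="video-shots", body=index_body)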

Query flow

The following diagram illustrates the query workflow.

You can use a user query to compare the video records using text or semantic (embedding) search for retrieval.

For text-based retrieval, we use the search query as input to retrieve results from OpenSearch Service using the search fields transcribe_metadata, transcribe_summary, transcribe_keyword, transcribe_speakers, and rek_text_metadata:

OpenSearch Input

search_fields=[
    "transcribe_metadata",
    "transcribe_summary",
    "transcribe_keyword",
    "transcribe_speakers",
    "rek_text_metadata"
]
search_body = { 
   "query": { 
      "multi_match": { 
          "query": search_query, 
          "fields": search_fields 
      } 
   } 
}

For semantic retrieval, the query is embedded using the amazon.titan-embed-text-v1 or amazon.titan-embed-image-v1 model, and the resulting embedding is used as input to retrieve results from OpenSearch Service using the search field name, which can match the metadata embedding of choice:

OpenSearch Input

search_body = {
        "size": <number of top results>,
        "fields": ["name"],
        "query": {
            "knn": {
                vector_field: {"vector": <embedding>, "k": <length of embedding>}
            }
       },
}

Search results combination

Exact match and semantic search have their own benefits depending on the application. Users who search for a specific celebrity or movie name would benefit from an exact match search, whereas users looking for thematic queries like “summer beach vibes” and “candlelit dinner” would find semantic search results more applicable. To enable the best of both, we combine the results from both types of searches. Additionally, different embeddings could capture different semantics (for example, Amazon Transcribe text embedding vs. image embedding with a multimodal model). Therefore, we also explore combining different semantic search results.

To combine search results from different search methods and different score ranges, we used the following logic:

  1. Normalize the scores from each results list independently to a common 0–1 range using rank_norm.
  2. Sum the weighted normalized scores for each result video from all the search results.
  3. Sort the results based on the score.
  4. Return the top K results.

We use the rank_norm method, where the score is calculated based on the rank of each video in the list. The following is the Python implementation of this method:

def rank_norm(results):
    n_results = len(results)
    normalized_results = {}
    for i, doc_id in enumerate(results.keys()):
        normalized_results[doc_id] = 1 - (i / n_results)
    ranked_normalized_results = sorted(
        normalized_results.items(), key=lambda x: x[1], reverse=True
    )
    return dict(ranked_normalized_results)
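
The following is a sketch of steps 2–4, assuming each method's result list is merged using placeholder weights:

def combine_results(results_per_method, weights, top_k=10):
    """Combine rank-normalized scores from several search methods (steps 2-4)."""
    combined = {}
    for method, results in results_per_method.items():
        # results must be ordered by decreasing relevance, as rank_norm expects.
        for doc_id, score in rank_norm(results).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[method] * score
    # Sort by combined score and return the top K results.
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return dict(ranked[:top_k])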

Evaluation pipeline

In this section, we discuss the components of the evaluation pipeline.

Search and evaluation UI

The following diagram illustrates the architecture of the search and evaluation UI.

The UI webpage is hosted in an S3 bucket and deployed using Amazon CloudFront distributions. The current approach uses an API key for authentication. This can be enhanced by using Amazon Cognito and registering users. The user can perform two actions on the webpage:

  • Search – Enter the query to retrieve video content
  • Feedback – Based on the results displayed for a query, vote for the winning method

We create two API endpoints using Amazon API Gateway: GET /search and POST /feedback. The following screenshot illustrates our UI with two retrieval methods that have been anonymized for the user for a bias-free evaluation.

GET /search

We pass two QueryStringParameters with this API call:

  • query – The user input query
  • method – The method the user is evaluating

This API is created with a proxy integration to a Lambda function. The Lambda function processes the query and, based on the method used, retrieves results from OpenSearch Service. The results are then processed to retrieve videos from the S3 bucket and displayed on the webpage. In the search UI, we use a specific method (search setting) to retrieve results:

Request
?query=<>&method=<>

Response

{
    "results": [
        {"name": <video-name>, "score": <score>}, 
        {"name": <video-name>, "score": <score>},
        ...
    ]
}

The following is a sample request:

?query=candlelit dinner&method=MethodB

The following screenshot shows our results.

POST /feedback

Given a query, each method’s video content and video name are displayed on the webpage. Based on the relevance of the results, the user can vote on whether a particular method performs better than the other (win or lose) or whether the methods are tied. The API has a proxy integration to Lambda, which stores these results in an S3 bucket. In the evaluation UI, you can analyze the method search results to find the best search configuration setting. The request body uses the following syntax:

Request Body

{
    "result": <winning method>,
    "searchQuery": <query>,
    "sessionId": <current-session-id>,
    "Method<>": {
        "methodType": <Type of method used>,
        "results": "[{\"name\": <video-name>, \"score\": <score>}]"
    },
    "Method<>": {
        "methodType": <Type of method used>,
        "results": "[{\"name\": \"1QT426_s01\", \"score\": 1.5053753}]"
    }
}

The following screenshot shows a sample request.

Experiments and results

In this section, we discuss the datasets used in our experiments and the quantitative and qualitative evaluations based on the results.

Short videos dataset

This dataset includes 500 videos with an average length of 20 seconds. Each video has manually written metadata such as keywords and descriptions. In general, the videos in this dataset are related to travel, vacation, and restaurant topics.

The majority of videos are less than 20 seconds and the maximum is 400 seconds, as illustrated in the following figure.

Long videos dataset

The second dataset has 300 high-definition videos with a video length ranging from 20–160 minutes, as illustrated in the following figure.

Quantitative evaluation

We use the following metrics in our quantitative evaluation:

  • Mean reciprocal rank – Mean reciprocal rank (MRR) measures the inverse of the position number of the most relevant item in search results.
  • Recall@topK – We measure recall at top K as the percentage of correctly retrieved videos out of the desired video search results (ground truth). For example:

A, B, C are related (GT)
A, D, N, M, G are the TopK retrieved videos
Recall @TOP5 = 1/3

We compute these metrics using a ground truth dataset provided by Veritone that had mappings of search query examples to relevant video IDs.

The following table summarizes the top three retrieval methods from the long videos dataset (% improvement over baseline).

| Methods | Video-level MRR vs. video-level baseline MRR | Shot-level MRR vs. video-level baseline MRR | Video-level Recall@top10 vs. video-level baseline Recall@top10 | Shot-level Recall@top10 vs. video-level baseline Recall@top10 |
|---|---|---|---|---|
| Raw text: Amazon Transcribe + Amazon Rekognition | Baseline comparison | N/A | – | – |
| Semantic: Amazon Transcribe + Amazon Rekognition | 0.84% | 52.41% | 19.67% | 94.00% |
| Semantic: Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal | 37.31% | 81.19% | 71.00% | 93.33% |
| Semantic: Amazon Transcribe + Amazon Titan Multimodal | 15.56% | 58.54% | 61.33% | 121.33% |

The following are our observations on the MRR and recall results:

  • Overall shot-level retrieval outperforms the video-level retrieval baseline across both MRR and recall metrics.
  • Raw text has lower MRR and recall scores than embedding-based search on both video and shot level. All three semantic methods show improvement in MRR and recall.
  • Combining semantic (Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal) yields the best improvement across video MRR, shot MRR, and video recall metrics.

The following table summarizes the top three retrieval methods from the short videos dataset (% improvement over baseline).

| Methods | Video-level MRR vs. video-level baseline MRR | Shot-level MRR vs. video-level baseline MRR | Video-level Recall@top10 vs. video-level baseline Recall@top10 | Shot-level Recall@top10 vs. video-level baseline Recall@top10 |
|---|---|---|---|---|
| Raw text: Amazon Transcribe + Amazon Rekognition | Baseline | N/A | Baseline | N/A |
| Semantic: Amazon Titan Multimodal | 226.67% | 226.67% | 373.57% | 382.61% |
| Semantic: Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal | 100.00% | 60.00% | 299.28% | 314.29% |
| Semantic: Amazon Transcribe + Amazon Titan Multimodal | 53.33% | 53.33% | 307.21% | 312.77% |

We made the following observations on the MRR and recall results:

  • Encoding the videos using the Amazon Titan Multimodal Embeddings model alone yields the best result compared to adding just Amazon Transcribe, Amazon Transcribe + Rekognition, or Amazon Transcribe + Amazon Rekognition + Amazon Titan Multimodal Embeddings (due to lack of dialogue and scene changes in these short videos)
  • All semantic retrieval methods (2, 3, and 4) show at least a 53% improvement over the baseline
  • Although Amazon Titan Multimodal alone works well for this data, other metadata such as Amazon Transcribe output, Amazon Rekognition output, and pre-existing human labels can be combined with Amazon Titan Multimodal Embeddings as semantic representations to improve performance, depending on the nature of the data

Qualitative evaluation

We evaluated the quantitative results from our pipeline to find matches with the ground truth shared by Veritone. However, there could be other relevant videos in the retrieved results from our pipeline that are not part of the ground truth, which could further improve some of these metrics. Therefore, to qualitatively evaluate our pipeline, we used an A/B testing framework, where a user can view results from two anonymized methods (the metadata used by the method is not exposed to reduce any bias) and rate which results were more aligned with the query entered.

The aggregated results across the method comparisons were used to calculate the win rate and select the final embedding method for the search pipeline.

The following methods were shortlisted based on Veritone’s interest in order to reduce the number of comparison methods.

| Method Name (Exposed to User) | Retrieval Type (Not Exposed to User) |
|---|---|
| Method E | Semantic Amazon Transcribe retrieval results only |
| Method F | Fusion of semantic Amazon Transcribe + Amazon Titan Multimodal retrieval results |
| Method G | Fusion of semantic Amazon Transcribe + semantic Amazon Rekognition + Amazon Titan Multimodal retrieval results |

The following table summarizes the quantitative results and winning rate.

| Experiment | Winning Method (Count of Queries) | | | |
|---|---|---|---|---|
| | Method E | Method F | Method G | Tie |
| Method E vs. Method F | 10% | 85% | – | 5% |
| Method F vs. Method G | – | 30% | 60% | 10% |

Based on the results, we see that adding Amazon Titan Multimodal Embeddings to the transcription method (Method F) performs better than using semantic transcription retrieval alone (Method E). Adding Amazon Rekognition-based retrieval results (Method G) improves over Method F.

Takeaways

We had the following key takeaways:

  • Enabling vector search indexing and retrieval instead of relying only on text matching against AI-generated text metadata improves search recall.
  • Indexing and retrieving videos at the shot level can boost performance and improve customer experience. Users can efficiently find precise clips matching their query rather than sifting through entire videos.
  • Multimodal representation of queries and metadata through models trained on both images and text has better performance than single-modality representation from models trained on text alone.
  • The fusion of text and visual cues significantly improves search relevance by capturing the semantic alignment between queries and clips more accurately and better reflecting the user’s search intent.
  • Enabling direct human comparison between retrieval models through A/B testing allows for inspecting and selecting the optimal approach. This can boost the confidence to ship new features or search methods to production.

Security best practices

We recommend the following security guidelines for building secure applications on AWS:

Conclusion

In this post, we showed how Veritone upgraded their classical search pipelines with Amazon Titan Multimodal Embeddings in Amazon Bedrock through a few API calls. We showed how videos can be indexed in different representations, text vs. text embeddings vs. multimodal embeddings, and how they can be analyzed to produce a robust search based on the data characteristics and use case.

If you are interested in working with the AWS Generative AI Innovation Center, please reach out to the GenAIIC.


About the Authors

Tim Camara is a Senior Product Manager on the Digital Media Hub team at Veritone. With over 15 years of experience across a range of technologies and industries, he’s focused on finding ways to use emerging technologies to improve customer experiences.

Mohamad Al Jazaery is an Applied Scientist at the Generative AI Innovation Center. As a scientist and tech lead, he helps AWS customers envision and build GenAI solutions to address their business challenges in different domains such as Media and Entertainment, Finance, and Lifestyle.


Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.

Divya Bhargavi is a Senior Applied Scientist Lead at the Generative AI Innovation Center, where she solves high-value business problems for AWS customers using generative AI methods. She works on image/video understanding and retrieval, knowledge graph augmented large language models, and personalized advertising use cases.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Information extraction with LLMs using Amazon SageMaker JumpStart

Large language models (LLMs) have unlocked new possibilities for extracting information from unstructured text data. Although much of the current excitement is around LLMs for generative AI tasks, many of the key use cases that you might want to solve have not fundamentally changed. Tasks such as routing support tickets, recognizing customer intents from a chatbot conversation session, extracting key entities from contracts, invoices, and other types of documents, as well as analyzing customer feedback are examples of long-standing needs.

What makes LLMs so transformative, however, is their ability to achieve state-of-the-art results on these common tasks with minimal data and simple prompting, and their ability to multitask. Rather than requiring extensive feature engineering and dataset labeling, LLMs can be fine-tuned on small amounts of domain-specific data to quickly adapt to new use cases. By handling most of the heavy lifting, services like Amazon SageMaker JumpStart remove the complexity of fine-tuning and deploying these models.

SageMaker JumpStart is a machine learning (ML) hub with foundation models (FMs), built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation.

This post walks through examples of building information extraction use cases by combining LLMs with prompt engineering and frameworks such as LangChain. We also examine the uplift from fine-tuning an LLM for a specific extractive task. Whether you’re looking to classify documents, extract keywords, detect and redact personally identifiable information (PII), or parse semantic relationships, you can start ideating your use case and use LLMs for your natural language processing (NLP).

Prompt engineering

Prompt engineering enables you to instruct LLMs to generate suggestions, explanations, or completions of text in an interactive way. Prompt engineering relies on large pretrained language models that have been trained on massive amounts of text data. At first glance, there might not be one best way to design a prompt, and different LLMs might work better or worse with different prompts. Therefore, prompts are often iteratively refined through trial and error to produce better results. As a starting point, you can refer to the model documentation, which typically includes recommendations and best practices for prompting the model, and to the examples provided in SageMaker JumpStart.

In the following sections, we focus on the prompt engineering techniques required for extractive use cases. They help unlock the power of LLMs by providing helpful constraints and guide the model toward its intended behavior. We discuss the following use cases:

  • Sensitive information detection and redaction
  • Entity extraction; generic and specific entities with structured formats
  • Classification, using prompt engineering and fine-tuning

Before we explore these use cases, we need to set up our development environment.

Prerequisites

The source code accompanying this example is available in this GitHub repo. It consists of several Jupyter notebooks and a utils.py module. The utils.py module houses the shared code that is used throughout the notebooks.

The simplest way to run this example is by using Amazon SageMaker Studio with the Data Science 3.0 kernel or an Amazon SageMaker notebook instance with the conda_python3 kernel. For the instance type, you can choose the default settings.

In this example, we use ml.g5.2xlarge and ml.g5.48xlarge instances for endpoint usage, and ml.g5.24xlarge for training job usage. Use the Service Quotas console to make sure you have sufficient quotas for these instances in the Region where you’re running this example.

We use Jupyter notebooks throughout this post. Before we explore the examples, it’s crucial to confirm that you have the latest version of the SageMaker Python SDK. This SDK offers a user-friendly interface for training and deploying models on SageMaker. To install or upgrade to the latest version, run the following command in the first cell of your Jupyter notebook:

%pip install --quiet --upgrade sagemaker

Deploy Llama-2-70b-chat using SageMaker JumpStart

There are many LLMs available in SageMaker JumpStart to choose from. In this example, we use Llama-2-70b-chat, but you might use a different model depending on your use case. To explore the list of SageMaker JumpStart models, see JumpStart Available Model Table.

To deploy a model from SageMaker JumpStart, you can use either APIs, as demonstrated in this post, or use the SageMaker Studio UI. After the model is deployed, you can test it by asking a question from the model:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
endpoint_name = model_id
instance_type = "ml.g5.48xlarge"

# Resolve the SageMaker execution role used to deploy the endpoint.
role_arn = sagemaker.get_execution_role()

model = JumpStartModel(
    model_id=model_id, model_version=model_version, role=role_arn
)
predictor = model.deploy(
    endpoint_name=endpoint_name, instance_type=instance_type
)

If no instance_type is provided, the SageMaker JumpStart SDK will select the default type. In this example, you explicitly set the instance type to ml.g5.48xlarge.

Sensitive data extraction and redaction

LLMs show promise for extracting sensitive information for redaction. This can be done with prompt engineering techniques, such as priming the model to understand the redaction task and providing examples that can improve performance. For example, priming the model by stating “redact sensitive information” and demonstrating a few examples of redacting names, dates, and locations can help the LLM infer the rules of the task.

More in-depth forms of priming the model include providing positive and negative examples, demonstrations of common errors, and in-context learning to teach the nuances of proper redaction. With careful prompt design, LLMs can learn to redact information while maintaining readability and utility of the document. In real-life applications, however, additional evaluation is often necessary to improve the reliability and safety of LLMs for handling confidential data. This is often achieved through the inclusion of human review, because no automated approach is entirely foolproof.

The following are a few examples of using prompt engineering for the extraction and redaction of PII. The prompt consists of multiple parts: the report_sample, which includes the text that you want to identify and mask the PII data within, and instructions (or guidance) passed on to the model as the system message.

report_sample = """
This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer Alice from US placed an order with total of $2190. Following her, on Nov 7th, Bob from UK ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with Jane from Australia, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer John, located in Singapore, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.
"""

system = """
Your task is to precisely identify Personally Identifiable Information (PII) and identifiable details, including name, address, and the person's country, in the provided text. Replace these details with exactly four asterisks (****) as the masking characters. Use '****' for masking text of any length. Only write the masked text in the response.
"""

In the following example, you define the llama2_chat function that encapsulates sending the prompt to the Llama-2 model. You reuse this function throughout the examples.

def llama2_chat(
    predictor,
    user,
    temperature=0.1,
    max_tokens=512,
    top_p=0.9,
    system=None,
):
    """Constructs the payload for the llama2 model, sends it to the endpoint,
    and returns the response."""

    inputs = []
    if system:
        inputs.append({"role": "system", "content": system})
    if user:
        inputs.append({"role": "user", "content": user})

    payload = {
        "inputs": [inputs],
        "parameters": {
            "max_new_tokens": max_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    return response

Use the following code to call the function, passing your parameters:

response = utils.llama2_chat(
    predictor,
    system=system,
    user=report_sample,
)
print(utils.llama2_parse_output(response))

You get the following output:

This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer ***** from ***** placed an order with total of $2190. Following her, on Nov 7th, ***** from ***** ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with ***** from *****, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer *****, located in *****, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.

Entity extraction

Entity extraction is the process of identifying and extracting key information entities from unstructured text. This technique helps create structured data from unstructured text and provides useful contextual information for many downstream NLP tasks. Common applications for entity extraction include building a knowledge base, extracting metadata to use for personalization or search, and improving user inputs and conversation understanding within chatbots.

You can effectively use LLMs for entity extraction tasks through careful prompt engineering. With a few examples of extracting entities from text, explanatory prompts, and the desired output format, the model can learn to identify and extract entities such as people, organizations, and locations from new input texts. In the following examples, we demonstrate a few different entity extraction tasks ranging from simpler to more complex using prompt engineering with the Llama-2-70b-chat model you deployed earlier.

Extract generic entities

Use the following code to extract generic entities:

email_sample = "Hello, My name is John. Your AnyCompany Financial Services, LLC credit card account 1111-0000-1111-0008 has a minimum payment of $24.53 that is due by July 31st. Based on your autopay settings, we will withdraw your payment on the due date from your bank account number XXXXXX1111 with the routing number XXXXX0000. Customer feedback for Sunshine Spa, 123 Main St, Anywhere. Send comments to Alice at alice_aa@anycompany.com and Bob at bob_bb@anycompany.com. I enjoyed visiting the spa. It was very comfortable but it was also very expensive. The amenities were ok but the service made the spa a great experience."

system = """
Your task is to precisely identify any email addresses from the given text and then write them, one per line. Remember to ONLY write an email address if it's precisely spelled out in the input text. If there are no email addresses in the text, write "N/A". DO NOT write anything else.
"""

result = utils.llama2_chat(predictor, system=system, user=email_sample)
print(utils.llama2_parse_output(result))

You get the following output:

alice_aa@anycompany.com
bob_bb@anycompany.com

Extract specific entities in a structured format

Using the previous sample report, you can extract more complex information in a structured manner. This time, you provide a JSON template for the model to use and return the output in JSON format.

With LLMs generating JSON documents as output, you can effortlessly parse them into a range of other data structures. This enables simple conversions to dictionaries, YAML, or even Pydantic models using third-party libraries, such as LangChain’s PydanticOutputParser. You can see the implementation in the GitHub repo.

import json

system = """
Your task is to precisely extract information from the text provided, and format it according to the given JSON schema delimited with triple backticks. Only include the JSON output in your response. If a specific field has no available data, indicate this by writing `null` as the value for that field in the output JSON. In cases where there is no data available at all, return an empty JSON object. Avoid including any other statements in the response.

```
{json_schema}
```
"""

json_schema = """
{
    "orders":
        [
            {
                "name": "<customer_name>",
                "location": "<customer_location>",
                "order_date": "<order_date in format YYYY-MM-DD>",
                "order_total": "<order_total>",
                "order_items": [
                    {
                        "item_name": "<item_name>",
                        "item_quantity": "<item_quantity>"
                    }
                ]
            }
        ]
}
"""


response = utils.llama2_chat(
    predictor,
    system=system.format(json_schema=json_schema),
    user=report_sample,
)
json_str = utils.llama2_parse_output(response)
print(json_str)

You get the following output:

{
    "orders": [
        {
            "name": "Alice",
            "location": "US",
            "order_date": "2023-11-05",
            "order_total": 2190,
            "order_items": [
                {
                    "item_name": null,
                    "item_quantity": null
                }
            ]
        },
        {
            "name": "Bob",
            "location": "UK",
            "order_date": "2023-11-07",
            "order_total": 1000,
            "order_items": [
                {
                    "item_name": "ergonomic keyboards",
                    "item_quantity": 25
                }
            ]
        },
        {
            "name": "Jane",
            "location": "Australia",
            "order_date": "2023-11-12",
            "order_total": 9000,
            "order_items": [
                {
                    "item_name": "high-definition monitors",
                    "item_quantity": 10
                }
            ]
        },
        {
            "name": "John",
            "location": "Singapore",
            "order_date": "2023-11-30",
            "order_total": 3600,
            "order_items": [
                {
                    "item_name": "USB-C docking stations",
                    "item_quantity": 15
                }
            ]
        }
    ]
}
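
Building on the PydanticOutputParser idea mentioned earlier, the following is a minimal sketch of parsing this JSON output into typed objects, assuming Pydantic v2 is installed; the model classes are hypothetical and simply mirror the JSON schema used in the prompt.

from typing import List, Optional

from pydantic import BaseModel

# Hypothetical models mirroring the JSON schema from the prompt
class OrderItem(BaseModel):
    item_name: Optional[str] = None
    item_quantity: Optional[int] = None

class Order(BaseModel):
    name: Optional[str] = None
    location: Optional[str] = None
    order_date: Optional[str] = None
    order_total: Optional[float] = None
    order_items: List[OrderItem] = []

class Orders(BaseModel):
    orders: List[Order] = []

# json_str holds the model output from the previous step
orders = Orders.model_validate_json(json_str)
print(orders.orders[1].order_items[0].item_name)  # "ergonomic keyboards"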

Classification using prompt engineering

LLMs can be a useful tool for information extraction tasks such as text classification. Common applications include classifying the intents of user interactions via channels such as email, chatbots, voice, and others, or categorizing documents to route their requests to downstream systems. The initial step involves identifying the intent or class of the user’s request or the document. These intents or classes could take many forms—from short single words to thousands of hierarchical classes and sub-classes.

In the following examples, we demonstrate prompt engineering on synthetic conversation data to extract intents. Additionally, we show how pre-trained models can be assessed to determine if fine-tuning is needed.

Let’s start with the following example. You have a list of customer interactions with an imaginary health and life insurance company. To start, use the Llama-2-70b-chat model you deployed in the previous section:

inference_instance_type = "ml.g5.48xlarge"

# Llama-2-70b chat
model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
endpoint_name = model_id

predictor = utils.get_predictor(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
    inference_instance_type=inference_instance_type,
)

The get_predictor function is a helper function that creates a predictor object from a model ID and version. If the specified endpoint doesn’t exist, it creates a new endpoint and deploys the model. If the endpoint already exists, it reuses the existing endpoint.
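
The following is one possible sketch of such a helper using the SageMaker Python SDK; the actual utils.get_predictor implementation in the repository may differ.

import boto3
from botocore.exceptions import ClientError
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.predictor import retrieve_default

def get_predictor(endpoint_name, model_id, model_version, inference_instance_type):
    sm_client = boto3.client("sagemaker")
    try:
        # Reuse the endpoint if it already exists
        sm_client.describe_endpoint(EndpointName=endpoint_name)
        return retrieve_default(
            endpoint_name=endpoint_name,
            model_id=model_id,
            model_version=model_version,
        )
    except ClientError:
        # Otherwise deploy the JumpStart model to a new endpoint
        model = JumpStartModel(model_id=model_id, model_version=model_version)
        return model.deploy(
            initial_instance_count=1,
            instance_type=inference_instance_type,
            endpoint_name=endpoint_name,
            accept_eula=True,  # Llama 2 models require accepting the EULA
        )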

customer_interactions = [
    """Hello, I've recently moved to a new state and I need to update my address for my health insurance policy.
Can you assist me with that?
""",
    """Good afternoon! I'm interested in adding dental coverage to my existing health plan.
Could you provide me the options and prices?
""",
    """I had a disappointing experience with the customer service yesterday regarding my claim.
I want to file a formal complaint and speak with a supervisor.
""",
]

system = """
Your task is to identify the customer intent from their interactions with the support bot in the provided text. The intent output must not be more than 4 words. If the intent is not clear, please provide a fallback intent of "unknown".
"""

def get_intent(system, customer_interactions):
    for customer_interaction in customer_interactions:
        response = utils.llama2_chat(
            predictor,
            system=system,
            user=customer_interaction,
        )
        content = utils.llama2_parse_output(response)
        print(content)
get_intent(system, customer_interactions)

You get the following output:

Update Address
Intent: Informational
Intent: Escalate issue

Looking at the output, these seem reasonable as the intents. However, the format and style of the intents can vary depending on the language model. Another limitation of this approach is that intents are not confined to a predefined list, which means the language model might generate and word the intents differently each time you run it.

To address this, you can use the in-context learning technique in prompt engineering to steer the model towards selecting from a predefined set of intents, or class labels, that you provide. In the following example, alongside the customer conversation, you include a list of potential intents and ask the model to choose from this list:

system = """
Your task is to identify the intent from the customer interaction with the support bot. Select from the intents provided in the following list delimited with ####. If the intent is not clear, please provide a fallback intent of "unknown". ONLY write the intent.

####
- information change
- add coverage
- complaint
- portal navigation
- free product upgrade
####
"""

get_intent(system, customer_interactions)

You get the following output:

information change
add coverage
complaint

Reviewing the results, it’s evident that the language model performs well in selecting the appropriate intent in the desired format.

Sub-intents and intent trees

If you make the preceding scenario more complex, as in many real-life use cases, intents can span a large number of categories and be organized hierarchically, which makes the classification task more challenging for the model. Therefore, you can further improve the prompt by providing an example to the model, a technique also known as n-shot learning, k-shot learning, or few-shot learning.

The following is the intent tree to use in this example. You can find its source code in the utils.py file in the code repository.

INTENTS = [
    {
        "main_intent": "profile_update",
        "sub_intents": [
            "contact_info",
            "payment_info",
            "members",
        ],
    },
    {
        "main_intent": "health_cover",
        "sub_intents": [
            "add_extras",
            "add_hospital",
            "remove_extras",
            "remove_hospital",
            "new_policy",
            "cancel_policy",
        ],
    },
    {
        "main_intent": "life_cover",
        "sub_intents": [
            "new_policy",
            "cancel_policy",
            "beneficiary_info",
        ],
    },
    {
        "main_intent": "customer_retention",
        "sub_intents": [
            "complaint",
            "escalation",
            "free_product_upgrade",
        ],
    },
    {
        "main_intent": "technical_support",
        "sub_intents": [
            "portal_navigation",
            "login_issues",
        ],
    },
]

Using the following prompt (which includes the intents), you can ask the model to pick from the provided list of intents:

system = """
Your task is to identify the intent from the customer interaction with the support bot. Identify the intent of the provided text using the list of provided intent tree delimited with ####. The intents are defined in classes and sub-classes. Write the intention with this format: <main-intent>:<sub-intent>. ONLY write the intent.

OUTPUT EXAMPLE:
profile_update:contact_info

OUTPUT EXAMPLE:
customer_retention:complaint

####
{intents}
####
"""

intents_json = json.dumps(utils.INTENTS, indent=4)
system = system.format(intents=intents_json)
get_intent(system, customer_interactions)

You get the following output:

profile_update:contact_info
health_cover:add_extras
customer_retention:complaint

Although LLMs can often correctly identify intent from a list of possible intents, they may sometimes produce additional outputs or fail to adhere to the exact intent structure and output schema. There are also scenarios where intents are not as straightforward as they initially seem or are highly specific to a business domain context that the model doesn’t fully comprehend.

As an example, in the following sample interaction, the customer ultimately wants to change their coverage, but their immediate question and the intent of the interaction is to get help with portal navigation. Similarly, in the second interaction, the more appropriate intent is “free product upgrade,” which the customer is requesting. However, the model is unable to detect these nuanced intents as accurately as desired.

customer_interactions = [
    "I want to change my coverage plan. But I'm not seeing where to do this on the online website. Could you please point me to it?",
    "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
]

get_intent(system, customer_interactions)

You get the following output:

profile_update:contact_info
customer_retention:complaint

Prompt engineering can often successfully extract specific intents from text. However, for some use cases, relying solely on prompt engineering has limitations. Scenarios where additional techniques beyond prompt engineering may be needed include:

  • Conversations with a large number of intent classes or long contexts that exceed the language model’s context window size, or that make queries more computationally expensive
  • Desired outputs in specific formats that the model struggles to adopt
  • Enhancing model understanding of the domain or task to boost performance

In the following section, we demonstrate how fine-tuning can boost the accuracy of the LLM for the intent classification task attempted earlier.

Fine-tuning an LLM for classification

The following sections detail the fine-tuning process of the FlanT5-XL and Mistral 7B models using SageMaker JumpStart. We use the FlanT5-XL and Mistral 7B models to compare their accuracy. Both models are significantly smaller than Llama-2-70b-chat. The goal is to determine whether smaller models can achieve state-of-the-art performance on specific tasks after they’re fine-tuned.

We have fine-tuned both the Mistral 7B and FlanT5-XL models. You can see the details of the Mistral 7B fine-tuning in the code repository. In the following sections, we outline the steps for fine-tuning and evaluating FlanT5-XL.

Initially, you deploy (or reuse) the FlanT5 endpoint as the base_predictor, which represents the base model prior to any fine-tuning. Subsequently, you assess the performance of the models by comparing them after the fine-tuning process.

inference_instance_type = "ml.g5.2xlarge"

model_id, model_version = "huggingface-text2text-flan-t5-xl", "2.0.0"
base_endpoint_name = model_id

base_predictor = utils.get_predictor(
    endpoint_name=base_endpoint_name,
    model_id=model_id,
    model_version=model_version,
    inference_instance_type=inference_instance_type,
)

Prepare training data for fine-tuning

Preparing for fine-tuning requires organizing several files, including the dataset and template files. The dataset is structured to align with the required input format for fine-tuning. For example, each record in our training dataset adheres to the following structure:

{"query": "customer query", "response": "main-intent:sub-intent"}

In this example, you use a synthesized dataset comprising customer interactions with a fictional insurance company. To learn more about the data and gain access to it, refer to the source code.

intent_dataset_file = "data/intent_dataset.jsonl"
intent_dataset_train_file = "data/intent_dataset_train.jsonl"
intent_dataset_test_file = "data/intent_dataset_test.jsonl"
ft_template_file = "data/template.json"
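
As a reference point, the following is a minimal sketch of how the dataset could be split 90/10 into the train and test files listed above; the repository may implement the split differently.

import json
import random

random.seed(42)
with open(intent_dataset_file) as f:
    records = [json.loads(line) for line in f if line.strip()]

random.shuffle(records)
split = int(len(records) * 0.9)  # 90% for training (including validation), 10% for testing

for path, subset in [
    (intent_dataset_train_file, records[:split]),
    (intent_dataset_test_file, records[split:]),
]:
    with open(path, "w") as out:
        for record in subset:
            out.write(json.dumps(record) + "\n")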

The following is the prompt for fine-tuning. The prompt has the query parameter, which is set during the fine-tuning using the SageMaker JumpStart SDK.

FT_PROMPT = """Identify the intent classes from the given user query, delimited with ####. Intents are categorized into two levels: main intent and sub intent. In your response, provide only ONE set of main and sub intents that is most relevant to the query. Write your response ONLY in this format <main-intent>:<sub-intent>. ONLY Write the intention.

OUTPUT EXAMPLE:
profile_update:contact_info

OUTPUT EXAMPLE:
technical_support:portal_navigation

#### QUERY:
{query}
####
"""

The following creates a template file that will be used by the SageMaker JumpStart framework to fine-tune the model. The template has two fields, prompt and completion. These fields are used to pass labeled data to the model for the fine-tuning process.

template = {
    "prompt": utils.FT_PROMPT,
    "completion": "{response}",
}

with open(ft_template_file, "w") as f:
    json.dump(template, f)

The training data is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, setting the stage for the actual fine-tuning process.

train_data_location = utils.upload_train_and_template_to_s3(
    bucket_prefix="intent_dataset_flant5",
    train_path=intent_dataset_train_file,
    template_path=ft_template_file,
)
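
For illustration, the following is a possible sketch of this upload helper using the SageMaker Python SDK; the actual utils.upload_train_and_template_to_s3 function may differ.

import sagemaker

def upload_train_and_template_to_s3(bucket_prefix, train_path, template_path):
    session = sagemaker.Session()
    bucket = session.default_bucket()
    # Upload the training JSONL and template.json under the same S3 prefix
    for local_path in (train_path, template_path):
        session.upload_data(path=local_path, bucket=bucket, key_prefix=bucket_prefix)
    # Fine-tuning expects the prefix that contains both files
    return f"s3://{bucket}/{bucket_prefix}"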

Fine-tune the model

Configure the JumpStartEstimator, specifying your chosen model and other parameters like instance type and hyperparameters (in this example, you use five epochs for the training). This estimator drives the fine-tuning process.

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    disable_output_compression=True,
    instance_type="ml.g5.24xlarge",
    role=utils.get_role_arn(),
)

estimator.set_hyperparameters(
    instruction_tuned="True", epochs="5", max_input_length="1024"
)

estimator.fit({"training": train_data_location})

Deploy the fine-tuned model

After fine-tuning, deploy the fine-tuned model:

finetuned_endpoint_name = "flan-t5-xl-ft-infoext"
finetuned_model_name = finetuned_endpoint_name
# Deploying the finetuned model to an endpoint
finetuned_predictor = estimator.deploy(
    endpoint_name=finetuned_endpoint_name,
    model_name=finetuned_model_name,
)

Use the following code to test the fine-tuned model against its base model with ambiguous queries, which you saw in the previous section:

ambiguous_queries = [
    {
        "query": "I want to change my coverage plan. But I'm not seeing where to do this on the online site. Could you please show me how?",
        "main_intent": "techincal_support",
        "sub_intent": "portal_navigation",
    },
    {
        "query": "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
        "main_intent": "customer_retention",
        "sub_intent": "free_product_upgrade",
    },
]
for query in ambiguous_queries:
    question = query["query"]
    print("query:", question, "n")
    print(
        "expected intent:  ", f"{query['main_intent']}:{query['sub_intent']}"
    )

    prompt = utils.FT_PROMPT.format(query=question)
    response = utils.flant5(base_predictor, user=prompt, max_tokens=13)
    print("base model:  ", utils.parse_output(response))

    response = utils.flant5(finetuned_predictor, user=prompt, max_tokens=13)
    print("finetuned model:  ", utils.parse_output(response))
    print("-" * 80)

You get the following output:

query: I want to change my coverage plan. But I'm not seeing where to do this on the online site. Could you please show me how?
expected intent:   technical_support:portal_navigation
base model:   main_intent>:sub_intent> change
finetuned model:   technical_support:portal_navigation
--------------------------------------------------------------------------------
query: I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?

expected intent:   customer_retention:free_product_upgrade
base model:   main_intent>:sub_intent> cancel
finetuned model:   customer_retention:free_product_upgrade
--------------------------------------------------------------------------------

As shown in this example, the fine-tuned model is able to classify the ambiguous queries correctly.

In evaluations, fine-tuned models performed better in identifying the correct class for both clear and ambiguous intents. The following section details the benchmark’s performance overall, and against each intent.
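
For reference, the following is a minimal sketch of how accuracy over the held-out test file could be computed, reusing the helpers shown earlier; the repository’s evaluation code may differ.

def evaluate_accuracy(predictor, test_file):
    correct = total = 0
    with open(test_file) as f:
        for line in f:
            example = json.loads(line)
            prompt = utils.FT_PROMPT.format(query=example["query"])
            response = utils.flant5(predictor, user=prompt, max_tokens=13)
            prediction = utils.parse_output(response).strip()
            # The label is stored as "main-intent:sub-intent" in the dataset
            correct += int(prediction == example["response"])
            total += 1
    return correct / total

print("fine-tuned accuracy:", evaluate_accuracy(finetuned_predictor, intent_dataset_test_file))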

Performance comparisons and considerations

In this section, we have gathered the evaluation results and performance benchmarks for each model, before and after fine-tuning, as well as a comparison between the prompt engineering and fine-tuning the LLM. The dataset consists of 7,824 examples, with a 90% split for training (including validation) and 10% for testing.

Model | Overall Accuracy | Fine-tuning Duration (minutes) | Notes
Mistral-7b (fine-tuned five epochs, without classes in the prompt) | 98.97% | 720 | Given Mistral-7b’s nature as a text generation model, parsing its output to extract intent can be challenging due to its tendency to repeat characters and generate additional characters. Performance improved with more epochs: 98% accuracy for five epochs compared to 92% for one epoch.
Flan-T5-XL (fine-tuned one epoch, without classes in the prompt) | 98.46% | 150 | Marginal improvement in accuracy with increased epochs: from 97.5% (one epoch) to 98.46% (five epochs).
Llama-2-70b-chat (with classes in the prompt) | 78.42% | N/A | Low accuracy in ambiguous scenarios.
Llama-2-70b-chat (without classes in the prompt) | 10.85% | N/A |
Flan-T5-XL (base model, without classes in the prompt) | 0.0% | N/A | Unable to identify any of the intent classes with the expected format.
Mistral-7b (base model, without classes in the prompt) | 0.0% | N/A | Unable to identify any of the intent classes with the expected format.

The following table contains a breakdown of models’ accuracy for each intent class.

Main Intent | Sub-intent | Example Count | Llama2-70b (without classes in prompt) | Llama2-70b (with classes in prompt) | Flan-T5-XL (fine-tuned) | Mistral-7b (fine-tuned)
Customer Retention | Complaint | 63 | 7.94% | 44.44% | 98.41% | 98.41%
Customer Retention | Escalation | 49 | 91.84% | 100% | 100% | 100%
Customer Retention | Free Product Upgrade | 50 | 0.00% | 64.00% | 100% | 100%
Health Cover | Add Extras | 38 | 0.00% | 100% | 97.37% | 100%
Health Cover | Add Hospital | 44 | 0.00% | 81.82% | 100% | 97.73%
Health Cover | Cancel Policy | 43 | 0.00% | 100% | 100% | 97.67%
Health Cover | New Policy | 41 | 0.00% | 82.93% | 100% | 100%
Health Cover | Remove Extras | 47 | 0.00% | 85.11% | 100% | 100%
Health Cover | Remove Hospital | 53 | 0.00% | 84.90% | 100% | 100%
Life Cover | Beneficiary Info | 45 | 0.00% | 100% | 97.78% | 97.78%
Life Cover | Cancel Policy | 47 | 0.00% | 55.32% | 100% | 100%
Life Cover | New Policy | 40 | 0.00% | 90.00% | 92.50% | 100%
Profile Update | Contact Info | 45 | 35.56% | 95.56% | 95.56% | 95.56%
Profile Update | Members | 52 | 0.00% | 36.54% | 98.08% | 98.08%
Profile Update | Payment Info | 47 | 40.43% | 97.87% | 100% | 100%
Technical Support | Login Issues | 39 | 0.00% | 92.31% | 97.44% | 100%
Technical Support | Portal Navigation | 40 | 0.00% | 45.00% | 95.00% | 97.50%

This comparative analysis illustrates the trade-offs between fine-tuning time and model accuracy. It highlights the ability of models like Mistral-7b and FlanT5-XL to achieve higher classification accuracy through fine-tuning. Additionally, it shows how smaller models can match or surpass the performance of larger models on specific tasks when fine-tuned, contrasted with using prompt engineering alone on the larger models.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the SageMaker endpoints, configuration, and models.
  2. Delete the S3 bucket created for this example.
  3. Delete the SageMaker notebook instance (if you used one to run this example).

Summary

Large language models have revolutionized information extraction from unstructured text data. These models excel in tasks such as classifying information and extracting key entities from various documents, achieving state-of-the-art results with minimal data.

This post demonstrated the use of large language models for information extraction through prompt engineering and fine-tuning. While effective, relying solely on prompt engineering can have limitations for complex tasks that require rigid output formats or a large number of classes. In these scenarios, fine-tuning even smaller models on domain-specific data can significantly improve performance beyond what prompt engineering alone can achieve.

The post included practical examples highlighting how fine-tuned smaller models can surpass prompt engineering with larger models for such complex use cases. Although prompt engineering is a good starting point for simpler use cases, fine-tuning offers a more robust solution for complex information extraction tasks, ensuring higher accuracy and adaptability to specific use cases. SageMaker JumpStart tools and services facilitate this process, making it accessible for individuals and teams across all levels of ML expertise.

Additional reading

You can read more on using SageMaker JumpStart for intelligent document processing, fine-tuning, and evaluation of LLMs in the following resources:


About the Authors

Pooya Vahidi is a Senior Solutions Architect at AWS, passionate about computer science, artificial intelligence, and cloud computing. As an AI professional, he is an active member of the AWS AI/ML Area-of-Depth team. With a background spanning over two decades of expertise in leading the architecture and engineering of large-scale solutions, he helps customers on their transformative journeys through cloud and AI/ML technologies.

Dr. Romina Sharifpour is a Senior Machine Learning and Artificial Intelligence Solutions Architect at Amazon Web Services (AWS). She has spent over 10 years leading the design and implementation of innovative end-to-end solutions enabled by advancements in ML and AI. Romina’s areas of interest are natural language processing, large language models, and MLOps.

Read More

NVIDIA DGX SuperPOD to Power U.S. Government Generative AI

NVIDIA DGX SuperPOD to Power U.S. Government Generative AI

In support of President Biden’s executive order on AI, the U.S. government will use an NVIDIA DGX SuperPOD to produce generative AI advances in climate science, healthcare and cybersecurity.

The executive order, issued in October, is aimed at ensuring U.S. leadership in AI and managing its risks. MITRE, a nonprofit organization that operates federally funded research and development centers, is implementing a new NVIDIA DGX SuperPOD system that will provide researchers and developers access to massive computing leaps.

The DGX SuperPOD will support MITRE’s Federal AI Sandbox, a platform to improve experimentation with next-generation, AI-enabled applications across federal government agencies.

“The recent executive order on AI encourages federal agencies to reduce barriers for AI adoptions, but agencies often lack the computing environment necessary for experimentation and prototyping,” said Charles Clancy, senior vice president and chief technology officer at MITRE. “Our new Federal AI Sandbox will help level the playing field, making the high-quality compute power needed to train and test custom AI solutions available to any agency.”

The Federal AI Sandbox will deliver federal agencies the computing gains needed to train large language models and other generative AI tools to develop cutting-edge applications.

The NVIDIA DGX SuperPOD system powering the sandbox is capable of an exaflop of 8-bit AI compute, meaning it can perform a quintillion math operations each second to train and deploy custom LLMs and other AI solutions at scale.

The supercomputing initiative comes as the White House recently unveiled plans, which include NVIDIA, for a $110 million partnership to help universities teach AI skills.

Read More

LoftQ: Reimagining LLM fine-tuning with smarter initialization

LoftQ: Reimagining LLM fine-tuning with smarter initialization

This research paper was presented at the 12th International Conference on Learning Representations (opens in new tab) (ICLR 2024), the premier conference dedicated to the advancement of deep learning.

Large language models (LLMs) use extensive datasets and advanced algorithms to generate nuanced, context-sensitive content. However, their development requires substantial computational resources. To address this, we developed LoftQ, an innovative technique that streamlines the fine-tuning process—which is used to adapt pre-trained language models to perform well in specialized applications, such as analyzing medical documents. During fine-tuning, the model undergoes additional training on a smaller, task-specific dataset. This results in improved performance, such as more accurate predictions, better understanding of domain-specific language, and more relevant responses in the context of the specialized area.

LoftQ’s strength lies in its ability to combine quantization and adaptive initialization during fine-tuning. Quantization reduces the precision of model parameters, lowering memory and computation needs. This not only accelerates processing but also reduces power consumption. Adaptive initialization closely aligns the model’s parameters to its optimal pre-trained state, preserving its capabilities while minimizing resource use. Our paper, “LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models,” presented at ICLR 2024, details how this method can help make AI technologies more efficient and sustainable. 

How LoftQ works 

LoftQ builds on the principles of LoRA (opens in new tab) and QLoRA (opens in new tab). LoRA is a method that greatly reduces the number of parameters needed for training, decreasing the memory requirements for fine-tuning. QLoRA is a fine-tuning approach that uses 4-bit quantized, frozen weights and low rank adapters, significantly reducing memory requirements while maintaining high performance. This is illustrated in Table 1, which shows the amount of memory needed for fine-tuning an LLM with 7 billion parameters as well as the memory requirements for LoRA and QLoRA. LoRA achieves a fourfold reduction in memory usage, and QLoRA further reduces it by twofold.

Table 1: This table shows the GPU memory usage for a 7-billion parameter LLM with the following configurations: full fine-tuning on the left, LoRA in the middle, and QLoRA on the right.
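
To make the source of these memory savings concrete, the following is a conceptual sketch of a LoRA-style adapter wrapped around a frozen linear layer; it is for illustration only and is not the Hugging Face PEFT implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Only these small low-rank matrices are trained
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus a scaled low-rank update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling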

Unlike LoRA, QLoRA comes with a tradeoff, where some quality of the pretrained model is sacrificed due to the quantization of weights. LoftQ recognizes this and optimizes the initialization of quantization and low-rank adaptation matrices. That is, LoftQ seeks to identify a combination of a quantized matrix and a low rank matrix such that their sum closely approximates the original pretrained weight. This is done for every matrix that would be adapted in the model.

The LoftQ algorithm alternates between two primary steps. First, it quantizes (simplifies) the weights; then it finds the low-rank factors that best approximate the residual between the pretrained weight and the quantized weight. The process repeats for a few iterations. This method enables the fine-tuning process to start from a more effective initial state, preserving accuracy while using less computational power and much simpler weights.

LoftQ requires a one-time setup to simplify and prepare these weights, allowing a fixed portion of the model’s parameters (e.g., 5 percent) to be adjusted. Once established, this configuration can be repeatedly applied as the model transitions between various tasks and settings.
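
The following is a conceptual sketch of this alternating initialization on a single weight matrix, with a simple min-max quantizer standing in for real N-bit quantization; it is meant only to illustrate the idea, and the official implementation is available in the Hugging Face PEFT library.

import torch

def uniform_quantize(x, bits):
    # Simple min-max uniform quantizer, used only for illustration
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    if scale == 0:
        return x.clone()
    return torch.round((x - lo) / scale) * scale + lo

def loftq_init(weight, rank=16, bits=4, steps=5):
    # Alternate between quantizing the residual and refitting the low-rank factors
    A = torch.zeros(weight.shape[0], rank)
    B = torch.zeros(rank, weight.shape[1])
    for _ in range(steps):
        Q = uniform_quantize(weight - A @ B, bits)  # quantize what the adapters don't cover
        U, S, Vh = torch.linalg.svd(weight - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                  # low-rank factors of the residual
        B = Vh[:rank, :]
    return Q, A, B  # Q + A @ B closely approximates the pretrained weight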

Evaluating LoftQ 

Tests using various types of LLMs, spanning different encoder and decoder configurations and including Llama-2, show that models initialized with LoftQ consistently achieve strong performance, often matching or surpassing those configured with QLoRA.

In practical terms, comparing the performance of LoftQ and QLoRA on different tasks using the Llama-2 model family yields distinct results, which are highlighted in Table 2. For the WikiText-2 dataset, which measures the model’s perplexity (lower is better), and the GSM8K dataset, which tests the model’s ability to solve basic math problems (higher is better), we demonstrate the effectiveness of varying degrees of weight simplification—averaging 3, 2.5, and 2.25 bits per weight. Our paper discusses the results in more detail. 

Table 2. This table compares LoftQ and QLoRA during the fine-tuning of two Llama-2 models on the Wikitext-2 and GSM8K datasets.

Implications and looking forward 

LoftQ promises to advance the field of AI by accelerating research and facilitating the creation of cutting-edge tools while supporting sustainable development. While initially focused on LLMs, LoftQ’s flexible design also supports fine-tuning in other types of models, such as those for vision and speech technologies. As our research progresses, we expect to make further enhancements that will boost performance on downstream tasks. We hope these improvements will lead to broader adoption across various AI applications. We’re excited about the breadth of this technology’s applicability and encourage the AI community to explore its benefits. LoftQ is available as open source through the Hugging Face PEFT library (opens in new tab).

The post LoftQ: Reimagining LLM fine-tuning with smarter initialization appeared first on Microsoft Research.

Read More

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

This paper has been accepted at the Data Problems for Foundation Models workshop at ICLR 2024.
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training…Apple Machine Learning Research

A Mighty Meeting: Generative AI, Cybersecurity Connect at RSA

A Mighty Meeting: Generative AI, Cybersecurity Connect at RSA

Cybersecurity experts at the RSA Conference this week will be on the hunt for ways to secure their operations in the era of generative AI.

They’ll find many of the latest tools use AI and accelerated computing. This intersection of security and AI is coming into focus with collaborations that companies like NVIDIA and its partners will describe at the event.

Data Science for a Data Problem

Machine learning is a great tool for cybersecurity because data is exploding.

“With more devices and users expanding the landscape to defend, cybersecurity has become a data problem; and AI is a data solution,” said David Reber, NVIDIA’s chief security officer.

Today, security analysts can be overwhelmed by a data tsunami. Generative AI can provide security copilots that act like filters, extracting from the firehose flow of information the context and anomalies that need a human expert’s attention.

Generative AI also lets security managers interrogate their data directly instead of setting rules, chasing alerts and tracking dashboards. In the age of AI, security experts will move from a command line to a conversational interface.

AI Puts Security Context on Steroids

This shift takes context-based security to a new level, according to a talk Reber delivered at GTC.

The potential is huge, but it requires work to unearth it. At GTC, Reber encouraged cybersecurity experts to begin working with AI, starting with low-risk use cases to identify and secure gaps.

He also provided suggestions for how to go about securing machine learning processes, saying security experts need to:

  • secure data supply chains,
  • develop tests for securing AI models and datasets across the development lifecycle,
  • use model cards, data cards and software bills of materials to provide AI transparency and reliability,
  • participate in community testing events such as security hackathons, and
  • update policies on how to respond to AI security events.

Foundations for AI Cybersecurity

To give users a leg up, NVIDIA provides NVIDIA Morpheus, a cybersecurity AI framework that filters and classifies large volumes of real-time data. Morpheus, part of the NVIDIA AI Enterprise software suite, lets developers build applications that can detect spear phishing, insider threats and more.

Users can employ Morpheus with NVIDIA NIM and NeMo Retriever, microservices from the NVIDIA API Catalog for rapidly deploying AI. The combination can unlock new use cases, such as reducing from days to seconds the time to find and resolve common software vulnerabilities and exposures, one of many NVIDIA AI workflows.

A new release of NVIDIA DOCA — the software framework for programming NVIDIA BlueField DPUs and NVIDIA ConnectX NICs — provides another foundation for AI security. It now sports updated encryption features for network and storage data.

An Expanding AI Security Ecosystem

At RSA, many companies will show products built on NVIDIA technologies that extend security for the generative AI era, including:

  • AIC will demonstrate Qrypt’s key generation for quantum secure encryption running on a BlueField DPU in an AIC-designed server.
  • Anjuna will discuss how the U.S. Navy is evaluating confidential computing on the Anjuna Seaglass platform with proprietary LLMs running on NVIDIA H100 Tensor Core GPUs.
  • Bloombase will show an updated version of its StoreSafe Storage Firewall powered by Morpheus and BlueField DPUs and new use cases for threat detection and fast, quantum-safe encryption of AI models and data.
  • Check Point Software will show its AI Cloud Protect security solution on BlueField DPUs, Quantum Force security gateways on ConnectX NICs, and Quantum Maestro software on NVIDIA Spectrum switches.
  • Cisco will feature Cisco Hypershield, an AI-native security architecture, to protect against both known and unknown attacks. It will also discuss its expanding partnership with NVIDIA to help customers harness the power of AI.
  • CrowdStrike will show its CrowdStrike Falcon Foundry and CrowdStrike Falcon platform that employs NVIDIA’s GPU-optimized AI software, including NIM microservices.
  • Deloitte will showcase CyberSphere, a cyber operations platform that uses Morpheus to speed detection of cyber threats.
  • Palo Alto Networks will describe its collaboration with NVIDIA on two use cases, a next-generation reference architecture for securing generative AI deployments with NIM and its VM-Series Virtual Next-Generation Firewall, with expanded intelligent traffic offload (ITO), supporting BlueField-3 DPUs.
  • Qrypt will demonstrate its quantum-secure key generation for securing AI workloads running on BlueField-3 DPUs using DOCA.
  • Sygnia will announce the use of BlueField DPUs and Morpheus in Velocity Edge, its new hardware-accelerated MXDR service for the energy and industrial sectors.

They are part of the NVIDIA ecosystem building a new layer of security for generative AI. That community includes more than 20 companies at RSA this week from NVIDIA Inception, a program for cutting-edge startups.

At RSA, Daniel Rohrer, vice president of software product security at NVIDIA, will be part of the keynote panel on AI safety.

In addition, Kevin Deierling, senior vice president of networking at NVIDIA, will share insights on security at the Cloudflare Executive Supper Club. And NVIDIA will participate in an event about women in cybersecurity.

To get started with AI-powered cybersecurity, try a workflow in NVIDIA LaunchPad.

Read More

NVIDIA and Alphabet’s Intrinsic Put Next-Gen Robotics Within Grasp

NVIDIA and Alphabet’s Intrinsic Put Next-Gen Robotics Within Grasp

Intrinsic, a software and AI robotics company at Alphabet, has integrated NVIDIA AI and Isaac platform technologies to advance the complex field of autonomous robotic manipulation.

This week at the Automate trade show, in Chicago, Intrinsic is spotlighting leaps in robotic grasping and industrial scalability assisted by foundation models enabled by NVIDIA Isaac Manipulator, unlocking new value in industrial automation with AI.

NVIDIA unveiled Isaac Manipulator at GTC in March. Isaac Manipulator is a collection of foundation models and modular GPU-accelerated libraries that help industrial automation companies build scalable and repeatable workflows for dynamic manipulation tasks by accelerating AI model training and task reprogramming.

Foundation models are based on a transformer deep learning architecture that allows a neural network to learn by tracking relationships in data. They’re generally trained on huge datasets and can be used to process and understand sensor and robot information as magically as ChatGPT for text. This enables robot perception and decision-making like never before and provides zero-shot learning — the ability to perform tasks without prior examples.

NVIDIA’s collaboration with Intrinsic, a leading robotics software and AI company, demonstrates the potential for a universally applicable robotic-grasping skill to work across grippers, environments and objects.

“For the broader industry, our work with NVIDIA shows how foundation models can have a profound impact, including making today’s processing challenges easier to manage at scale, creating previously infeasible applications, reducing development costs, and increasing flexibility for end users,” said Wendy Tan White, CEO at Intrinsic, in a blog post announcing the collaboration with NVIDIA. (White will deliver a keynote address at Automate about what the rise of AI means for innovation and growth, on Thursday, May 9, at 7 a.m. PT.)

Developing Better Robot Grip With Isaac Manipulator

Grasping has long been a sought-after robotics skill. So far, it has been time-consuming and expensive to program, and difficult to scale. As a result, many repetitive pick-and-place tasks haven’t been handled seamlessly by robots to date.

Simulation is changing that. Enlisting NVIDIA Isaac Sim on the NVIDIA Omniverse platform, Intrinsic generated synthetic data for vacuum grasping using computer-aided design models of sheet metal and suction grippers. This allowed Intrinsic to create a prototype for its customer Trumpf Machine Tools, a leading maker of industrial machine tools.

The prototype uses Intrinsic Flowstate, a developer environment for AI-based robotics solutions, for visualizing processes, associated perception and motion planning. With a workflow that includes Isaac Manipulator, one can generate grasp poses and CUDA-accelerated robot motions, which can first be evaluated in simulation with Isaac Sim — a cost-saving step — before deployment in the real world with the Intrinsic platform.

Under the collaboration, NVIDIA and Intrinsic plan to bring state-of-the-art dexterity and modular AI capabilities for robotic arms, with a robust collection of foundation models and GPU-accelerated libraries to accelerate a greater number of new robotics and automation tasks.

On Tuesday, May 7, at 11 a.m. CT, NVIDIA Senior Research Scientist Adithya Murali and Intrinsic Chief Science Officer Torsten Kroeger will demonstrate the companies’ work in the session “Automating Smart Pick-and-Place With Intrinsic Flowstate and NVIDIA Isaac Manipulator” in the Intrinsic booth 2808 at Automate. Join our speaking sessions at Automate.

Read More