EBSCOlearning scales assessment generation for their online learning content with generative AI

EBSCOlearning offers corporate learning, educational, and career development products and services for businesses, educational institutions, and workforce development organizations. As a division of EBSCO Information Services, EBSCOlearning is committed to enhancing professional development and educational skills.

In this post, we illustrate how EBSCOlearning partnered with AWS Generative AI Innovation Center (GenAIIC) to use the power of generative AI in revolutionizing their learning assessment process. We explore the challenges faced in traditional question-answer (QA) generation and the innovative AI-driven solution developed to address them.

In the rapidly evolving landscape of education and professional development, the ability to effectively assess learners’ understanding of content is crucial. EBSCOlearning, a leader in the realm of online learning, recognized this need and embarked on an ambitious journey to transform their assessment creation process using cutting-edge generative AI technology.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI, and is well positioned to address these types of tasks.

The challenge: Scaling quality assessments

EBSCOlearning’s learning paths—comprising videos, book summaries, and articles—form the backbone of a multitude of educational and professional development programs. However, the company faced a significant hurdle: creating high-quality, multiple-choice questions for these learning paths was a time-consuming and resource-intensive process.

Traditionally, subject matter experts (SMEs) would meticulously craft each question set to be relevant, accurate, and aligned with learning objectives. Although this approach guaranteed quality, it was slow, expensive, and difficult to scale. As EBSCOlearning’s content library continues to grow, so does the need for a more efficient solution.

Enter AI: A promising solution

Recognizing the potential of AI to address this challenge, EBSCOlearning partnered with the GenAIIC to develop an AI-powered question generation system. The goal was ambitious: to create an automated solution that could produce high-quality, multiple-choice questions at scale, while adhering to strict guidelines on bias, safety, relevance, style, tone, meaningfulness, clarity, and diversity, equity, and inclusion (DEI). The QA pairs had to be grounded in the learning content and test different levels of understanding, such as recall, comprehension, and application of knowledge. Additionally, explanations were needed to justify why an answer was correct or incorrect.

The team faced several key challenges:

  • Making sure AI-generated questions matched the quality of human-created ones
  • Developing a system that could handle diverse content types
  • Implementing robust quality control measures
  • Creating a scalable solution that could grow with EBSCOlearning’s needs

Crafting the AI solution

The GenAIIC team developed a sophisticated pipeline using the power of large language models (LLMs), specifically Anthropic’s Claude 3.5 Sonnet in Amazon Bedrock. This pipeline is illustrated in the following figure and consists of several key components: QA generation, multifaceted evaluation, and intelligent revision.

QA generation

The process begins with the QA generation component. This module takes in the learning content—which could be a video transcript, book summary, or article—and generates an initial set of multiple-choice questions using in-context learning.

EBSCOlearning experts and GenAIIC scientists worked together to develop a sophisticated prompt engineering approach using Anthropic’s Claude 3.5 Sonnet model in Amazon Bedrock. To align with EBSCOlearning’s high standards, the prompt includes:

  • Detailed guidelines on what constitutes a high-quality question, covering aspects such as relevance, clarity, difficulty level, and objectivity
  • Instructions to match the conversational style of the original content
  • Directives to include diversity and inclusivity in the language and scenarios used
  • Few-shot examples to enable in-context learning for the AI model

The system aims to generate up to seven questions for each piece of content, each with four answer choices including a correct answer, and detailed explanations for why each answer is correct or incorrect.
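In outline, the generation call might look like the following sketch. The guideline text, prompt wording, and helper names here are illustrative assumptions, not EBSCOlearning's actual prompt; the Bedrock Converse API call requires AWS credentials and Bedrock model access:

```python
# Sketch of the QA-generation step. GUIDELINES and the prompt wording are
# placeholders standing in for the detailed prompt described above.
GUIDELINES = "Questions must be relevant, clear, objective, and inclusive."  # placeholder

def build_qa_prompt(content: str, num_questions: int = 7) -> str:
    """Assemble the generation prompt: guidelines, style directives, few-shot
    slots, and the learning content itself."""
    return (
        f"{GUIDELINES}\n"
        "Match the conversational style of the source material.\n"
        f"Generate up to {num_questions} multiple-choice questions, each with "
        "four answer choices, one correct answer, and per-choice explanations.\n"
        f"<content>\n{content}\n</content>"
    )

def generate_questions(content: str,
                       model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> str:
    """Invoke Claude 3.5 Sonnet through the Amazon Bedrock Converse API."""
    import boto3  # requires AWS credentials with Bedrock access
    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": build_qa_prompt(content)}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

The few-shot examples mentioned above would be appended to the prompt in the same way; they are omitted here for brevity.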

Multifaceted evaluation

After the initial set of questions is generated, it undergoes a rigorous evaluation process. This multifaceted approach makes sure that the questions adhere to all quality standards and guidelines. The evaluation process includes three phases: LLM-based guideline evaluation, rule-based checks, and a final evaluation.

LLM-based guideline evaluation

In collaboration with EBSCOlearning, the GenAIIC team manually developed a comprehensive set of evaluation guidelines covering fundamental requirements for multiple-choice questions, such as validity, accuracy, and relevance. Additionally, they incorporated EBSCOlearning’s specific standards for diversity, equity, inclusion, and belonging (DEIB), as well as its style and tone preferences. The AI system evaluates each question according to the established guidelines and generates a structured output that includes detailed reasoning along with a rating on a three-point scale, where 1 indicates invalid, 2 indicates partially valid, and 3 indicates valid. This rating is later used for revising the questions.

This process presented several significant challenges. The primary challenge was making sure that the AI model could effectively assess multiple complex guidelines simultaneously without overlooking any crucial aspects. This was particularly difficult because the evaluation needed to consider so many different factors—all while maintaining consistency across questions.

To keep the LLM from overlooking guidelines when presented with all of them at once, the evaluation was split into smaller, manageable tasks: the model focuses on only a few guidelines at a time, and smaller batches of questions are evaluated in parallel. This way, each guideline receives focused attention, resulting in a more accurate and comprehensive evaluation. Additionally, the system was designed with modularity in mind, streamlining the addition or removal of guidelines. Because of this flexibility, the evaluation process can adapt quickly to new requirements or changes in educational standards.
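The chunking idea can be sketched as follows. Here `evaluate_fn` stands in for the actual LLM-backed evaluation call, and the batch size and merging rule are illustrative assumptions:

```python
# Sketch: split guidelines into small batches so each LLM call focuses on
# only a few criteria, then evaluate the batches in parallel and merge.
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def evaluate_question(question, guidelines, evaluate_fn, batch_size=3):
    """Evaluate one question against all guidelines, a few at a time.

    evaluate_fn(question, batch) is assumed to return {guideline: rating},
    with ratings on the 3-point scale (1 invalid, 2 partial, 3 valid).
    The overall rating is the minimum: a question is only as valid as its
    weakest check.
    """
    batches = list(chunk(guidelines, batch_size))
    ratings = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda b: evaluate_fn(question, b), batches):
            ratings.update(partial)
    return ratings, min(ratings.values())
```

Because each batch is an independent call, new guidelines can be added without touching the rest of the pipeline, which matches the modularity described above.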

By generating detailed, structured feedback for each question, including numerical ratings, concise summaries, and in-depth reasoning, the system provides invaluable insights for continual improvement. This level of detail allows for a nuanced understanding of how well each question aligns with the established criteria, offering possibilities for targeted enhancements to the question generation process.

Rule-based checks

Some quantitative aspects of question quality proved challenging for the AI to evaluate consistently. For instance, the team noticed that correct answers were often longer than the incorrect ones, making them easy for test-takers to spot. To address this, they developed a custom algorithm that analyzes answer lengths and flags potential issues without relying on the LLM’s judgment.
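One way such a length check could work is sketched below; the ratio threshold is an illustrative assumption, not EBSCOlearning's actual rule:

```python
# Sketch of a rule-based length check: flag a question when its correct
# answer is noticeably longer than the average incorrect answer (distractor).
def flag_answer_length(choices, correct_index, ratio_threshold=1.5):
    """Return True if the correct choice's length stands out against the
    mean length of the distractors, which could give the answer away."""
    correct_len = len(choices[correct_index])
    distractors = [c for i, c in enumerate(choices) if i != correct_index]
    mean_distractor_len = sum(len(c) for c in distractors) / len(distractors)
    return correct_len > ratio_threshold * mean_distractor_len
```

A flagged question would be routed back to the revision module rather than rejected outright.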

Final evaluation

Beyond evaluating individual questions, the system also assesses the entire set of questions for a given piece of content. This step checks for duplicates, promotes diversity in the types of questions asked, and verifies that the set as a whole provides a comprehensive assessment of the learning material.
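A simple form of the set-level duplicate check could be sketched with token overlap, as below; the similarity measure and threshold are assumptions for illustration, and the production system may well use an LLM judge or embeddings instead:

```python
# Sketch: flag near-duplicate questions in a set via token-set Jaccard
# similarity. The 0.8 threshold is an illustrative assumption.
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two question stems."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def find_near_duplicates(questions, threshold=0.8):
    """Return index pairs of questions that are suspiciously similar."""
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```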

Intelligent revision

One key component of the pipeline is the intelligent revision module. This is where the iterative improvement happens. When the evaluation process flags issues with a question—whether a guideline violation or a structural problem—the question is sent back for revision. The AI model receives specific feedback on the violation and is directed to fix the issue by revising or replacing the question-answer pair.

The power of iteration

The whole pipeline goes through multiple iterations until the question aligns with all of the specified quality standards. If after several attempts a question still doesn’t meet the criteria, it’s flagged for human review. This iterative approach makes sure that the final output isn’t merely a raw AI generation, but a refined product that has gone through multiple checks and improvements.
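The generate-evaluate-revise loop with a human-review fallback might be structured like this sketch, where `evaluate` and `revise` stand in for the LLM-backed steps and the attempt limit is an assumption:

```python
# Sketch of the iterative refinement loop: evaluate, revise on failure,
# and flag for human review if the attempt budget runs out.
def refine_question(question, evaluate, revise, max_attempts=3):
    """Iterate until the question passes evaluation or attempts run out.

    evaluate(question) is assumed to return feedback like
    {"valid": bool, "reasons": [...]}; revise(question, feedback) returns
    a revised question. Returns (question, needs_human_review).
    """
    for _ in range(max_attempts):
        feedback = evaluate(question)
        if feedback["valid"]:
            return question, False
        question = revise(question, feedback)  # targeted fix based on feedback
    return question, True                      # escalate to human review
```

Keeping the feedback from each round (the history tracking described below) is what makes the revision targeted rather than a blind regeneration.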

Throughout the development process, the team maintained a strong focus on iterative tracking of changes. They implemented a unique history tracking system so they could monitor the evolution of each question through multiple rounds of generation, evaluation, and revision. This approach not only provided valuable insights into the AI model’s decision-making process, but also allowed for targeted improvements to the system over time. By closely tracking the AI model’s performance across multiple iterations, the team was able to fine-tune the prompts and evaluation criteria, resulting in a significant improvement in output quality.

Scalability and robustness

With EBSCOlearning’s vast content library in mind, the team built scalability into the core of their solution. They implemented multithreading capabilities, allowing the system to process multiple pieces of content simultaneously. They also developed sophisticated retry mechanisms to handle potential API failures or invalid outputs so the system remained reliable even when processing large volumes of content.
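The concurrency and retry pattern described above could be sketched as follows; the worker function, retry counts, and backoff are illustrative assumptions:

```python
# Sketch: process many pieces of content concurrently, retrying each item
# with exponential backoff to absorb transient API failures.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def with_retries(fn, item, max_retries=3, backoff=1.0):
    """Call fn(item), retrying on any exception with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn(item)
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the error
            time.sleep(backoff * 2 ** attempt)

def process_library(contents, process_fn, max_workers=8):
    """Generate QA sets for many pieces of content in parallel threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(with_retries, process_fn, c): c for c in contents}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads suit this workload because each worker spends most of its time waiting on API responses rather than computing locally.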

The results: A game-changer for learning assessment

By combining these components—intelligent generation, comprehensive evaluation, and adaptive revision—EBSCOlearning and the GenAIIC team created a system that not only automates the process of creating assessment questions but does so with a level of quality that rivals human-created content. This pipeline represents a significant leap forward in the application of AI to educational content creation, promising to revolutionize how learning assessments are developed and delivered.

The impact of this AI-powered solution on EBSCOlearning’s assessment creation process has been nothing short of transformative. Feedback from EBSCOlearning’s subject matter experts has been overwhelmingly positive, with the AI-generated questions meeting or exceeding the quality of manually created ones in many cases.

Key benefits of the new system include:

  • Dramatically reduced time and cost for assessment creation
  • Consistent quality across a wide range of subjects and content types
  • Improved scalability, allowing EBSCOlearning to keep pace with their growing content library
  • Enhanced learning experiences for end users, with more diverse and engaging assessments

“The Generative AI Innovation Center’s automated solution for generating multiple-choice questions and answers considerably accelerated the timeline for deployment of assessments for our online learning platform. Their approach of leveraging advanced language models and implementing carefully constructed guidelines in collaboration with our subject matter experts and product management team has resulted in assessment material that is accurate, relevant, and of high quality. This solution is saving us considerable time and effort, and will enable us to scale assessments across the wide range of skills development resources on our platform.”

—Michael Laddin, Senior Vice President & General Manager, EBSCOlearning.

Here are two examples of generated QA pairs.

Question 1: What does the Consumer Relevancy model proposed by Crawford and Mathews assert about human values in business transactions?

  • A. Human values are less important than monetary value.
  • B. Human values and monetary value are equally important in all transactions.
  • C. Human values are more important than traditional value propositions.
  • D. Human values are irrelevant compared to monetary value in business transactions.

Correct answer: C

Answer explanations:

  • A. This is contrary to what the Consumer Relevancy model asserts. The model emphasizes the importance of human values over traditional value propositions.
  • B. While this might seem balanced, it doesn’t reflect the emphasis placed on human values in the Consumer Relevancy model. The model asserts that human values are more important.
  • C. This correctly reflects the assertion of the Consumer Relevancy model as described in the Book Summary. The model emphasizes the importance of human values over traditional value propositions.
  • D. This is the opposite of what the Consumer Relevancy model asserts. The model emphasizes the importance of human values, not their irrelevance.

Overall: The Book Summary states that the Consumer Relevancy model asserts that human values are more important than traditional value propositions and that businesses must recognize this need for human values as the contemporary currency of commerce.

Question 2: According to Sara N. King, David G. Altman, and Robert J. Lee, what is the primary benefit of leaders having clarity about their own values?

  • A. It guarantees success in all leadership positions.
  • B. It helps leaders make more fulfilling career choices.
  • C. It eliminates all conflicts in the workplace environment.
  • D. It ensures leaders always make ethically perfect decisions.

Correct answer: B

Answer explanations:

  • A. The Book Summary does not suggest that clarity about values guarantees success in all leadership positions. This is an overstatement of the benefits.
  • B. This is the correct answer, as the Book Summary states that clarity about values allows leaders to make more fulfilling career choices and recognize conflicts with their core values.
  • C. While understanding one’s values can help in managing conflicts, the Book Summary does not claim it eliminates all workplace conflicts. This is an exaggeration of the benefits.
  • D. The Book Summary does not state that clarity about values ensures leaders always make ethically perfect decisions. This is an unrealistic expectation not mentioned in the content.

Overall: The Book Summary emphasizes that having clarity about one’s own values allows leaders to make more fulfilling career choices and helps them recognize when they are participating in actions that conflict with their core values.

Looking to the future

For EBSCOlearning, this project is the first step towards their goal of scaling assessments across their entire online learning platform. They’re already planning to expand the system’s capabilities, including:

  • Adapting the solution to handle more complex, technical content
  • Incorporating additional question types beyond multiple-choice
  • Exploring ways to personalize assessments based on individual learner profiles

The potential applications of this technology extend far beyond EBSCOlearning’s current use case. From personalized learning paths to adaptive testing, the possibilities for AI in education and professional development are vast and exciting.

Conclusion: A new era of learning assessment

The collaboration between EBSCOlearning and the GenAIIC demonstrates the transformative power of AI when applied thoughtfully to real-world challenges. By combining cutting-edge technology with deep domain expertise, they’ve created a solution that not only solves a pressing business need but also has the potential to enhance learning experiences for millions of people. This solution is slated to produce assessment questions for hundreds and eventually thousands of learning paths in EBSCOlearning’s curriculum.

As we look to the future of education and professional development, it’s clear that AI will play an increasingly important role. The success of this project serves as a compelling example of how AI can be used to create more efficient, effective, and engaging learning experiences.

For businesses and educational institutions alike, the message is clear: embracing AI isn’t just about keeping up with technology trends—it’s about unlocking new possibilities to better serve learners and drive innovation in education. As EBSCOlearning’s journey shows, the future of learning assessment is here, and it’s powered by AI. Consider how such a solution can enrich your own e-learning content and delight your customers with high quality and on-point assessments. To get started, contact your AWS account manager. If you don’t have an AWS account manager, contact sales. Visit Generative AI Innovation Center to learn more about our program.


About the authors

Yasin Khatami is a Senior Applied Scientist at the Generative AI Innovation Center. With more than a decade of experience in artificial intelligence (AI), he implements state-of-the-art AI products for AWS customers to drive innovation, efficiency and value for customer platforms. His expertise is in generative AI, large language models (LLMs), multi-agent techniques, and multimodal learning.

Yifu Hu is an Applied Scientist at the Generative AI Innovation Center. He develops machine learning and generative AI solutions for diverse customer challenges across various industries. Yifu specializes in creative problem-solving, with expertise in AI/ML technologies, particularly in applications of large language models and AI agents.

Aude Genevay is a Senior Applied Scientist at the Generative AI Innovation Center, where she helps customers tackle critical business challenges and create value using generative AI. She holds a PhD in theoretical machine learning and enjoys turning cutting-edge research into real-world solutions.

Mike Laddin is Senior Vice President & General Manager of EBSCOlearning, a division of EBSCO Information Services. EBSCOlearning offers highly acclaimed online products and services for companies, educational institutions, and workforce development organizations. Mike oversees a team of professionals focused on unlocking the potential of people and organizations with on-demand upskilling and microlearning solutions. He has over 25 years of experience as both an entrepreneur and software executive in the information services industry. Mike received an MBA from the Lally School of Management at Rensselaer Polytechnic Institute, and outside of work he is an avid boater.

Alyssa Gigliotti is a Content Strategy and Product Operations Manager at EBSCOlearning, where she collaborates with her team to design top-tier microlearning solutions focused on enhancing business and power skills. With a background in English, professional writing, and technical communications from UMass Amherst, Alyssa combines her expertise in language with a strategic approach to educational content. Alyssa’s in-depth knowledge of the product and voice of the customer allows her to actively engage in product development planning to ensure her team continuously meets the needs of users. Outside the professional sphere, she is both a talented artist and a passionate reader, continuously seeking inspiration from creative and literary pursuits.

Read More

Built for the Era of AI, NVIDIA RTX AI PCs Enhance Content Creation, Gaming, Entertainment and More

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and showcases new hardware, software, tools and accelerations for GeForce RTX PC and NVIDIA RTX workstation users.

NVIDIA and GeForce RTX GPUs are built for the era of AI.

RTX GPUs feature specialized AI Tensor Cores that can deliver more than 1,300 trillion operations per second (TOPS) of processing power for cutting-edge performance in gaming, creating, everyday productivity and more. Today there are more than 600 deployed AI-powered games and apps that are accelerated by RTX.

RTX AI PCs can help anyone start their AI journey and supercharge their work.

Every RTX AI PC comes with regularly updated NVIDIA Studio Drivers — fine-tuned in collaboration with developers — that enhance performance in top creative apps and are tested extensively to deliver maximum stability. Download the December Studio Driver today.

The importance of large language models (LLMs) continues to grow. Two benchmarks were introduced this week to spotlight LLM performance on various hardware: MLPerf Client v0.5 and Procyon AI Text Generation. These LLM-based benchmarks, which internal tests have shown accurately replicate real-world performance, are easy to run.

This holiday season, content creators can participate in the #WinterArtChallenge, running through February. Share winter-themed art on Facebook, Instagram or X with #WinterArtChallenge for a chance to be featured on NVIDIA Studio social media channels.

Advanced AI

With NVIDIA and GeForce RTX GPUs, AI elevates everyday tasks and activities, as covered in our AI Decoded blog series. For example, AI can enable:

Faster creativity: With Stable Diffusion, users can quickly create and refine images from text prompts to achieve their desired output. When using an RTX GPU, these results can be generated up to 2.2x faster than on an NPU. And thanks to software optimizations using the NVIDIA TensorRT SDK, the applications used to run these models, like ComfyUI, get an additional 60% boost.

Greater gaming: NVIDIA DLSS technology boosts frame rates and improves image quality, using AI to automatically generate pixels in video games. With ongoing improvements, including to Ray Reconstruction, DLSS enables richer visual quality for more immersive gameplay.

Enhanced entertainment: RTX Video Super Resolution uses AI to enhance video by removing compression artifacts and sharpening edges while upscaling video quality. RTX Video HDR converts any standard dynamic range video into vibrant high dynamic range, enabling more vivid, dynamic colors when streamed in Google Chrome, Microsoft Edge, Mozilla Firefox or VLC media player.

Improved productivity: The NVIDIA ChatRTX tech demo app connects a large language model, like Meta’s Llama, to a user’s data for quickly querying notes, documents or images. Free for RTX GPU owners, the custom chatbot provides quick, contextually relevant answers. Since it runs locally on Windows RTX PCs and workstations, results are fast and private.

This snapshot of AI capabilities barely scratches the surface of the technology’s possibilities. With an NVIDIA or GeForce RTX GPU-powered system, users can also supercharge their STEM studies and research, and tap into the NVIDIA Studio suite of AI-powered tools.

Decisions, Decisions

More than 200 powerful RTX AI PCs are capable of running advanced AI.

ASUS’ Vivobook Pro 16X comes with up to a GeForce RTX 4070 Laptop GPU, as well as a superbright 550-nit panel, ultrahigh contrast ratio and ultrawide 100% DCI-P3 color gamut. It’s available on Amazon and ASUS.com.

Dell’s Inspiron 16 Plus 7640 comes with up to a GeForce RTX 4060 Laptop GPU and a 16:10 aspect ratio display, ideal for users working on multiple projects. It boasts military-grade testing for added reliability and an easy-to-use, built-in Trusted Platform Module to protect sensitive data. It’s available on Amazon and Dell.com.

GIGABYTE’s AERO 16 OLED, equipped with up to a GeForce RTX 4070 Laptop GPU, is designed for professionals, designers and creators. The 16:10 thin-bezel 4K+ OLED screen is certified by multiple third parties to provide the best visual experience with X-Rite 2.0 factory-by-unit color calibration and Pantone Validated color calibration. It’s available on Amazon and GIGABYTE.com.

MSI’s Creator M14 comes with up to a GeForce RTX 4070 Laptop GPU, delivering a quantum leap in performance with DLSS 3 to enable lifelike virtual worlds with full ray tracing. Plus, its Max-Q suite of technologies optimizes system performance, power, battery life and acoustics for peak efficiency. Purchase one on Amazon or MSI.com.

These are just a few of the many RTX AI PCs available, with some on sale, including the Acer Nitro V, ASUS TUF 16″, HP Envy 16″ and Lenovo Yoga Pro 9i.

Follow NVIDIA Studio on Facebook, Instagram and X. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Generative AI is transforming gaming, videoconferencing and interactive experiences of all kinds. Make sense of what’s new and what’s next by subscribing to the AI Decoded newsletter.

Read More

Into the Omniverse: How OpenUSD-Based Simulation and Synthetic Data Generation Advance Robot Learning

Editor’s note: This post is part of Into the Omniverse, a series focused on how developers, 3D practitioners, and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse.

Scalable simulation technologies are driving the future of autonomous robotics by reducing development time and costs.

Universal Scene Description (OpenUSD) provides a scalable and interoperable data framework for developing virtual worlds where robots can learn how to be robots. With SimReady OpenUSD-based simulations, developers can create limitless scenarios based on the physical world.

And NVIDIA Isaac Sim is advancing perception AI-based robotics simulation. Isaac Sim is a reference application built on the NVIDIA Omniverse platform for developers to simulate and test AI-driven robots in physically based virtual environments.

At AWS re:Invent, NVIDIA announced that Isaac Sim is now available on Amazon EC2 G6e instances powered by NVIDIA L40S GPUs. These powerful instances enhance the performance and accessibility of Isaac Sim, making high-quality robotics simulations more scalable and efficient.

These advancements in Isaac Sim mark a significant leap for robotics development. By enabling realistic testing and AI model training in virtual environments, companies can reduce time to deployment and improve robot performance across a variety of use cases.

Advancing Robotics Simulation With Synthetic Data Generation

Robotics companies like Cobot, Field AI and Vention are using Isaac Sim to simulate and validate robot performance while others, such as SoftServe and Tata Consultancy Services, use synthetic data to bootstrap AI models for diverse robotics applications.

The evolution of robot learning has been deeply intertwined with simulation technology. Early experiments in robotics relied heavily on labor-intensive, resource-heavy trials. Simulation is a crucial tool for the creation of physically accurate environments where robots can learn through trial and error, refine algorithms and even train AI models using synthetic data.

Physical AI describes AI models that can understand and interact with the physical world. It embodies the next wave of autonomous machines and robots, such as self-driving cars, industrial manipulators, mobile robots, humanoids and even robot-run infrastructure like factories and warehouses.

Robotics simulation, which forms the second computer in the three-computer solution, is a cornerstone of physical AI development that lets engineers and researchers design, test and refine systems in a controlled virtual environment.

A simulation-first approach significantly reduces the cost and time associated with physical prototyping while enhancing safety by allowing robots to be tested in scenarios that might otherwise be impractical or hazardous in real life.

With a new reference workflow, developers can accelerate the generation of synthetic 3D datasets with generative AI using OpenUSD NIM microservices. This integration streamlines the pipeline from scene creation to data augmentation, enabling faster and more accurate training of perception AI models.

Synthetic data can help address the challenge of limited, restricted or unavailable data needed to train various types of AI models, especially in computer vision. Developing action recognition models is a common use case that can benefit from synthetic data generation.

To learn how to create a human action recognition video dataset with Isaac Sim, check out the technical blog on Scaling Action Recognition Models With Synthetic Data. 3D simulations offer developers precise control over image generation, eliminating hallucinations.

Robotic Simulation for Humanoids

Humanoid robots are the next wave of embodied AI, but they present a challenge at the intersection of mechatronics, control theory and AI. Simulation is crucial to solving this challenge by providing a safe, cost-effective and versatile platform for training and testing humanoids.

With NVIDIA Isaac Lab, an open-source unified framework for robot learning built on top of Isaac Sim, developers can train humanoid robot policies at scale via simulations. Leading commercial robot makers are adopting Isaac Lab to handle increasingly complex movements and interactions.

NVIDIA Project GR00T, an active research initiative to enable the humanoid robot ecosystem of builders, is pioneering workflows such as GR00T-Gen to generate robot tasks and simulation-ready environments in OpenUSD. These can be used for training generalist robots to perform manipulation, locomotion and navigation.

Recently published research from Project GR00T also shows how advanced simulation can be used to train interactive humanoids. Using Isaac Sim, the researchers developed a single unified controller for physically simulated humanoids called MaskedMimic. The system is capable of generating a wide range of motions across diverse terrains from intuitive user-defined intents.

Physics-Based Digital Twins Simplify AI Training

Partners across industries are using Isaac Sim, Isaac Lab, Omniverse, and OpenUSD to design, simulate and deploy smarter, more capable autonomous machines:

  • Agility uses Isaac Lab to create simulations that let simulated robot behaviors transfer directly to the robot, making it more intelligent, agile and robust when deployed in the real world.
  • Cobot uses Isaac Sim with its AI-powered cobot, Proxie, to optimize logistics in warehouses, hospitals, manufacturing sites and more.
  • Cohesive Robotics has integrated Isaac Sim into its software framework called Argus OS for developing and deploying robotic workcells used in high-mix manufacturing environments.
  • Field AI, a builder of robot foundation models, uses Isaac Sim and Isaac Lab to evaluate the performance of its models in complex, unstructured environments across industries such as construction, manufacturing, oil and gas, mining, and more.
  • Fourier uses NVIDIA Isaac Gym and Isaac Lab to train its GR-2 humanoid robot, using reinforcement learning and advanced simulations to accelerate development, enhance adaptability and improve real-world performance.
  • Foxglove integrates Isaac Sim and Omniverse to enable efficient robot testing, training and sensor data analysis in realistic 3D environments.
  • Galbot used Isaac Sim to verify the data generation of DexGraspNet, a large-scale dataset of 1.32 million ShadowHand grasps, advancing robotic hand functionality by enabling scalable validation of diverse object interactions across 5,355 objects and 133 categories.
  • Standard Bots is simulating and validating the performance of its R01 robot used in manufacturing and machining setups.
  • Wandelbots integrates its NOVA platform with Isaac Sim to create physics-based digital twins and intuitive training environments, simplifying robot interaction and enabling seamless testing, validation and deployment of robotic systems in real-world scenarios.

Learn more about how Wandelbots is advancing robot learning with NVIDIA technology in this livestream recording.

Get Plugged Into the World of OpenUSD

NVIDIA experts and Omniverse Ambassadors are hosting livestream office hours and study groups to provide robotics developers with technical guidance and troubleshooting support for Isaac Sim and Isaac Lab. Learn how to get started simulating robots in Isaac Sim with this new, free course on NVIDIA Deep Learning Institute (DLI).

For more on optimizing OpenUSD workflows, explore the new self-paced Learn OpenUSD training curriculum that includes free DLI courses for 3D practitioners and developers. For more resources on OpenUSD, explore the Alliance for OpenUSD forum and the AOUSD website.

Don’t miss the CES keynote delivered by NVIDIA founder and CEO Jensen Huang live in Las Vegas on Monday, Jan. 6, at 6:30 p.m. PT for more on the future of AI and graphics.

Stay up to date by subscribing to NVIDIA news, joining the community, and following NVIDIA Omniverse on Instagram, LinkedIn, Medium and X.

Featured image courtesy of Fourier.

torchcodec: Easy and Efficient Video Decoding for PyTorch

We are pleased to officially announce torchcodec, a library for decoding videos into PyTorch tensors. It is fast, accurate, and easy to use. When running PyTorch models on videos, torchcodec is our recommended way to turn those videos into data your model can use.

Highlights of torchcodec include:

  • An intuitive decoding API that treats a video file as a Python sequence of frames. We support both index-based and presentation-time-based frame retrieval.
  • An emphasis on accuracy: we ensure you get the frames you requested, even if your video has variable frame rates.
  • A rich sampling API that makes it easy and efficient to retrieve batches of frames.
  • Best-in-class CPU decoding performance.
  • CUDA accelerated decoding that enables high throughput when decoding many videos at once.
  • Support for all codecs available in your installed version of FFmpeg.
  • Simple binary installs for Linux and Mac.

Easy to Use

A simple, intuitive API was one of our main design principles. We start with simple decoding and extracting specific frames of a video:

from torchcodec.decoders import VideoDecoder
from torch import Tensor

decoder = VideoDecoder("my_video.mp4")

# Index based frame retrieval.
first_ten_frames: Tensor = decoder[:10]
last_ten_frames: Tensor = decoder[-10:]

# Multi-frame retrieval, index and time based.
frames = decoder.get_frames_at(indices=[10, 0, 15])
frames = decoder.get_frames_played_at(seconds=[0.2, 3, 4.5])

All decoded frames are already PyTorch tensors, ready to be fed into models for training.

Of course, more common in ML training pipelines is sampling multiple clips from videos. A clip is just a sequence of frames in presentation order—but the frames are often not consecutive. Our sampling API makes this easy:

from torchcodec.samplers import clips_at_regular_timestamps

clips = clips_at_regular_timestamps(
  decoder,
  seconds_between_clip_starts=10,
  num_frames_per_clip=5,
  seconds_between_frames=0.2,
)

The above call yields a batch of clips where each clip starts 10 seconds apart, each clip has 5 frames, and those frames are 0.2 seconds apart. See our tutorials on decoding and sampling for more!
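
The timing arithmetic behind that call can be sketched in plain Python. This is a hypothetical helper for intuition only, not torchcodec's actual sampler implementation:

```python
def clip_timestamps(video_duration_s, seconds_between_clip_starts,
                    num_frames_per_clip, seconds_between_frames):
    # Hypothetical helper mirroring the sampler's timing arithmetic:
    # emit one clip per start time, as long as the whole clip fits.
    clips, start = [], 0.0
    while start + (num_frames_per_clip - 1) * seconds_between_frames <= video_duration_s:
        clips.append([round(start + i * seconds_between_frames, 3)
                      for i in range(num_frames_per_clip)])
        start += seconds_between_clip_starts
    return clips

# For a 30-second video: clips start at 0 s, 10 s and 20 s,
# each with 5 frames spaced 0.2 s apart.
clips = clip_timestamps(30.0, 10, 5, 0.2)
```

The sampler itself then requests frames at those presentation times and returns them as batched tensors.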

Fast Performance

Performance was our other main design principle. Decoding videos for ML training has different performance requirements than decoding videos for playback. A typical ML video training pipeline will process many different videos (sometimes in the millions!), but only sample a small number of frames (dozens to hundreds) from each video.

For this reason, we’ve paid particular attention to our decoder’s performance when seeking multiple times in a video, decoding a small number of frames after each seek. We present experiments with the following four scenarios:

  1. Decoding and transforming frames from multiple videos at once, inspired by what we have seen in data loading for large-scale training pipelines:

    a. Ten threads decode batches of 50 videos in parallel.
    b. For each video, decode 10 frames at evenly spaced times.
    c. For each frame, resize it to a 256×256 resolution.

  2. Decoding 10 frames at random locations in a single video.
  3. Decoding 10 frames at evenly spaced times of a single video.
  4. Decoding the first 100 frames of a single video.
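
Scenario 1's parallel structure can be sketched with a thread pool. The decode function below is a stand-in for the real per-video work (open, seek to evenly spaced times, decode, resize); only the orchestration is shown:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_video(video_id, num_frames=10, size=(256, 256)):
    # Stand-in for: open the video, seek to evenly spaced times,
    # decode one frame per seek, and resize each frame to `size`.
    return [(video_id, i, size) for i in range(num_frames)]

video_ids = list(range(50))  # one batch of 50 videos
with ThreadPoolExecutor(max_workers=10) as pool:  # ten decoding threads
    batch = list(pool.map(decode_video, video_ids))
```

Each worker handles one video at a time, so the pool keeps ten videos in flight until the batch of 50 is done.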

We compare the following video decoders:

  • Torchaudio, CPU decoding only.
  • Torchvision, using the video_reader backend, which is CPU decoding only.
  • Torchcodec, GPU decoding with CUDA.
  • Torchcodec, CPU decoding only.

Using the following three videos:

  1. A synthetically generated video using FFmpeg’s mandelbrot generation pattern. The video is 10 seconds long, 60 frames per second and 1920×1080.
  2. Same as above, except the video is 120 seconds long.
  3. A promotional video from NASA that is 206 seconds long, 29.7 frames per second and 960×540.

The experimental script is in our repo. Our experiments run on a Linux system with an Intel processor that has 22 available cores and an NVIDIA GPU. For CPU decoding, all libraries were instructed to automatically determine the best number of threads to use.

Benchmark chart

From our experiments, we draw several conclusions:

  • Torchcodec is consistently the best-performing library for the primary use case we designed it for: decoding many videos at once as a part of a training data loading pipeline. In particular, high-resolution videos see great gains with CUDA where decoding and transforms both happen on the GPU.
  • Torchcodec is competitive on the CPU with seek-heavy use cases such as random and uniform sampling. Currently, torchcodec’s performance is better with shorter videos that have a smaller file size. This performance is due to torchcodec’s emphasis on seek-accuracy, which involves an initial linear scan.
  • Torchcodec is not as competitive when there is no seeking; that is, opening a video file and decoding from the beginning. This is again due to our emphasis on seek-accuracy and the initial linear scan.

Implementing an approximate seeking mode in torchcodec should resolve these performance gaps, and it’s our highest priority feature for video decoding.

What’s Next?

As the name implies, the long-term future for torchcodec is more than just video decoding. Our next big feature is audio support—both decoding audio streams from video, and from audio-only media. In the long term, we want torchcodec to be the media decoding library for PyTorch. That means as we implement functionality in torchcodec, we will deprecate and eventually remove complementary features from torchaudio and torchvision.

We also have video decoding improvements lined up, such as the previously mentioned approximate seeking mode for those who are willing to sacrifice accuracy for performance.

Most importantly, we’re looking for feedback from the community! We’re most interested in working on features that the community finds valuable. Come share your needs and influence our future direction!

BayesCNS: A Unified Bayesian Approach to Address Cold Start and Non-Stationarity in Search Systems at Scale

Information Retrieval (IR) systems used in search and recommendation platforms frequently employ Learning-to-Rank (LTR) models to rank items in response to user queries. These models heavily rely on features derived from user interactions, such as clicks and engagement data. This dependence introduces cold start issues for items lacking user engagement and poses challenges in adapting to non-stationary shifts in user behavior over time. We address both challenges holistically as an online learning problem and propose BayesCNS, a Bayesian approach designed to handle cold start and…

Apple Machine Learning Research

AI Pioneers Win Nobel Prizes for Physics and Chemistry

Artificial intelligence, once the realm of science fiction, claimed its place at the pinnacle of scientific achievement Monday in Sweden.

In a historic ceremony at Stockholm’s iconic Konserthuset, John Hopfield and Geoffrey Hinton received the Nobel Prize in Physics for their pioneering work on neural networks — systems that mimic the brain’s architecture and form the bedrock of modern AI.

Meanwhile, Demis Hassabis and John Jumper accepted the Nobel Prize in Chemistry for Google DeepMind’s AlphaFold, a system that solved biology’s “impossible” problem: predicting the structure of proteins, a feat with profound implications for medicine and biotechnology.

These achievements go beyond academic prestige. They mark the start of an era where GPU-powered AI systems tackle problems once deemed unsolvable, revolutionizing multitrillion-dollar industries from healthcare to finance.

Hopfield’s Legacy and the Foundations of Neural Networks

In the 1980s, Hopfield, a physicist with a knack for asking big questions, brought a new perspective to neural networks.

He introduced energy landscapes — borrowed from physics — to explain how neural networks solve problems by finding stable, low-energy states. His ideas, abstract yet elegant, laid the foundation for AI by showing how complex systems optimize themselves.
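
The idea can be made concrete with a toy sketch (our illustration, not the full detail of Hopfield's formulation): a Hopfield network assigns each state an energy E = -1/2 Σᵢⱼ wᵢⱼ sᵢ sⱼ, and stored memories sit at the minima of that landscape.

```python
import itertools

def hopfield_energy(weights, state):
    # E = -1/2 * sum_{i,j} w_ij * s_i * s_j,
    # with symmetric weights and a zero diagonal
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

# A two-neuron network whose single connection rewards agreement:
# the stable, low-energy states are (1, 1) and (-1, -1).
w = [[0, 1], [1, 0]]
energies = {s: hopfield_energy(w, s)
            for s in itertools.product([-1, 1], repeat=2)}
```

With one positive connection, the agreeing states come out as the low-energy minima, which is exactly how such a network "remembers" a stored pattern.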

Geoffrey Hinton, a British cognitive psychologist with a penchant for radical ideas, picked up the baton. Hinton believed neural networks could revolutionize AI, but training these systems required enormous computational power.

In 1983, Hinton and Terrence Sejnowski built on Hopfield’s work and invented the Boltzmann machine, which used stochastic binary neurons to jump out of local minima. They discovered an elegant and remarkably simple learning procedure, grounded in statistical mechanics, that offered an alternative to backpropagation.

In 2006, a simplified version of this learning procedure proved highly effective at initializing deep neural networks before training them with backpropagation. Even so, training these systems still demanded enormous computational power.

AlphaFold: Biology’s AI Revolution

A decade after AlexNet, AI moved to biology. Hassabis and Jumper led the development of AlphaFold to solve a problem that had stumped scientists for years: predicting the shape of proteins.

Proteins are life’s building blocks. Their shapes determine what they can do. Understanding these shapes is the key to fighting diseases and developing new medicines. But finding them was slow, costly and unreliable.

AlphaFold changed that. It used Hopfield’s ideas and Hinton’s networks to predict protein shapes with stunning accuracy. Powered by GPUs, it mapped almost every known protein. Now, scientists use AlphaFold to fight drug resistance, make better antibiotics and treat diseases once thought to be incurable.

What was once biology’s Gordian knot has been untangled — by AI.

The GPU Factor: Enabling AI’s Potential

GPUs, the indispensable engines of modern AI, are at the heart of these achievements. Originally designed to make video games look good, GPUs were perfect for the massive parallel processing demands of neural networks.

NVIDIA GPUs, in particular, became the engine driving breakthroughs like AlexNet and AlphaFold. Their ability to process vast datasets with extraordinary speed allowed AI to tackle problems on a scale and complexity never before possible.

Redefining Science and Industry

The Nobel-winning breakthroughs of 2024 aren’t just rewriting textbooks — they’re optimizing global supply chains, accelerating drug development and helping farmers adapt to changing climates.

Hopfield’s energy-based optimization principles now inform AI-powered logistics systems. Hinton’s architectures underpin self-driving cars and language models like ChatGPT. AlphaFold’s success is inspiring AI-driven approaches to climate modeling, sustainable agriculture and even materials science.

The recognition of AI in physics and chemistry signals a shift in how we think about science. These tools are no longer confined to the digital realm. They’re reshaping the physical and biological worlds.

Pixtral 12B is now available on Amazon SageMaker JumpStart

Today, we are excited to announce that Pixtral 12B (pixtral-12b-2409), a state-of-the-art vision language model (VLM) from Mistral AI that excels in both text-only and multimodal tasks, is available for customers through Amazon SageMaker JumpStart. You can try this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference.

In this post, we walk through how to discover, deploy, and use the Pixtral 12B model for a variety of real-world vision use cases.

Pixtral 12B overview

Pixtral 12B represents Mistral’s first VLM and demonstrates strong performance across various benchmarks, outperforming other open models and matching larger models, according to Mistral. Pixtral is trained to understand both images and documents, and shows strong abilities in vision tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following, some of which we demonstrate later in this post with examples. Pixtral 12B is able to ingest images at their natural resolution and aspect ratio. Unlike other open source models, Pixtral doesn’t compromise on text benchmark performance, such as instruction following, coding, and math, to excel in multimodal tasks.

Mistral designed a novel architecture for Pixtral 12B to optimize for both speed and performance. The model has two components: a 400-million-parameter vision encoder, which tokenizes images, and a 12-billion-parameter multimodal transformer decoder, which predicts the next text token given a sequence of text and images. The vision encoder was newly trained to natively support variable image sizes, which allows Pixtral to accurately understand complex diagrams, charts, and documents in high resolution, while providing fast inference on small images such as icons, clipart, and equations. This architecture allows Pixtral to process any number of images with arbitrary sizes in its large context window of 128,000 tokens.

License agreements are a critical decision factor when using open-weights models. Similar to other Mistral models, such as Mistral 7B, Mixtral 8x7B, Mixtral 8x22B and Mistral Nemo 12B, Pixtral 12B is released under the commercially permissive Apache 2.0, providing enterprise and startup customers with a high-performing VLM option to build complex multimodal applications.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker Inference instances, including AWS Trainium and AWS Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This enforces data security and compliance, because the models operate under your own VPC controls rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune the model, using SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process. Note that fine-tuning on Pixtral 12B was not yet available on SageMaker JumpStart at the time of writing.

Prerequisites

To try out Pixtral 12B in SageMaker JumpStart, you need the following prerequisites:

Discover Pixtral 12B in SageMaker JumpStart

You can access Pixtral 12B through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an IDE that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio Classic.

  1. In SageMaker Studio, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
  2. Choose HuggingFace to access the Pixtral 12B model.
  3. Search for the Pixtral 12B model.
  4. You can choose the model card to view details about the model such as license, data used to train, and how to use the model.
  5. Choose Deploy to deploy the model and create an endpoint.

Deploy the model in SageMaker JumpStart

Deployment starts when you choose Deploy. When deployment is complete, an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

To deploy using the SDK, we start by selecting the Pixtral 12B model, specified by the model_id with the value huggingface-vlm-mistral-pixtral-12b-2409. You can deploy the model on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel 

accept_eula = True 

model = JumpStartModel(model_id="huggingface-vlm-mistral-pixtral-12b-2409") 
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The end-user license agreement (EULA) value must be explicitly set to True in order to accept the EULA. Also, make sure that you have an account-level service quota of one or more instances for ml.p4d.24xlarge or ml.p4de.24xlarge endpoint usage. To request a service quota increase, refer to AWS service quotas. After you deploy the model, you can run inference against the deployed endpoint through the SageMaker predictor.
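
As a sketch of what a request to this endpoint looks like, the helper below assembles a chat-style payload in the message format used throughout this post. The helper name is our own illustration, not part of the SageMaker SDK:

```python
def build_pixtral_payload(prompt_text, image_url,
                          max_tokens=2000, temperature=0.6, top_p=0.9):
    # Assemble a chat-style request in the message format used in this post.
    # The function name and defaults are illustrative, not an official API.
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt_text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }

payload = build_pixtral_payload("Describe this image.", "Pixtral_data/example.jpg")
# The deployed endpoint would receive this via predictor.predict(payload)
```

The use case sections that follow build payloads of exactly this shape by hand.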

Pixtral 12B use cases

In this section, we provide examples of inference on Pixtral 12B with example prompts.

OCR

We use the following image as input for OCR.

We use the following prompt:

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract and transcribe all text visible in the image, preserving its exact formatting, layout, and any special characters. Include line breaks and maintain the original capitalization and punctuation.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "Pixtral_data/amazon_s1_2.jpg"
                    }
                }
            ]
        }
    ],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.9,
}
response = predictor.predict(payload)
print(response)

We get the following output:
Approximate date of commencement of proposed sale to the public: AS SOON AS PRACTICABLE AFTER THIS REGISTRATION STATEMENT BECOMES EFFECTIVE. 
If any of the securities being registered on this Form are to be offered on a delayed or continuous basis pursuant to Rule 415 under the Securities Act of 1933, check the following box. 
[] If this Form is filed to register additional securities for an offering pursuant to Rule 462(b) under the Securities Act of 1933, check the following box and list the Securities Act registration statement number of the earlier effective registration statement for the same offering. 
[] If this Form is a post-effective amendment filed pursuant to Rule 462(c) under the Securities Act of 1933, check the following box and list the Securities Act registration statement number of the earlier effective registration statement for the same offering. 
[] If delivery of the prospectus is expected to be made pursuant to Rule 434, please check the following box. 
[] **CALCULATION OF REGISTRATION FEE** 
| TITLE OF EACH CLASS OF SECURITIES TO BE REGISTERED | AMOUNT TO BE REGISTERED(1) | PROPOSED MAXIMUM OFFERING PRICE PER SHARE(2) | PROPOSED MAXIMUM AGGREGATE OFFERING PRICE(2) | AMOUNT OF REGISTRATION FEE | 
|----------------------------------------------------|----------------------------|---------------------------------------------|---------------------------------------------|----------------------------| 
| Common Stock, $0.01 par value per share........... | 2,875,000 shares           | $14.00                                      | $40,250,000                                 | $12,197(3)                 | 

(1) Includes 375,000 shares that the Underwriters have the option to purchase to cover over-allotments, if any. 
(2) Estimated solely for the purpose of calculating the registration fee in accordance with Rule 457(c). 
(3) $11,326 of registration fee has been previously paid. ...

Chart understanding and analysis

For chart understanding and analysis, we use the following image as input.

We use the following prompt:

prompt= """
Analyze the attached image of the chart or graph. Your tasks are to:
Identify the type of chart or graph (e.g., bar chart, line graph, pie chart, etc.).
Extract the key data points, including labels, values, and any relevant scales or units.
Identify and describe the main trends, patterns, or significant observations presented in the chart.
Generate a clear and concise paragraph summarizing the extracted data and insights. The summary should highlight the most important information and provide an overview that would help someone understand the chart without seeing it.
Ensure that your summary is well-structured, accurately reflects the data, and is written in a professional tone.
"""
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "Pixtral_data/amazon_s1_2.jpg"
                    }
                }
            ]
        }
    ],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.9,
}
image_path = "Pixtral_data/Amazon_Chart.png"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print(response)

We get the following output:

The image is a bar chart titled "Segment Results – North America," which presents data on net sales and operating income over several quarters from Q2 2023 to Q2 2024. The chart is divided into two sections: one for net sales and the other for operating income.

### Key Data Points:
- Net Sales:
 - Q2 2023: $82,546 million
 - Q3 2023: Approximately $85,000 million
 - Q4 2023: Approximately $90,000 million
 - Q1 2024: Approximately $85,000 million
 - Q2 2024: $90,033 million
 - Year-over-Year (Y/Y) growth: 9%

- Operating Income:
 - Q2 2023: $3,211 million
 - Q3 2023: Approximately $4,000 million
 - Q4 2023: Approximately $7,000 million
 - Q1 2024: Approximately $5,000 million
 - Q2 2024: $5,065 million
 - Year-over-Year (Y/Y) growth: 58%

- Total Trailing Twelve Months (TTM):
 - Net Sales: $369.8 billion
 - Operating Income: $20.8 billion
...
- **Operating Income:** Operating income shows significant growth, particularly in Q4 2023, where it peaks. There is a notable year-over-year increase of 58%.

### Summary:
The bar chart illustrates the segment results for North America, focusing on net sales and operating income from Q2 2023 to Q2 2024. Net sales demonstrate a steady upward trend, culminating in a 9% year-over-year increase, with the highest value recorded in Q2 2024 at $90,033 million. Operating income exhibits more volatility, with a significant peak in Q4 2023, and an overall substantial year-over-year growth of 58%. The total trailing twelve months (TTM) figures indicate robust performance, with net sales reaching $369.8 billion and operating income at $20.8 billion. This data underscores a positive growth trajectory in both net sales and operating income for the North American segment over the observed period.

Image to code

For an image-to-code example, we use the following image as input.

We use the following prompt:

import re

def extract_html(text):
    # Extract the HTML between ```html ... ``` fences in the model response
    pattern = r'```html\s*(.*?)\s*```'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None
  
prompt = "Create HTML and CSS code for a minimalist and futuristic website to purchase luggage. Use the following image as template to create your own design."
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "Pixtral_data/Amazon_Chart.png"
                    }
                }
            ]
        }
    ],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.9,
}
print('Input Image:\n\n')
html_code = extract_html(response)
print(html_code)
display(HTML(html_code))
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Luggage Store</title>
    <link rel="stylesheet" href="styles.css">
</head>
<body>
    <header>
        <h1>Luggage Store</h1>
        <nav>
            <ul>
                <li><a href="#">Home</a></li>
                <li><a href="#">Products</a></li>
                <li><a href="#">About</a></li>
                <li><a href="#">Contact</a></li>
            </ul>
        </nav>
    </header>
...
        <p>&copy; 2023 Luggage Store. All rights reserved.</p>
    </footer>
</body>
</html>

Clean up

After you are done, delete the SageMaker endpoints using the following code to avoid incurring unnecessary costs:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Mistral’s newest multimodal model, Pixtral 12B, in SageMaker JumpStart and deploy the model for inference. We also explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including other Mistral AI models, such as Mistral 7B and Mixtral 8x22B.

For more information about SageMaker JumpStart, refer to Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.

For more Mistral assets, check out the Mistral-on-AWS repo.


About the Authors

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Niithiyn Vijeaswaran is a GenAI Specialist Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML AWS services, including model offerings from top tier foundation model providers.

Talk to your slide deck using multimodal foundation models on Amazon Bedrock – Part 3

In this series, we share two approaches to gain insights on multimodal data like text, images, and charts. In Part 1, we presented an “embed first, infer later” solution that uses the Amazon Titan Multimodal Embeddings foundation model (FM) to convert individual slides from a slide deck into embeddings. We stored the embeddings in a vector database and then used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) model to generate text responses to user questions based on the most similar slide retrieved from the vector database. Part 1 uses AWS services including Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless.

In Part 2, we demonstrated a different approach: “infer first, embed later.” We used Anthropic’s Claude 3 Sonnet on Amazon Bedrock to generate text descriptions for each slide in the slide deck. These descriptions are then converted into text embeddings using the Amazon Titan Text Embeddings model and stored in a vector database. Then we used Anthropic’s Claude 3 Sonnet to generate answers to user questions based on the most relevant text description retrieved from the vector database.

In this post, we evaluate the results from both approaches using ground truth provided by SlideVQA[1], an open source visual question answering dataset. You can test both approaches and evaluate the results to find the best fit for your datasets. The code for this series is available in the GitHub repo.

Comparison of approaches

SlideVQA is a collection of publicly available slide decks, each composed of multiple slides (in JPG format) and questions based on the information in the slide decks. It allows a system to select a set of evidence images and answer the question. We use SlideVQA as the single source of truth to compare the results. It’s important that you follow the Amazon Bedrock data protection policies when using public datasets.

This post follows the process depicted in the following diagram. For more details about the architecture, refer to the solution overview and design in Parts 1 and 2 of the series.

Process flow

We selected 100 random questions from SlideVQA to create a sample dataset to test solutions from Part 1 and Part 2.

The responses to the questions in the sample dataset are as concise as possible, as shown in the following example:

"question": "What is the process by which the breaking of hydrogen bonds allows water to change from the liquid phase into the gaseous phase which has reached equilibrium with the liquid surface said to have done?"

"answer": "reached saturation"

The responses from large language models (LLMs) are quite verbose:

According to the information provided in the images, the process by which the breaking of hydrogen bonds allows water to change from the liquid phase into the gaseous phase that has reached equilibrium with the liquid surface is said to have reached saturation.

The key points are:

1. Evaporation involves the breaking of hydrogen bonds that hold water molecules together in the liquid phase, allowing them to transition into the gaseous (vapor) phase.

2. Only the fastest moving water molecules with enough energy can overcome the hydrogen bonding and evaporate into the vapor phase.

3. The evaporation process that has reached equilibrium with the liquid surface, where the vapor pressure is balanced with the evaporation rate, is described as having reached saturation.

So in summary, the breaking of hydrogen bonds provides the mechanism for water molecules to gain enough energy to escape the liquid phase as vapor, and when this vapor has reached equilibrium with the liquid surface, it is said to have reached saturation.

We updated the prompts in each approach to request short responses instead of verbose ones. This helped match the output length to the ground truth responses in the sample dataset.

The following sections briefly discuss the solutions and dive into the evaluation and pricing for each approach.

Approach 1: Embed first, infer later

Slide decks are converted into PDF images, one per slide, and embedded using the Amazon Titan Multimodal Embeddings model, resulting in a vector embedding of 1,024 dimensions. The embeddings are stored in an OpenSearch Serverless index, which serves as the vector store for our Retrieval Augmented Generation (RAG) solution. The embeddings are ingested using an Amazon OpenSearch Ingestion pipeline.
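As an illustration, the embedding step might look like the following sketch. The helper names, file path, and defaults are assumptions for this post, not code from the solution, and invoking the model requires AWS credentials with Amazon Bedrock access:

```python
import base64
import json


def build_titan_image_request(image_bytes: bytes) -> str:
    """Build the request body for the Amazon Titan Multimodal Embeddings model."""
    return json.dumps({"inputImage": base64.b64encode(image_bytes).decode("utf-8")})


def embed_slide(image_path: str, region: str = "us-east-1") -> list[float]:
    """Hypothetical helper: returns the 1,024-dimension vector for one slide image."""
    import boto3  # assumed: AWS credentials configured for Amazon Bedrock

    client = boto3.client("bedrock-runtime", region_name=region)
    with open(image_path, "rb") as f:
        body = build_titan_image_request(f.read())
    response = client.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
    return json.loads(response["body"].read())["embedding"]
```

The resulting vectors are what the OpenSearch Ingestion pipeline writes to the OpenSearch Serverless index.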

Each question is converted into embeddings using the Amazon Titan Multimodal Embeddings model, and an OpenSearch vector search is performed using these embeddings. We performed a k-nearest neighbor (k-NN) search to retrieve the most relevant embedding matching the question. The metadata of the response from the OpenSearch index contains a path to the image corresponding to the most relevant slide.
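The k-NN query against the OpenSearch Serverless index can be sketched as follows; the field names `vector_field` and `image_path` are hypothetical, chosen here for illustration:

```python
def build_knn_query(question_embedding: list[float], k: int = 1) -> dict:
    """OpenSearch k-NN query body; `vector_field` and `image_path` are
    hypothetical index field names, not taken from the solution code."""
    return {
        "size": k,
        "query": {"knn": {"vector_field": {"vector": question_embedding, "k": k}}},
        "_source": ["image_path"],  # metadata pointing to the most relevant slide
    }
```

The hit returned by this query carries the path of the slide image that is passed to the LLM in the next step.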

The following prompt is created by combining the question and the image path, and is sent to Anthropic’s Claude 3 Sonnet to respond to the question with a concise answer:

Human: Your role is to provide a precise answer to the question in the <question></question> tags. Search the image provided to answer the question. Retrieve the most accurate answer in as few words as possible. Do not make up an answer. For questions that ask for numbers, follow the instructions below in the <instructions></instructions> tags. Skip the preamble and provide only the exact precise answer.

If the image does not contain the answer to the question below, then respond with two words only - "no answer".

Refer to the question and instructions below:

<question>
{question}
</question>


<instructions>
1. Search for relevant data and numbers in the charts and graphs present in the image.

2. If the image does not provide a direct answer to the user question, just say "no answer". Do not add statements like "The image does not provide..." and "It only mentions...", instead just respond with "no answer".

3. Do not add any tags in your answer.

4. Scan for the direct answer to the user question. If there is more than one direct answer, give everything that seems like a valid answer to the question in your response.

5. Search for the question deeply in the image. If the question asks about any data or statistics, look for it in charts, tables, graphs first, and then in texts. Check the headings in the image.

</instructions>

If the image does not contain the answer, or if the image does not directly answer the user question, do not respond with "The image does not provide..." or anything similar. In this case, your response should always be "no answer" and nothing else.

Assistant: Here is my response to the question. I will give a direct and precise answer to the question if I find it and if not, I will say "no answer":

We used Anthropic’s Claude 3 Sonnet instead of LLaVA 1.5-7b as mentioned in the solution for Part 1. The approach remains the same, “embed first, infer later,” but the model that compiles the final response is changed for simplicity and comparability between approaches.
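For reference, a request to Anthropic’s Claude 3 Sonnet pairing one slide image with the prompt above might be assembled as follows. This is a sketch of a Bedrock Messages API call, not the solution’s actual code; helper names and `max_tokens` are assumptions:

```python
import base64
import json


def build_claude_image_request(prompt: str, image_bytes: bytes) -> str:
    """Anthropic Claude 3 Messages API body with one slide image and the QA prompt."""
    return json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": base64.b64encode(image_bytes).decode("utf-8"),
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )


def answer_question(prompt: str, image_path: str, region: str = "us-east-1") -> str:
    """Hypothetical helper: sends the request and extracts the text answer."""
    import boto3  # assumed: AWS credentials configured for Amazon Bedrock

    client = boto3.client("bedrock-runtime", region_name=region)
    with open(image_path, "rb") as f:
        body = build_claude_image_request(prompt, f.read())
    response = client.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```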

A response for each question in the dataset is recorded in JSON format and compared to the ground truth provided by SlideVQA.

This approach retrieved a response for 78% of the questions on a dataset of 100 questions, achieving a 50% accuracy on the final responses.
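The post doesn’t specify the exact matching logic; a minimal scorer, under the assumption of normalized exact matching against the SlideVQA ground truth, could look like this:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, trim, and strip punctuation so trivial differences don't count."""
    return text.lower().strip().translate(str.maketrans("", "", string.punctuation))


def score(predictions: dict[str, str], ground_truth: dict[str, str]) -> tuple[float, float]:
    """Return (retrieval rate, accuracy) over the sample dataset.

    A prediction of "no answer" counts as no response retrieved.
    """
    total = len(predictions)
    answered = sum(1 for a in predictions.values() if normalize(a) != "no answer")
    correct = sum(
        1 for q, a in predictions.items() if normalize(a) == normalize(ground_truth[q])
    )
    return answered / total, correct / total
```

For example, a prediction of "Reached saturation." matches the ground truth "reached saturation" after normalization.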

Approach 2: Infer first, embed later

Slide decks are converted into PDF images, one per slide, and passed to Anthropic’s Claude 3 Sonnet to generate a text description. The description is sent to the Amazon Titan Text Embeddings model to generate vector embeddings with 1,536 dimensions. The embeddings are ingested into an OpenSearch Serverless index using an OpenSearch Ingestion pipeline.
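The description-to-embedding step of this approach can be sketched as follows; the helper names are assumptions, and the description itself is produced by Anthropic’s Claude 3 Sonnet in a prior call:

```python
import json


def build_titan_text_request(text: str) -> str:
    """Request body for the Amazon Titan Text Embeddings model (1,536-dim output)."""
    return json.dumps({"inputText": text})


def embed_description(description: str, region: str = "us-east-1") -> list[float]:
    """Hypothetical helper: embeds the Claude-generated slide description."""
    import boto3  # assumed: AWS credentials configured for Amazon Bedrock

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=build_titan_text_request(description),
    )
    return json.loads(response["body"].read())["embedding"]
```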

Each question is converted into embeddings using the Amazon Titan Text Embeddings model and an OpenSearch vector search is performed using these embeddings. We performed a k-NN search to retrieve the most relevant embedding matching the question. The metadata of the response from the OpenSearch index contains the image description corresponding to the most relevant slide.

We create a prompt with the question and image description and pass it to Anthropic’s Claude 3 Sonnet to receive a precise answer. The following is the prompt template:

Human: Your role is to provide a precise answer to the question in the <question></question> tags. Search the summary provided in the <summary></summary> tags to answer the question. Retrieve the most accurate answer in as few words as possible. Do not make up an answer. For questions that ask for numbers, follow the instructions below in the <instructions></instructions> tags. Skip the preamble and provide only the exact precise answer.

If the summary does not contain the answer to the question below, then respond with two words only - "no answer".

Refer to the question, summary, and instructions below:

<question>
{question}
</question>

<summary>
{summary}
</summary>

<instructions>
1. Search for relevant data and numbers in the summary.

2. If the summary does not provide a direct answer to the user question, just say "no answer". Do not add statements like "The summary does not specify..." and "I do not have enough information...", instead just respond with "no answer".

3. Do not add any tags in your answer.

4. Scan for the direct answer to the user question. If there is more than one direct answer, give everything that seems like a valid answer to the question in your response.

</instructions>

If the summary does not contain the answer, or if the summary does not directly answer the user question, do not respond with "The summary does not provide..." or anything similar. In this case, your response should always be "no answer" and nothing else.

Assistant: Here is my response to the question. I will give a direct and precise answer to the question if I find it and if not, I will say "no answer":

The response for each question in the dataset is recorded in JSON format for ease of comparison. The response is compared to the ground truth provided by SlideVQA.

This approach retrieved a response for 75% of the questions in the 100-question sample dataset, achieving 44% accuracy on the final responses.

Analysis of results

In our testing, both approaches produced 50% or less matching results on the questions in the sample dataset. The sample dataset contains a random selection of slide decks covering a wide variety of topics, including retail, healthcare, academic, technology, personal, and travel. Therefore, for a generic question like, “What are examples of tools that can be used?” which lacks additional context, the nearest match could retrieve responses from a variety of topics, leading to inaccurate results, especially when all embeddings are ingested into the same OpenSearch index. Techniques such as hybrid search, pre-filtering based on metadata, and reranking can improve retrieval accuracy.

One of the solutions is to retrieve more results (increase the k value) and reorder them to keep the most relevant ones; this technique is called reranking. We share additional ways to improve the accuracy of the results later in this post.
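As a toy illustration of reranking, the snippet below reorders a larger set of retrieved hits by word overlap with the question; in practice a cross-encoder or LLM-based reranker would replace the overlap score, and the `description` field name is an assumption:

```python
def rerank(question: str, hits: list[dict], top_n: int = 1) -> list[dict]:
    """Reorder k retrieved hits so the most relevant come first.

    Toy scoring only (lexical word overlap with the question); a real
    reranker would use a cross-encoder or an LLM to score each hit.
    """
    q_words = set(question.lower().split())

    def overlap(hit: dict) -> int:
        return len(q_words & set(hit["description"].lower().split()))

    return sorted(hits, key=overlap, reverse=True)[:top_n]
```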

The final prompts to Anthropic’s Claude 3 Sonnet in our analysis included instructions to provide a concise answer in as few words as possible to be able to compare with the ground truth. Your responses will depend on your prompts to the LLM.

Pricing

Pricing depends on the modality, provider, and model used. For more details, refer to Amazon Bedrock pricing. We use the On-Demand and Batch pricing modes in our analysis, which allow you to use FMs on a pay-as-you-go basis without having to make time-based term commitments. For text-generation models, you are charged for every input token processed and every output token generated. For embeddings models, you are charged for every input token processed.

The following tables show the price per question for each approach. We calculated the average number of input and output tokens based on our sample dataset for the us-east-1 AWS Region; pricing may vary based on your datasets and Region used.

You can use the following tables for guidance. Refer to the Amazon Bedrock pricing website for additional information.

Approach 1

| Model | Description | Price per 1,000 Input Tokens / per Input Image | Input Tokens | Input Price | Price per 1,000 Output Tokens | Output Tokens | Output Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Titan Multimodal Embeddings | Slide/image embedding | $0.00006 | 1 | $0.00000006 | $0.000 | 0 | $0.00000 |
| Amazon Titan Multimodal Embeddings | Question embedding | $0.00080 | 20 | $0.00001600 | $0.000 | 0 | $0.00000 |
| Anthropic’s Claude 3 Sonnet | Final response | $0.00300 | 700 | $0.00210000 | $0.015 | 8 | $0.00012 |
| Cost per input/output | | | | $0.00211606 | | | $0.00012 |
| Total cost per question | | | | | | | $0.00224 |
Approach 2

| Model | Description | Price per 1,000 Input Tokens / per Input Image | Input Tokens | Input Price | Price per 1,000 Output Tokens | Output Tokens | Output Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anthropic’s Claude 3 Sonnet | Slide/image description | $0.00300 | 4523 | $0.01356900 | $0.015 | 350 | $0.00525 |
| Amazon Titan Text Embeddings | Slide/image description embedding | $0.00010 | 350 | $0.00003500 | $0.000 | 0 | $0.00000 |
| Amazon Titan Text Embeddings | Question embedding | $0.00010 | 20 | $0.00000200 | $0.000 | 0 | $0.00000 |
| Anthropic’s Claude 3 Sonnet | Final response | $0.00300 | 700 | $0.00210000 | $0.015 | 8 | $0.00012 |
| Cost per input/output | | | | $0.01570600 | | | $0.00537 |
| Total cost per question | | | | | | | $0.02108 |
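The per-question totals can be reproduced with a short calculation. Each line item is (price per 1,000 input tokens, input tokens, price per 1,000 output tokens, output tokens); note that the tables apply the per-1,000 rate to the image-embedding row as well:

```python
def cost_per_question(line_items: list[tuple[float, int, float, int]]) -> float:
    """Sum input and output cost across all model calls for one question."""
    return sum(
        p_in * n_in / 1000 + p_out * n_out / 1000
        for p_in, n_in, p_out, n_out in line_items
    )


approach_1 = [
    (0.00006, 1, 0.000, 0),    # Titan Multimodal Embeddings: slide/image embedding
    (0.00080, 20, 0.000, 0),   # Titan Multimodal Embeddings: question embedding
    (0.00300, 700, 0.015, 8),  # Claude 3 Sonnet: final response
]

approach_2 = [
    (0.00300, 4523, 0.015, 350),  # Claude 3 Sonnet: slide/image description
    (0.00010, 350, 0.000, 0),     # Titan Text Embeddings: description embedding
    (0.00010, 20, 0.000, 0),      # Titan Text Embeddings: question embedding
    (0.00300, 700, 0.015, 8),     # Claude 3 Sonnet: final response
]

print(round(cost_per_question(approach_1), 5))  # 0.00224
print(round(cost_per_question(approach_2), 5))  # 0.02108
```

The bulk of the Approach 2 cost comes from generating a text description for every slide, which is why it is roughly an order of magnitude more expensive per question.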

Clean up

To avoid incurring charges, delete any resources from Parts 1 and 2 of the solution. You can do this by deleting the stacks using the AWS CloudFormation console.

Conclusion

In Parts 1 and 2 of this series, we explored ways to use the power of multimodal FMs such as Amazon Titan Multimodal Embeddings, Amazon Titan Text Embeddings, and Anthropic’s Claude 3 Sonnet. In this post, we compared the approaches from an accuracy and pricing perspective.

Code for all parts of the series is available in the GitHub repo. We encourage you to deploy both approaches and explore different Anthropic Claude models available on Amazon Bedrock. You can discover new information and uncover new perspectives using your organization’s slide content with either approach. Compare the two approaches to identify a better workflow for your slide decks.

With generative AI rapidly developing, there are several ways to improve the results and approach the problem. We are exploring performing a hybrid search and adding search filters by extracting entities from the question to improve the results. Part 4 in this series will explore these concepts in detail.

Portions of this code are released under the Apache 2.0 License.

Resources

[1] Tanaka, Ryota; Nishida, Kyosuke; Nishida, Kosuke; Hasegawa, Taku; Saito, Itsumi; Saito, Kuniko (2023). “SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images.” Proceedings of the AAAI Conference on Artificial Intelligence, 37, 13636–13645. https://doi.org/10.1609/aaai.v37i11.26598


About the Authors

Archana Inapudi is a Senior Solutions Architect at AWS, supporting a strategic customer. She has over a decade of cross-industry expertise leading strategic technical initiatives. Archana is an aspiring member of the AI/ML technical field community at AWS. Prior to joining AWS, Archana led a migration from traditional siloed data sources to Hadoop at a healthcare company. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes.

Manju Prasad is a Senior Solutions Architect at Amazon Web Services. She focuses on providing technical guidance in a variety of technical domains, including AI/ML. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup. She has worked in all layers of the software stack, ranging from webdev to databases, and has experience in all levels of the software development lifecycle. She is passionate about sharing knowledge and fostering interest in emerging talent.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital-centered customers.
