GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases


The Global Health Drug Discovery Institute (GHDDI) and Microsoft Research recently achieved significant progress in accelerating drug discovery for the treatment of global infectious diseases. Working in close collaboration, the joint team successfully used generative AI and foundation models to design several small molecule inhibitors for essential target proteins of Mycobacterium tuberculosis and coronaviruses. These new inhibitors show outstanding bioactivities, comparable to or surpassing the best-known lead compounds.

This breakthrough is a testament to the team’s combined efforts in generative AI, molecular physicochemical modeling, and iterative feedback loops between scientists and AI technologies. Normally, the discovery and in vitro confirmation of such molecules could take up to several years, but with the acceleration of AI, the joint team achieved these new results in just five months. This research also shows the tremendous potential of AI for helping scientists discover or create the building blocks needed to develop effective treatments for infectious diseases that continue to threaten the health and lives of people around the world.

Since 2019, for example, there have been more than 772 million confirmed cases of COVID-19 worldwide and nearly 7 million deaths from the virus, according to the World Health Organization (WHO), the U.S. Centers for Disease Control and Prevention, and various other sources. Although vaccines have reduced the incidence and deadliness of the disease, the coronavirus continues to mutate and evolve, making it a serious ongoing threat to global health. Meanwhile, the WHO reports that tuberculosis continues to be a leading cause of death among infectious diseases, second only to COVID-19 in 2022, when 10.6 million people worldwide fell ill with TB and the disease killed 1.3 million (the most recent figures currently available).

Laying the foundation for new infectious disease treatments

Microsoft Research has rich experience in developing and pre-training large AI models specialized for proteins and molecules, demonstrated in both property prediction and molecular generation. Building on this experience, Microsoft Research developed and maintains an AI model for molecule generation tailored to specific protein targets. The generated compounds were virtually screened and further optimized by data scientists and medicinal chemists from GHDDI, followed by compound synthesis and wet-lab experiments to quantify bioactivities. The experimental results were then fed back to the research team at Microsoft for AI model improvement and new compound generation.

This integrated AI-expert-experiment pipeline enabled the generation of novel compounds for protein targets in Mycobacterium tuberculosis and the coronavirus SARS-CoV-2. In less than five months, the joint team designed several chemical compounds that are effective in inhibiting these pathogens’ essential target proteins, accelerating the structure-based drug discovery process.

Figure 1. Two potential inhibitor compounds, generated by the joint team’s method, for ClpP of Mycobacterium tuberculosis.

Figure 2. Dose-response curves of the compounds generated for coronavirus, with GRL0617 as the reference compound, demonstrating enhanced bioactivity. Most recently, the joint team optimized the IC50 to 0.18 µM, approximately an eight-fold improvement over GRL0617.

One distinguishing feature of AI-generated molecules is their novel scaffold structures, which are important because they create the potential for these molecules to be developed into a new class of drug candidates. These novel structures offer the possibility of more effective treatments, and also help to address the escalating challenge of antimicrobial resistance (AMR), a major hurdle in treating infectious diseases like tuberculosis and COVID-19.

“In the current landscape of scientific research, we encounter unparalleled challenges but also have unprecedented opportunities,” said Dr. Sheng Ding, institute director of GHDDI. “Innovation stands as the central catalyst for scientific advancement and a crucial element in addressing global health challenges. I’m excited about our collaboration with Microsoft Research and gratified with the progress we’ve jointly achieved. Without a doubt, our combined efforts will enhance R&D efficiency and expedite the process of drug discovery.”

“This represents a collaboration that transcends disciplines and boundaries,” he noted. “Our combined strengths will advance pharmaceutical research, paving new avenues in scientific exploration. Going forward, we anticipate deploying such cutting-edge technologies in uncharted realms of life sciences. This will enable us to offer more comprehensive, profound, and practical solutions for global health issues.”



Using AI to improve global health

Embracing the principle of open innovation, the collaboration between GHDDI and Microsoft Research is dedicated to harnessing AI technology to expedite drug discovery. The goal is to contribute to global health equity through the development of lifesaving medications and the prompt delivery of safer and more effective drug solutions that are accessible to everyone.  The collaboration focuses on infectious diseases that pose a threat to global health, including but not limited to tuberculosis, viral infections, and malaria. Both parties are committed to a deep integration of generative AI, foundational models, high-throughput virtual screening, and expert knowledge to tackle these challenges.

“Successful AI-driven drug discovery necessitates a tight-knit collaboration between AI specialists and medicinal experts,” said Dr. Tie-Yan Liu, distinguished scientist at Microsoft Research AI4Science. “In recent years, our globally recognized team at Microsoft Research has been deeply engaged in interdisciplinary research between AI and natural science. To complement this, GHDDI experts bring to the table a wealth of industry experience and profound domain knowledge. Their experimental facilities not only allow for testing but also help provide invaluable feedback for training AI models. Because of our close collaboration, we look forward to producing groundbreaking research outcomes with the potential to redefine the future of healthcare through AI technology innovation.”

Accelerating drug discovery

Commenting on the research into Mycobacterium tuberculosis and coronaviruses, Dr. Rumin Zhang, chief scientific officer at GHDDI, noted that the collaborative team’s application of AI technology considerably shortened the traditionally lengthy drug discovery process. The team was able to design and validate highly effective small molecule inhibitors for the pathogens in just five months.

“This is an exceptional accomplishment that underscores the immense potential of AI in efficient de novo drug design. It also vividly illustrates the team’s exceptional innovative capacity and professional prowess,” he said. “We are excited about this innovative R&D strategy leading to more groundbreaking advancements in a broader spectrum of future drug discovery projects.”

“This work is all about pushing the boundaries of AI technology for application in new drug R&D,” said Dr. Tao Qin, senior principal researcher at Microsoft Research AI4Science. “We aim to leverage AI innovations to enhance human health, tackle worldwide health issues, and ensure the advantages of AI technology are accessible to all.”

“We plan to intensify and broaden our collaboration, further advancing the use of AI technology in the realm of life sciences,” said Dr. Jinjiang Guo, head of the Data Science Department at GHDDI. “This will yield novel insights that will enrich researchers’ understanding of mechanisms underlying diseases and life, thus paving the way for the development of innovative treatment strategies and providing more effective solutions for diseases that have long affected human health. We are highly optimistic about the potential of this collaboration and are confident that it will have a substantial impact on the future of the healthcare field.”

Next steps

In the next phase, Microsoft Research and GHDDI will collaborate to optimize the discovered hit compounds, enhance ADMET properties, progress toward preclinical studies, and initiate a broader range of drug-discovery projects.



Māori Speech AI Model Helps Preserve and Promote New Zealand Indigenous Language


Indigenous languages are under threat. Some 3,000 — three-quarters of the total — could disappear before the end of the century, or one every two weeks, according to UNESCO.

As part of a movement to protect such languages, New Zealand’s Te Hiku Media, a broadcaster focused on the Māori people’s indigenous language known as te reo, is using trustworthy AI to help preserve and revitalize the tongue.

Using ethical, transparent methods of speech data collection and analysis to maintain data sovereignty for the Māori people, Te Hiku Media is developing automatic speech recognition (ASR) models for te reo, which is a Polynesian language.

Built using the open-source NVIDIA NeMo toolkit for ASR and NVIDIA A100 Tensor Core GPUs, the speech-to-text models transcribe te reo with 92% accuracy. They can also transcribe bilingual speech using English and te reo with 82% accuracy. They’re pivotal tools, made by and for the Māori people, that are helping preserve and amplify their stories.

“There’s immense value in using NVIDIA’s open-source technologies to build the tools we need to ultimately achieve our mission, which is the preservation, promotion and revitalization of te reo Māori,” said Keoni Mahelona, chief technology officer at Te Hiku Media, who leads a team of data scientists and developers, as well as Māori language experts and data curators, working on the project.

“We’re also helping guide the industry on ethical ways of using data and technologies to ensure they’re used for the empowerment of marginalized communities,” added Mahelona, a Native Hawaiian now living in New Zealand.

Building a ‘House of Speech’

Te Hiku Media began more than three decades ago as a radio station aiming to ensure te reo had space on the airwaves. Over the years, the organization incorporated television broadcasting and, with the rise of the internet, it convened a meeting in 2013 with the community’s elders to form a strategy for sharing content in the digital era.

“The elders agreed that we should make the stories accessible online for our community members — rather than just keeping our archives on cassettes in boxes — but once we had that objective, the challenge was how to do this correctly, in alignment with our strong roots in valuing sovereignty,” Mahelona said.

Instead of uploading its video and audio sources to popular, global platforms — which, in their terms and conditions of use, require signing over certain rights related to the content — Te Hiku Media decided to build its own content distribution platform.

Called Whare Kōrero — meaning “house of speech” — the platform now holds more than 30 years’ worth of digitized, archival material featuring about 1,000 hours of te reo native speakers, some of whom were born in the late 19th century, as well as more recent content from second-language learners and bilingual Māori people.

Now, around 20 Māori radio stations use and upload their content to Whare Kōrero. Community members can access the content through an app.

“It’s an invaluable resource of acoustic data,” Mahelona said.

Turning to Trustworthy AI

Such a trove held incredible value for those working to revitalize the language, the Te Hiku Media team quickly realized, but manual transcription required pulling lots of time and effort from limited resources. So began the organization’s trustworthy AI efforts, in 2016, to accelerate its work using ASR.

“No one would have a clue that there are eight NVIDIA A100 GPUs in our derelict, rundown, musty-smelling building in the far north of New Zealand — training and building Māori language models,” Mahelona said. “But the work has been game-changing for us.”

To collect speech data in a transparent, ethically compliant, community-oriented way, Te Hiku Media began by explaining its cause to elders, garnering their support and asking them to come to the station to read phrases aloud.

“It was really important that we had the support of the elders and that we recorded their voices, because that’s the sort of content we want to transcribe,” Mahelona said. “But eventually these efforts didn’t scale — we needed second-language learners, kids, middle-aged people and a lot more speech data in general.”

So, the organization ran a crowdsourcing campaign, Kōrero Māori, to collect labeled speech samples under the Kaitiakitanga license, which ensures Te Hiku Media uses the data only for the benefit of the Māori people.

In just 10 days, more than 2,500 people signed up to read 200,000+ phrases, providing over 300 hours of labeled speech data, which was used to build and train the te reo Māori ASR models.

In addition to other open-source trustworthy AI tools, Te Hiku Media now uses the NVIDIA NeMo toolkit’s ASR module for speech AI throughout its entire pipeline. The NeMo toolkit comprises building blocks called neural modules and includes pretrained models for language model development.
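For readers unfamiliar with NeMo, loading a pretrained ASR model and transcribing audio takes only a few lines. The sketch below is generic: the English checkpoint named here is a public stand-in, not one of Te Hiku Media’s own te reo models, which are trained on community data under the Kaitiakitanga license.

```python
# A minimal NeMo ASR sketch (assumes: pip install "nemo_toolkit[asr]", a CUDA
# GPU, and a local WAV file; the English checkpoint is a placeholder for
# Te Hiku Media's own te reo Māori models).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="stt_en_conformer_ctc_small"
)
transcripts = asr_model.transcribe(["speech_sample.wav"])
print(transcripts[0])
```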

“It’s been absolutely amazing — NVIDIA’s open-source NeMo enabled our ASR models to be bilingual and added automatic punctuation to our transcriptions,” Mahelona said.

Te Hiku Media’s ASR models are the engines running behind Kaituhi, a te reo Māori transcription service now available online.

The efforts have spurred similar ASR projects now underway by Native Hawaiians and the Mohawk people in southeastern Canada.

“It’s indigenous-led work in trustworthy AI that’s inspiring other indigenous groups to think: ‘If they can do it, we can do it, too,’” Mahelona said.

Learn more about NVIDIA-powered trustworthy AI, the NVIDIA NeMo toolkit and how it enabled a Telugu language speech AI breakthrough.


Starstruck: 3D Artist Brellias Brings Curiosity to Light This Week ‘In the NVIDIA Studio’


Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

Curiosity leads the way for this week’s featured In the NVIDIA Studio 3D artist, Brellias.

It’s what inspired the native Chilean’s latest artwork Estrellitas, which in English translates to “little stars.” The scene expresses the mixture of emotions that comes with curiosity, depicting a young girl holding little stars in her hand with a conflicted expression.

“She’s excited to learn about them, but she’s also a little scared,” Brellias explained.

The striking visual piece, rich with vibrant colors and expertly executed textures, underscores that while curiosity can invoke various emotions — both joyful and painful — it is always a source of change and growth.

A Sky Full of Stars

To start, Brellias visualized and reworked an existing 3D scene of a woman in Blender. He used Blender’s built-in Multiresolution modifier for sculpting and added shape keys to achieve the desired modifications.

He also created a custom shader for the character’s skin — a stylistic choice to lend its appearance a galactic hue.

Brellias is an especially big fan of purple, blue and maroon hues.

Next, Brellias tapped Blender’s OptiX GPU-accelerated viewport denoising, powered by his GeForce RTX GPU.

“The technology helps reduce noise and improve the quality of the viewport image more quickly, allowing me to make decisions and iterate on the render faster,” he said.

Out-of-this-world levels of detail.

Next, Brellias animated the scene using a base model from Daz Studio, a free media design software developed by Daz 3D. Daz features an AI denoiser for high-performance interactive rendering that can also be accelerated by RTX GPUs.

In addition, rig tools in Blender made the animation process easy, eliminating the need to modify file formats.

 

To animate the character’s face, Brellias tied drivers to shape keys using empties, enabling greater fluidity and control over facial expressions.
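In Blender’s Python API, that wiring looks roughly like the sketch below. The object, shape key, and empty names are hypothetical stand-ins, not Brellias’ actual rig.

```python
import bpy

# Drive the "Smile" shape key from the Z location of a control empty, so
# sliding the empty in the viewport animates the facial expression.
face = bpy.data.objects["Face"]
fcurve = face.data.shape_keys.key_blocks["Smile"].driver_add("value")
driver = fcurve.driver
driver.type = 'SCRIPTED'
var = driver.variables.new()        # defaults to a single-property variable
var.name = "ctrl"
var.targets[0].id = bpy.data.objects["Ctrl_Smile"]  # the control empty
var.targets[0].data_path = "location[2]"            # its Z location
driver.expression = "ctrl"
```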

Geometry nodes bring “Estrellitas” to life.

Brellias then used geometry nodes in Blender to animate the character’s hair, giving it a magical floating effect. To light the scene, Brellias added some ambient light behind the character’s face and between its hands. His RTX GPU accelerated OptiX ray tracing in Blender’s Cycles for the fastest final-frame renders.

 

Finally, he moved to Blackmagic Design’s DaVinci Resolve to denoise and deflicker the scene for the smoothest-looking animation.

Here, Brellias’ RTX GPU accelerated the color grading, video editing and color scoping processes, dramatically speeding his creative workflow. Other RTX-accelerated AI features, including facial recognition for automatically tagging clips and the tracking of effects, were available for his use.

 

Estrellitas was partially inspired by Brellias’ own curiosity in exploring NVIDIA and GeForce RTX GPU technologies to power content creation workflows — a venture that provided rewarding results.

“Every step of my creative process involves GPU acceleration or AI in some way or another,” said Brellias. “I can’t imagine creating without a powerful GPU at my disposal.”

His curiosity in AI extends to productivity. He recently installed the NVIDIA Broadcast app, which can transform any room into a home studio.

The app has enhanced Brellias’ microphone performance by canceling external noise and echo — especially useful given his urban surroundings.

Download the Broadcast beta and explore the rest of the Studio suite of apps, including Canvas, which uses AI to turn simple brushstrokes into realistic landscape images, and RTX Remix, which allows modders to create AI-powered RTX remasters of classic games. The apps are all free for RTX GPU owners.

Digital 3D artist Brellias.

Check out Brellias’ portfolio on Instagram.

Follow NVIDIA Studio on Instagram, X and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 


Accelerating Triton Dequantization Kernels for GPTQ


TL;DR

Leveraging a first-principles approach, we showcase the step-by-step process we undertook to accelerate the current Triton GPTQ kernels by 3x (core GPTQ) and 6x (AutoGPTQ) — for example, from 275us to 47us on a typical Llama-style inference input. The goal is to provide a helpful template for accelerating any given Triton kernel. We provide background on Triton and the GPTQ quantization and dequantization process, showcase the impact of coalesced memory access on shared and global memory throughput, highlight changes made to reduce warp stalling and improve total throughput, and give an overview of integrating Triton kernels into PyTorch code. Longer term, we hope to surpass the existing CUDA native GPTQ kernel with our Triton kernel.

Fig 1: Performance benchmarking the optimized AutoGPTQ kernel vs the current AutoGPTQ kernel on H100

Fig 2: Performance benchmarking the newly optimized AutoGPTQ kernel vs the current AutoGPTQ kernel on A100

Fig 3: Even with these improvements, there remains a gap between our optimized Triton kernel and the CUDA native AutoGPTQ kernel on A100. More to come…

1.0 Introduction to Triton

The Triton framework provides a hardware-agnostic way of programming and targeting GPUs, currently supporting both NVIDIA and AMD, with support for additional hardware vendors in progress. Triton is now a mainstay of PyTorch 2.0, as torch.compile decomposes eager PyTorch and reassembles a high percentage of it into Triton kernels with PyTorch connecting code.

As Triton becomes more widely adopted, it will be essential that programmers understand how to systematically step through the Triton stack (from the high level Python down to the low-level SASS) to address performance bottlenecks in order to optimize GPU efficiency for algorithms that go beyond torch.compile generated kernels.

In this post, we will introduce some core concepts of the Triton programming language, how to identify common performance limiters in GPU kernels, and in parallel, tune a quantization kernel used in AutoGPTQ that can be used for high throughput inference applications.

Intro to GPTQ Quantization and Dequantization

GPTQ is a quantization algorithm that is able to compress ultra-large (175B+) LLMs efficiently to a 4-bit integer representation, via approximate second-order information (the Hessian inverse). AutoGPTQ is a framework built on GPTQ, allowing for rapid dequantization and inference/serving of LLMs that have been quantized with GPTQ.

As part of the AutoGPTQ stack, they provide a Triton GPTQ kernel to handle the dequantization of a model for inference.

The basic process for INT quantization involves determining the scale and zero point, then computing the quantized 4-bit weights using that scale and zero point, as sketched below.

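A minimal PyTorch sketch of this standard affine scheme (shown per-tensor for brevity; GPTQ applies it per group of weights):

```python
import torch

def quantize_int4(w: torch.Tensor):
    # Affine INT4 quantization: map float weights onto the 16 levels [0, 15].
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 15.0        # 15 = 2**4 - 1 quantization steps
    zero = torch.round(-w_min / scale)    # the zero point: where 0.0 lands
    w_q = torch.clamp(torch.round(w / scale) + zero, 0, 15).to(torch.int32)
    return w_q, scale, zero
```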

We thus store the 4-bit weights along with the scale and zero-point metadata for each group of weights.

To ‘dequant’ these weights, we do the following:

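Continuing the sketch above, the inverse operation:

```python
def dequantize_int4(w_q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
    # Recover approximate weights: remove the zero point, rescale, cast to FP16.
    return ((w_q - zero) * scale).to(torch.float16)
```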

We then matrix-multiply the dequantized weights with the dense input feature matrix for this linear layer.

2.0 Identify the Bottlenecks – Optimizing Matrix Multiplication

As it turns out, making a fast matrix multiplication kernel is not trivial. A naively implemented matrix multiply will rarely reach peak throughput performance on highly parallel machines like GPUs. So we need to tackle the compute and memory subsystems of our GPU in a hierarchical fashion to make sure we are maximally utilizing each resource.

We start our optimization process by running the unoptimized Triton kernel through the Nvidia Nsight Compute tool and taking note of some important metrics and warnings.


We notice first that both compute and memory throughput are low: 7.40% and 21.19%, respectively. Knowing that for typical inference matrix problem sizes we are in the memory-bound regime, we will attempt to optimize the kernel by applying code changes that target the memory subsystem of our A100 GPU.

The three topics this post will cover are:

  1. L2 Optimization
  2. Vectorized Load
  3. Warp Stalling

Let’s walk through each topic, make the appropriate changes, and see the corresponding impact on our Triton kernel. This Triton kernel is a fused dequantization kernel: it unpacks a packed int32 weight tensor (we will refer to this as the B matrix) into int4 weights, dequantizes them, performs matrix multiplication with the activation tensor (the A matrix) in FP16, and then stores the results to a matrix C.

The above is referred to as W4A16 quantization. Keep in mind that the process we describe can and should be used for the development of any GPU kernel, as these are common bottlenecks in any unoptimized kernel.

3.0 L2 Optimization

This optimization already exists in the AutoGPTQ kernel, but we’d like to dedicate a section to it to help readers better understand how the mapping and execution order of thread blocks is handled in Triton. We will step through a naive mapping and then a more optimal mapping to see the corresponding impact.

Let’s build up our kernel naively, starting with a “linear” load from global memory, and then compare it to a more optimized “swizzled” load. Linear vs. swizzled determines the execution order of our grid of work on the GPU. In the naive case, the Nvidia Nsight Compute tool flags our kernel’s shared memory access pattern as a problem.


To tackle this issue we can use an approach referred to as “tile-swizzling.” The idea of this method is to launch our thread blocks in a more L2 cache friendly order.

Let’s take a step back, familiarize ourselves with some Triton semantics, and make a simple CUDA analogy to understand the concept better. Triton kernels launch “programs”. These so-called programs map to the concept of a thread block in CUDA, and a program is the basic unit of parallelism in a Triton kernel. Every program has an associated “pid”, and all the threads in a program are guaranteed to be executing the same instruction.

The Triton programs will be distributed onto your SMs in a naive way if you do a simple linear mapping of “pid” to a 2D grid location of your output matrix C.

This 2D grid location is determined by pid_m and pid_n in Triton. We would like to exploit data and cache locality in the L2 cache of our GPU when we distribute our grid of work. To do this in Triton, we can make the following changes:

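Here is a sketch of both orderings, following the grouped-ordering pattern from the standard Triton matmul tutorial (this is not the exact AutoGPTQ code; it sits at the top of the @triton.jit kernel, and GROUP_SIZE_M and the block sizes are constexpr parameters):

```python
pid = tl.program_id(axis=0)
num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)   # grid height, in output tiles
num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)   # grid width, in output tiles

# Naive "linear" ordering: consecutive pids sweep an entire row of C.
# pid_m = pid // num_pid_n
# pid_n = pid % num_pid_n

# "Swizzled" ordering: walk GROUP_SIZE_M rows of tiles at a time, so tiles
# that reuse the same A rows and B columns are scheduled together in L2.
num_pid_in_group = GROUP_SIZE_M * num_pid_n
group_id = pid // num_pid_in_group
first_pid_m = group_id * GROUP_SIZE_M
group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
pid_m = first_pid_m + (pid % group_size_m)
pid_n = (pid % num_pid_in_group) // group_size_m
```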

The commented-out lines in the sketch show the naive “linear” tile ordering, and the grouped remapping that follows is the “swizzled” tile ordering. This type of launch promotes a sense of locality.


After incorporating this change, the profiler no longer complains about uncoalesced memory accesses. Let’s take a look at how our memory throughput has changed.


This change was tested on a simple load-store kernel. Looking at the GPU Speed Of Light statistics section in the profiler, we see a 112.07% increase in the memory throughput of the simple load kernel, which is what we were after with this optimization. Again, this optimization already exists in the AutoGPTQ kernel, but it is the boilerplate logic that every Triton kernel programmer has to write at the beginning of their kernel, before any of the exciting dequantization or matrix multiply logic. It is thus important to understand that:

  1. This mapping is not unique

  2. Triton does not automatically handle this kind of optimization for the programmer, and careful thought must be given to ensure your kernel is optimally handling shared memory accesses

These points are not obvious to those new to Triton, as much of the shared memory access optimization is handled by the Triton compiler. However, in the cases where it is not, it is important to understand what tools and methods are available to influence memory behavior.

4.0 Vectorized Load

Now, back to the original complaints about our unoptimized kernel. We want to optimize the global memory access pattern of our kernel. On the details page of the Nvidia Nsight Compute tool, the profiler complains about uncoalesced global memory accesses.

Let’s dig deeper and take a look at the SASS (assembly) code for an unoptimized memory read.


This load operation resulted in 32 global load operations, each 16 bits wide. This is not optimal.

We would like to do our global memory loads in a vectorized way so that they use the fewest possible load instructions. To achieve this, we can give the Triton compiler some help.

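A hedged sketch of the hint follows. The pointer arithmetic in the real kernel differs, and offs_am, the strides, and the block sizes are assumed kernel parameters; tl.multiple_of and tl.max_contiguous are the actual Triton intrinsics.

```python
# Inside the @triton.jit kernel: assert to the compiler that each block of
# k-offsets is contiguous and aligned, so the narrow 16-bit loads can be
# fused into wide (128-bit) vectorized load instructions.
offs_k = tl.arange(0, BLOCK_SIZE_K)
offs_k = tl.max_contiguous(tl.multiple_of(offs_k, BLOCK_SIZE_K), BLOCK_SIZE_K)
a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
a = tl.load(a_ptrs)   # now eligible for coalesced, vectorized global loads
```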

The tl.multiple_of and tl.max_contiguous calls act as compiler hints: they tell the compiler that these elements are contiguous in memory and that the load operation can be coalesced.

Let’s see the effect in assembly after adding these lines.


The load is now performed in 4 global load operations, each 128 bits wide, instead of 32 16-bit global load operations. This means 28 fewer memory fetch instructions and, importantly, a coalesced memory access. We can see this from the fact that a single thread is no longer accessing consecutive memory addresses, which, without the compiler hint, was the behavior.

The resulting effect is a 73x speedup in the isolated load operation, and after incorporating it into the full dequantization kernel we saw another 6% speedup. Another step in the right direction!

5.0 Warp Stalling


Now putting all the changes back into our full dequantization kernel, we see the following performance limiter, warp stalling.

These warp stalls are mostly caused by ‘Long Scoreboard’ stalls, accounting for 92.63% of the total.

At a high level, long scoreboard stalls happen when a warp requires data that is not yet ready, preventing the warp from reaching the “issued” state. In other words, GPUs are throughput machines, and we need to hide the latency of load instructions with compute instructions. By loading more data and rearranging where the load instructions sit in the kernel, we can take care of this problem.

In an ideal scenario, each warp scheduler would be able to issue 1 instruction every clock cycle. Note – Every SM on an A100 GPU has 4 warp schedulers.

However, our kernel has bottlenecks and is spending 4.4 cycles in the stall state with the block size that the AutoGPTQ Triton kernel deems optimal, given its presets.

How do we improve this?

We want to increase our memory throughput so that when a warp issues an instruction, we won’t be waiting for loads to land in SRAM before they can be used for computation. We played around with multiple parameters (such as the number of pipeline stages and the number of warps), and the change with the biggest impact was increasing the block size by a factor of 2 in the k dimension.
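As a hedged illustration (all numeric values here are illustrative, not the actual AutoGPTQ presets), this is a triton.Config change rather than a kernel-code change:

```python
import triton

# Baseline-style preset vs. the tweak described above: doubling BLOCK_SIZE_K
# gives each program more loads in flight to hide long-scoreboard latency.
baseline = triton.Config(
    {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 32},
    num_stages=4, num_warps=4,
)
tuned = triton.Config(
    {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64},
    num_stages=4, num_warps=4,
)
```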

These changes yield an immediate impact on both compute and memory throughput.


We also see the long scoreboard wait time drop significantly at the step where we shift and scale the quantized weights, which is what we identified as the original bottleneck in the source code. While there are still stalls at this line, only 68% of them are caused by long scoreboard stalls, compared to the original 92%. Ideally we would observe no stalls at all, so there is still work to be done here, but the reduction in long scoreboard stalls tells us that the data is now ready in L1TEX memory for the instruction a warp wants to execute at a higher frequency than in the original kernel.


The corresponding impact is a 1.4x speedup in the execution time of our kernel.

6.0 Results

By tackling all these problem areas methodically, our resulting kernel is 6x faster on the Nvidia A100 GPU than the Triton kernel AutoGPTQ provides out of the box.

Taking a relevant Llama inference sample data point, the Triton kernel we’ve developed takes 47us to perform dequantization and matrix multiplication compared to the 275us taken by the AutoGPTQ kernel for the same matrix size.

By replicating this step-by-step approach it should be possible to get similar speedups in other kernels, and help build understanding on common GPU bottlenecks and how to tackle them.

It is important to note that while strides have been made in improving the performance of the AutoGPTQ Triton Kernel, we have still not closed the gap on the current exllamaV2 CUDA kernels found in AutoGPTQ.

More research is required to understand how we can further optimize this kernel to match equivalent custom CUDA kernel performance.

Summary and Future work

Triton extends PyTorch by allowing low level GPU optimizations to be done at a higher level of abstraction than CUDA programming, with the net result that adding optimized Triton kernels can help PyTorch models run faster.

Our goal in this post was to show an example of accelerating the GPTQ dequant kernel and provide a template workflow for how the accelerations were achieved.

For future work, SplitK work decomposition for the matrix multiplication is a potential speed up we’ll investigate.

Integrating custom Triton Kernels into PyTorch

Given the acceleration shown above, a common question is how to actually use a custom kernel in a given PyTorch codebase.

A Triton kernel integration will contain at least two parts. The first is the actual Triton kernel code, which will be compiled by the Triton compiler.

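As a deliberately tiny, hypothetical example (an elementwise scaling kernel, not the dequantization kernel itself), a complete Triton kernel looks like this:

```python
import triton
import triton.language as tl

@triton.jit
def _scale_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each Triton "program" handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)
```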

Along with the actual kernel code is a Python wrapper, which may or may not subclass the PyTorch autograd Function class, depending on whether it needs to support a backward pass (i.e., for training purposes rather than inference only).

You simply import the Python class into your PyTorch code where you want to use it, much like any other Python or PyTorch function.

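Continuing the hypothetical example above, the wrapper and a call site might look like:

```python
import torch

def scale(x: torch.Tensor, s: float) -> torch.Tensor:
    # Python wrapper: allocate the output, define the launch grid, launch.
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    _scale_kernel[grid](x, out, s, n, BLOCK_SIZE=1024)
    return out

y = scale(torch.randn(4096, device="cuda"), 0.5)  # used like any torch op
```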

In the actual codebase, simply importing and then calling ‘fast_qlinear’ in the same way would invoke the underlying Triton kernel, with the speed-ups we’ve shown above applied to your PyTorch model.

Acknowledgements

Thanks to Jamie Yang and Hao Yu from IBM Research for their technical guidance in the collection of these results.


Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization

Recent advances in deep learning and automatic speech recognition have boosted the accuracy of end-to-end speech recognition to a new level. However, recognition of personal content such as contact names remains a challenge. In this work, we present a personalization solution for an end-to-end system based on connectionist temporal classification. Our solution uses a class-based language model, in which a general language model provides modeling of the context for named entity classes, and personal named entities are compiled into a separate finite state transducer. We further introduce a…

Apple Machine Learning Research

NVIDIA CEO: ‘This Year, Every Industry Will Become a Technology Industry’


“This year, every industry will become a technology industry,” NVIDIA founder and CEO Jensen Huang told attendees Wednesday during the annual J.P. Morgan Healthcare Conference.

“You can now recognize and learn the language of almost anything with structure, and you can translate it to anything with structure — so text-protein, protein-text,” Huang said in a fireside chat with Martin Chavez, partner and vice chairman of global investment firm Sixth Street Partners and board chair of Recursion, a biopharmaceutical company. “This is the generative AI revolution.”

The conversation, which took place at the historic San Francisco Mint, followed a presentation at the J.P. Morgan conference Monday by Kimberly Powell, NVIDIA’s VP of healthcare. In her talk, Powell announced that Recursion is the first hosting partner to offer a foundation model through the NVIDIA BioNeMo cloud service, which is advancing into beta this month.

She also said that Amgen, one of the first companies to employ BioNeMo, plans to advance drug discovery with generative AI and NVIDIA DGX SuperPOD — and that BioNeMo is used by a growing number of techbio companies, pharmas, AI software vendors and systems integrators. Among them are Deloitte, Innophore, Insilico Medicine, OneAngstrom, Recursion and Terray Therapeutics.

From Computer-Aided Chip Design to Drug Design

Healthcare customers and partners now consume well over a billion dollars in NVIDIA GPU computing each year — directly and indirectly through cloud partners.

Huang traced NVIDIA’s involvement in accelerated healthcare back to two research projects that caught his attention around 15 years ago: one at Mass General tapped NVIDIA GPUs to reconstruct CT images; another at the University of Illinois Urbana-Champaign applied GPU acceleration to molecular dynamics.

“It opened my mind that we could apply the same methodology that we use in computer-aided chip design to help the world of drug discovery go from computer-aided drug discovery to computer-aided drug design,” he said, realizing that, “if we scale this up by a billion times, we could simulate biology.”

After 40 years of advancements in computer-aided chip design, engineers can now build complex computing systems entirely in simulation, Huang explained. Over the next decade, the same could be true for AI-accelerated drug design.

“Almost everything will largely start in silico, largely end in silico,” he said, using a term that refers to an experiment run on a computer.

Collaborating on the Future of Drug Discovery and Medical Instruments

With the progress made to date, computer-aided drug discovery is “genuinely miraculous,” Huang said.

NVIDIA is propelling the field forward by building state-of-the-art AI models and powerful computing platforms, and by collaborating with domain experts and investing in techbio companies.

“We are determined to work with you to advance this field,” Huang said, inviting healthcare innovators to reach out to NVIDIA. “We deeply believe that this is going to be the future of the way that drugs will be discovered and designed.”

The company’s pipelines for accelerated healthcare include algorithms for cryo-electron microscopy, X-ray crystallography, gene sequencing, amino acid structure prediction and virtual drug molecule screening. And as AI advances, these computing tools are becoming much easier to access, Huang said.

“Because of artificial intelligence and the groundbreaking work that our industry has done, we have closed the technology divide in a dramatic way,” he said. “Everybody is a programmer, and the programming language of the future is called ‘human.’”

Beyond drug development, this transformation to a software-defined, AI-driven industry will also advance medical instruments.

“A medical instrument is never going to be the same again. Ultrasound systems, CT scan systems, all kinds of instruments — they’re always going to be a device plus a whole bunch of AIs,” Huang said. “The value that will create, the opportunities you create, are going to be incredible.”

For more from NVIDIA at the J.P. Morgan Healthcare Conference, listen to the audio recording and view the presentation deck of Powell’s session.

Learn about NVIDIA’s AI platform for healthcare and life sciences and subscribe to NVIDIA healthcare news.


AMIE: A research AI system for diagnostic medical reasoning and conversations


The physician-patient conversation is a cornerstone of medicine, in which skilled and intentional communication drives diagnosis, management, empathy and trust. AI systems capable of such diagnostic dialogues could increase availability, accessibility, quality and consistency of care by being useful conversational partners to clinicians and patients alike. But approximating clinicians’ considerable expertise is a significant challenge.

Recent progress in large language models (LLMs) outside the medical domain has shown that they can plan, reason, and use relevant context to hold rich conversations. However, there are many aspects of good diagnostic dialogue that are unique to the medical domain. An effective clinician takes a complete “clinical history” and asks intelligent questions that help to derive a differential diagnosis. They wield considerable skill to foster an effective relationship, provide information clearly, make joint and informed decisions with the patient, respond empathically to their emotions, and support them in the next steps of care. While LLMs can accurately perform tasks such as medical summarization or answering medical questions, there has been little work specifically aimed towards developing these kinds of conversational diagnostic capabilities.

Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on an LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference time chain-of-reasoning strategy to improve AMIE’s diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors.

AMIE was optimized for diagnostic conversations, asking questions that help to reduce its uncertainty and improve diagnostic accuracy, while also balancing this with other requirements of effective clinical communication, such as empathy, fostering a relationship, and providing information clearly.

Evaluation of conversational diagnostic AI

Besides developing and optimizing AI systems themselves for diagnostic conversations, how to assess such systems is also an open question. Inspired by accepted tools used to measure consultation quality and clinical communication skills in real-world settings, we constructed a pilot evaluation rubric to assess diagnostic conversations along axes pertaining to history-taking, diagnostic accuracy, clinical management, clinical communication skills, relationship fostering and empathy.

We then designed a randomized, double-blind crossover study of text-based consultations with validated patient actors interacting either with board-certified primary care physicians (PCPs) or the AI system optimized for diagnostic dialogue. We set up our consultations in the style of an objective structured clinical examination (OSCE), a practical assessment commonly used in the real world to examine clinicians’ skills and competencies in a standardized and objective way. In a typical OSCE, clinicians might rotate through multiple stations, each simulating a real-life clinical scenario where they perform tasks such as conducting a consultation with a standardized patient actor (trained carefully to emulate a patient with a particular condition). Consultations were performed using a synchronous text-chat tool, mimicking the interface familiar to most consumers using LLMs today.

AMIE is a research AI system based on LLMs for diagnostic reasoning and dialogue.

AMIE: an LLM-based conversational diagnostic research AI system

We trained AMIE on real-world datasets comprising medical reasoning, medical summarization and real-world clinical conversations.

It is feasible to train LLMs using real-world dialogues developed by passively collecting and transcribing in-person clinical visits; however, two substantial challenges limit their effectiveness in training LLMs for medical conversations. First, existing real-world data often fails to capture the vast range of medical conditions and scenarios, hindering scalability and comprehensiveness. Second, data derived from real-world dialogue transcripts tends to be noisy, containing ambiguous language (including slang, jargon, humor and sarcasm), interruptions, ungrammatical utterances, and implicit references.

To address these limitations, we designed a self-play based simulated learning environment with automated feedback mechanisms for diagnostic medical dialogue in a virtual care setting, enabling us to scale AMIE’s knowledge and capabilities across many medical conditions and contexts. We used this environment to iteratively fine-tune AMIE with an evolving set of simulated dialogues in addition to the static corpus of real-world data described.

This process consisted of two self-play loops: (1) an “inner” self-play loop, where AMIE leveraged in-context critic feedback to refine its behavior on simulated conversations with an AI patient simulator; and (2) an “outer” self-play loop where the set of refined simulated dialogues were incorporated into subsequent fine-tuning iterations. The resulting new version of AMIE could then participate in the inner loop again, creating a virtuous continuous learning cycle.
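In hedged pseudocode (every function here is a hypothetical stand-in; the actual AMIE components are not public), the two loops compose roughly as follows:

```python
def inner_loop(model, scenario):
    # Inner self-play: converse with an AI patient simulator, then revise
    # the dialogue using in-context feedback from a critic.
    dialogue = simulate_consultation(model, patient_simulator(scenario))
    critique = critic_feedback(dialogue)
    return revise_with_feedback(model, dialogue, critique)

def outer_loop(model, real_world_corpus, num_iterations):
    # Outer self-play: fold refined simulated dialogues back into the next
    # round of fine-tuning, then let the new model play the inner loop again.
    dialogues = list(real_world_corpus)
    for _ in range(num_iterations):
        dialogues += [inner_loop(model, s) for s in sample_scenarios()]
        model = fine_tune(model, dialogues)
    return model
```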

Further, we also employed an inference time chain-of-reasoning strategy which enabled AMIE to progressively refine its response conditioned on the current conversation to arrive at an informed and grounded reply.

AMIE uses a novel self-play based simulated dialogue learning environment to improve the quality of diagnostic dialogue across a multitude of disease conditions, specialities and patient contexts.

We tested performance in consultations with simulated patients (played by trained actors), compared to those performed by 20 real PCPs using the randomized approach described above. AMIE and PCPs were assessed from the perspectives of both specialist attending physicians and our simulated patients in a randomized, blinded crossover study that included 149 case scenarios from OSCE providers in Canada, the UK and India in a diverse range of specialties and diseases.

Notably, our study was not designed to emulate either traditional in-person OSCE evaluations or the ways clinicians usually use text, email, chat or telemedicine. Instead, our experiment mirrored the most common way consumers interact with LLMs today, a potentially scalable and familiar mechanism for AI systems to engage in remote diagnostic dialogue.

Overview of the randomized study design to perform a virtual remote OSCE with simulated patients via online multi-turn synchronous text chat.

Performance of AMIE

In this setting, we observed that AMIE performed simulated diagnostic conversations at least as well as PCPs when both were evaluated along multiple clinically-meaningful axes of consultation quality. AMIE had greater diagnostic accuracy and superior performance for 28 of 32 axes from the perspective of specialist physicians, and 24 of 26 axes from the perspective of patient actors.

AMIE outperformed PCPs on multiple evaluation axes for diagnostic dialogue in our evaluations.
Specialist-rated top-k diagnostic accuracy. AMIE and PCPs top-k differential diagnosis (DDx) accuracy are compared across 149 scenarios with respect to the ground truth diagnosis (a) and all diagnoses listed within the accepted differential diagnoses (b). Bootstrapping (n=10,000) confirms all top-k differences between AMIE and PCP DDx accuracy are significant with p <0.05 after false discovery rate (FDR) correction.
Diagnostic conversation and reasoning qualities as assessed by specialist physicians. On 28 out of 32 axes, AMIE outperformed PCPs while being comparable on the rest.

Limitations

Our research has several limitations and should be interpreted with appropriate caution. Firstly, our evaluation technique likely underestimates the real-world value of human conversations, as the clinicians in our study were limited to an unfamiliar text-chat interface, which permits large-scale LLM–patient interactions but is not representative of usual clinical practice. Secondly, any research of this type must be seen as only a first exploratory step on a long journey. Transitioning from a LLM research prototype that we evaluated in this study to a safe and robust tool that could be used by people and those who provide care for them will require significant additional research. There are many important limitations to be addressed, including experimental performance under real-world constraints and dedicated exploration of such important topics as health equity and fairness, privacy, robustness, and many more, to ensure the safety and reliability of the technology.

AMIE as an aid to clinicians

In a recently released preprint, we evaluated the ability of an earlier iteration of the AMIE system to generate a DDx alone or as an aid to clinicians. Twenty (20) generalist clinicians evaluated 303 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) ClinicoPathologic Conferences (CPCs). Each case report was read by two clinicians randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or AMIE assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools.

Assisted randomized reader study setup to investigate the assistive effect of AMIE to clinicians in solving complex diagnostic case challenges from the New England Journal of Medicine.

AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs. 33.6%, p= 0.04). Comparing the two assisted study arms, the top-10 accuracy was higher for clinicians assisted by AMIE, compared to clinicians without AMIE assistance (24.6%, p<0.01) and clinicians with search (5.45%, p=0.02). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without AMIE assistance.

In addition to strong standalone performance, using the AMIE system led to significant assistive effect and improvements in diagnostic accuracy of the clinicians in solving these complex case challenges.

It’s worth noting that NEJM CPCs are not representative of everyday clinical practice. They are unusual case reports covering only a few hundred individuals, so they offer limited scope for probing important issues like equity or fairness.

Bold and responsible research in healthcare — the art of the possible

Access to clinical expertise remains scarce around the world. While AI has shown great promise in specific clinical applications, engagement in the dynamic, conversational diagnostic journeys of clinical practice requires many capabilities not yet demonstrated by AI systems. Doctors wield not only knowledge and skill but a dedication to myriad principles, including safety and quality, communication, partnership and teamwork, trust, and professionalism. Realizing these attributes in AI systems is an inspiring challenge that should be approached responsibly and with care. AMIE is our exploration of the “art of the possible”, a research-only system for safely exploring a vision of the future where AI systems might be better aligned with attributes of the skilled clinicians entrusted with our care. It is early experimental-only work, not a product, and has several limitations that we believe merit rigorous and extensive further scientific studies in order to envision a future in which conversational, empathic and diagnostic AI systems might become safe, helpful and accessible.

Acknowledgements

The research described here is joint work across many teams at Google Research and Google DeepMind. We are grateful to all our co-authors – Tao Tu, Mike Schaekermann, Anil Palepu, Daniel McDuff, Jake Sunshine, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Sara Mahdavi, Karan Singhal, Shekoofeh Azizi, Nenad Tomasev, Yun Liu, Yong Cheng, Le Hou, Albert Webson, Jake Garrison, Yash Sharma, Anupam Pathak, Sushant Prakash, Philip Mansfield, Shwetak Patel, Bradley Green, Ewa Dominowska, Renee Wong, Juraj Gottweis, Dale Webster, Katherine Chou, Christopher Semturs, Joelle Barral, Greg Corrado and Yossi Matias. We also thank Sami Lachgar, Lauren Winer and John Guilyard for their support with narratives and the visuals. Finally, we are grateful to Michael Howell, James Manyika, Jeff Dean, Karen DeSalvo, Zoubin Ghahramani and Demis Hassabis for their support during the course of this project.

Read More

AMIE: A research AI system for diagnostic medical reasoning and conversations

AMIE: A research AI system for diagnostic medical reasoning and conversations

The physician-patient conversation is a cornerstone of medicine, in which skilled and intentional communication drives diagnosis, management, empathy and trust. AI systems capable of such diagnostic dialogues could increase availability, accessibility, quality and consistency of care by being useful conversational partners to clinicians and patients alike. But approximating clinicians’ considerable expertise is a significant challenge.

Recent progress in large language models (LLMs) outside the medical domain has shown that they can plan, reason, and use relevant context to hold rich conversations. However, there are many aspects of good diagnostic dialogue that are unique to the medical domain. An effective clinician takes a complete “clinical history” and asks intelligent questions that help to derive a differential diagnosis. They wield considerable skill to foster an effective relationship, provide information clearly, make joint and informed decisions with the patient, respond empathically to their emotions, and support them in the next steps of care. While LLMs can accurately perform tasks such as medical summarization or answering medical questions, there has been little work specifically aimed towards developing these kinds of conversational diagnostic capabilities.

Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on a LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference time chain-of-reasoning strategy to improve AMIE’s diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors.

AMIE was optimized for diagnostic conversations, asking questions that help to reduce its uncertainty and improve diagnostic accuracy, while also balancing this with other requirements of effective clinical communication, such as empathy, fostering a relationship, and providing information clearly.

Evaluation of conversational diagnostic AI

Besides developing and optimizing AI systems themselves for diagnostic conversations, how to assess such systems is also an open question. Inspired by accepted tools used to measure consultation quality and clinical communication skills in real-world settings, we constructed a pilot evaluation rubric to assess diagnostic conversations along axes pertaining to history-taking, diagnostic accuracy, clinical management, clinical communication skills, relationship fostering and empathy.

We then designed a randomized, double-blind crossover study of text-based consultations with validated patient actors interacting either with board-certified primary care physicians (PCPs) or the AI system optimized for diagnostic dialogue. We set up our consultations in the style of an objective structured clinical examination (OSCE), a practical assessment commonly used in the real world to examine clinicians’ skills and competencies in a standardized and objective way. In a typical OSCE, clinicians might rotate through multiple stations, each simulating a real-life clinical scenario where they perform tasks such as conducting a consultation with a standardized patient actor (trained carefully to emulate a patient with a particular condition). Consultations were performed using a synchronous text-chat tool, mimicking the interface familiar to most consumers using LLMs today.

AMIE is a research AI system based on LLMs for diagnostic reasoning and dialogue.

AMIE: an LLM-based conversational diagnostic research AI system

We trained AMIE on real-world datasets comprising medical reasoning, medical summarization and real-world clinical conversations.

It is feasible to train LLMs on real-world dialogues collected by passively recording and transcribing in-person clinical visits; however, two substantial challenges limit the effectiveness of such data for training LLMs for medical conversations. First, existing real-world data often fails to capture the vast range of medical conditions and scenarios, hindering scalability and comprehensiveness. Second, data derived from real-world dialogue transcripts tends to be noisy, containing ambiguous language (including slang, jargon, humor, and sarcasm), interruptions, ungrammatical utterances, and implicit references.

To address these limitations, we designed a self-play based simulated learning environment with automated feedback mechanisms for diagnostic medical dialogue in a virtual care setting, enabling us to scale AMIE’s knowledge and capabilities across many medical conditions and contexts. We used this environment to iteratively fine-tune AMIE with an evolving set of simulated dialogues in addition to the static corpus of real-world data described.

This process consisted of two self-play loops: (1) an “inner” self-play loop, where AMIE leveraged in-context critic feedback to refine its behavior on simulated conversations with an AI patient simulator; and (2) an “outer” self-play loop where the set of refined simulated dialogues were incorporated into subsequent fine-tuning iterations. The resulting new version of AMIE could then participate in the inner loop again, creating a virtuous continuous learning cycle.
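A minimal sketch of these two loops, with LLM calls replaced by illustrative callables (these interfaces are assumptions, not AMIE's actual APIs):

```python
def inner_self_play(converse, critic, scenario, max_refinements=2):
    """Inner self-play loop (a sketch; not AMIE's actual interfaces).

    converse(scenario, feedback) runs one simulated consultation between the
    doctor agent and the AI patient simulator and returns a transcript;
    critic(transcript) returns automated, in-context feedback on the dialogue.
    """
    transcript = converse(scenario, feedback=None)
    for _ in range(max_refinements):
        feedback = critic(transcript)                       # automated critique
        transcript = converse(scenario, feedback=feedback)  # regenerate with critique in context
    return transcript

def outer_self_play(model, fine_tune, converse, critic, scenarios, real_corpus, rounds=3):
    """Outer self-play loop: refined simulated dialogues join the fine-tuning set.

    In a real system, `converse` would be rebuilt from the updated model after
    each round so the improved model re-enters the inner loop.
    """
    data = list(real_corpus)
    for _ in range(rounds):
        data += [inner_self_play(converse, critic, s) for s in scenarios]
        model = fine_tune(model, data)
    return model

# Toy usage with string-based stand-ins for the LLM calls:
transcript = inner_self_play(
    converse=lambda s, feedback: f"[dialogue for '{s}'; feedback={feedback}]",
    critic=lambda t: "ask about symptom onset and red flags",
    scenario="chronic cough",
)
print(transcript)
```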

We also employed an inference-time chain-of-reasoning strategy that enabled AMIE to progressively refine its response, conditioned on the conversation so far, and arrive at an informed and grounded reply.
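The exact prompts are not public; a hedged sketch of what such an inference-time chain-of-reasoning step could look like, with `llm` standing in for a model call:

```python
def chain_of_reasoning_reply(llm, conversation):
    """Inference-time chain-of-reasoning (a sketch; the prompts are assumptions).

    Before answering, the model is prompted through intermediate steps that
    ground the final reply in the conversation so far: summarize the
    presentation, update the differential, then draft and refine the
    patient-facing message.
    """
    summary = llm(f"Summarize the patient's presentation so far:\n{conversation}")
    ddx = llm(f"Given this summary, update the differential diagnosis:\n{summary}")
    draft = llm(f"Draft the next reply to the patient, informed by:\n{summary}\n{ddx}")
    return llm(f"Refine this draft for accuracy, clarity, and empathy:\n{draft}")
```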

AMIE uses a novel self-play based simulated dialogue learning environment to improve the quality of diagnostic dialogue across a multitude of disease conditions, specialties and patient contexts.

We compared AMIE's performance in consultations with simulated patients (played by trained actors) against that of 20 real PCPs, using the randomized approach described above. AMIE and the PCPs were assessed from the perspectives of both specialist attending physicians and our simulated patients in a randomized, blinded crossover study that included 149 case scenarios from OSCE providers in Canada, the UK, and India, spanning a diverse range of specialties and diseases.

Notably, our study was not designed to emulate either traditional in-person OSCE evaluations or the ways clinicians usually use text, email, chat or telemedicine. Instead, our experiment mirrored the most common way consumers interact with LLMs today, a potentially scalable and familiar mechanism for AI systems to engage in remote diagnostic dialogue.

Overview of the randomized study design to perform a virtual remote OSCE with simulated patients via online multi-turn synchronous text chat.

Performance of AMIE

In this setting, we observed that AMIE performed simulated diagnostic conversations at least as well as PCPs when both were evaluated along multiple clinically-meaningful axes of consultation quality. AMIE had greater diagnostic accuracy and superior performance for 28 of 32 axes from the perspective of specialist physicians, and 24 of 26 axes from the perspective of patient actors.

AMIE outperformed PCPs on multiple axes of diagnostic dialogue in our evaluations.
Specialist-rated top-k diagnostic accuracy. AMIE's and the PCPs' top-k differential diagnosis (DDx) accuracies are compared across 149 scenarios with respect to the ground-truth diagnosis (a) and all diagnoses listed within the accepted differential (b). Bootstrapping (n = 10,000) confirms that all top-k differences between AMIE and PCP DDx accuracy are significant with p < 0.05 after false discovery rate (FDR) correction.
Diagnostic conversation and reasoning qualities as assessed by specialist physicians. On 28 out of 32 axes, AMIE outperformed PCPs while being comparable on the rest.
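The caption above names the statistical machinery: a paired bootstrap (n = 10,000) with false discovery rate correction. A sketch of that style of analysis, as our reconstruction rather than the study's actual code:

```python
import random

def paired_bootstrap_pvalue(amie_hits, pcp_hits, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap test for a difference in top-k accuracy.

    amie_hits / pcp_hits: per-scenario 0/1 indicators of whether the
    ground-truth diagnosis appeared in the top-k DDx list.
    """
    rng = random.Random(seed)
    n = len(amie_hits)
    observed = sum(amie_hits) / n - sum(pcp_hits) / n
    diffs = [a - p for a, p in zip(amie_hits, pcp_hits)]
    centered = [d - observed for d in diffs]  # impose the null of no difference
    extreme = 0
    for _ in range(n_boot):
        mean = sum(centered[rng.randrange(n)] for _ in range(n)) / n
        if abs(mean) >= abs(observed):
            extreme += 1
    return extreme / n_boot

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected while controlling the false discovery rate."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    m = len(pvalues)
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank
    return set(order[:k])

# Toy example: per-scenario 0/1 hits for one top-k level.
amie = [1, 1, 0, 1, 1, 0, 1, 1]
pcp  = [1, 0, 0, 1, 0, 0, 1, 0]
print(paired_bootstrap_pvalue(amie, pcp, n_boot=2000))
```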

Limitations

Our research has several limitations and should be interpreted with appropriate caution. Firstly, our evaluation technique likely underestimates the real-world value of human conversations, as the clinicians in our study were limited to an unfamiliar text-chat interface, which permits large-scale LLM–patient interactions but is not representative of usual clinical practice. Secondly, any research of this type must be seen as only a first exploratory step on a long journey. Transitioning from an LLM research prototype, as evaluated in this study, to a safe and robust tool that could be used by people and those who provide care for them will require significant additional research. Many important limitations remain to be addressed, including performance under real-world constraints and dedicated exploration of topics such as health equity and fairness, privacy, and robustness, to ensure the safety and reliability of the technology.

AMIE as an aid to clinicians

In a recently released preprint, we evaluated the ability of an earlier iteration of the AMIE system to generate a DDx alone or as an aid to clinicians. Twenty generalist clinicians evaluated 303 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) ClinicoPathologic Conferences (CPCs). Each case report was read by two clinicians randomized to one of two assistive conditions: assistance from search engines and standard medical resources, or AMIE assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using their respective assistive tools.

Assisted randomized reader study setup to investigate the assistive effect of AMIE to clinicians in solving complex diagnostic case challenges from the New England Journal of Medicine.

AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs. 33.6%, p = 0.04). Comparing the two assisted study arms, top-10 accuracy was higher for clinicians assisted by AMIE than for clinicians without AMIE assistance (24.6% higher, p < 0.01) or clinicians assisted by search alone (5.45% higher, p = 0.02). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without its assistance.

In addition to its strong standalone performance, the AMIE system produced a significant assistive effect, improving clinicians' diagnostic accuracy on these complex case challenges.
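Both the standalone and assisted comparisons rest on top-k DDx accuracy. A minimal sketch of that metric; exact string matching here is a simplifying assumption, since judging whether two diagnosis strings name the same condition generally needs expert adjudication:

```python
def top_k_accuracy(ddx_lists, ground_truths, k=10, match=lambda a, b: a == b):
    """Fraction of cases whose ground-truth diagnosis appears in the top-k DDx.

    `match` is a pluggable comparator because, in practice, deciding whether
    two diagnosis strings refer to the same condition requires expert (or
    model-based) adjudication rather than string equality.
    """
    hits = sum(
        any(match(dx, truth) for dx in ddx[:k])
        for ddx, truth in zip(ddx_lists, ground_truths)
    )
    return hits / len(ground_truths)

# Toy example: 2 of 3 cases have the truth within the top-10 list.
lists = [["pneumonia", "asthma"], ["GERD"], ["migraine", "tension headache"]]
truths = ["asthma", "angina", "migraine"]
print(top_k_accuracy(lists, truths, k=10))  # 0.666...
```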

It is worth noting that NEJM CPCs are not representative of everyday clinical practice. They are unusual case reports involving only a few hundred individuals, so they offer limited scope for probing important issues such as equity or fairness.

Bold and responsible research in healthcare — the art of the possible

Access to clinical expertise remains scarce around the world. While AI has shown great promise in specific clinical applications, engagement in the dynamic, conversational diagnostic journeys of clinical practice requires many capabilities not yet demonstrated by AI systems. Doctors wield not only knowledge and skill but a dedication to myriad principles, including safety and quality, communication, partnership and teamwork, trust, and professionalism. Realizing these attributes in AI systems is an inspiring challenge that should be approached responsibly and with care. AMIE is our exploration of the “art of the possible”, a research-only system for safely exploring a vision of the future where AI systems might be better aligned with attributes of the skilled clinicians entrusted with our care. It is early experimental-only work, not a product, and has several limitations that we believe merit rigorous and extensive further scientific studies in order to envision a future in which conversational, empathic and diagnostic AI systems might become safe, helpful and accessible.

Acknowledgements

The research described here is joint work across many teams at Google Research and Google DeepMind. We are grateful to all our co-authors – Tao Tu, Mike Schaekermann, Anil Palepu, Daniel McDuff, Jake Sunshine, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Sara Mahdavi, Karan Singhal, Shekoofeh Azizi, Nenad Tomasev, Yun Liu, Yong Cheng, Le Hou, Albert Webson, Jake Garrison, Yash Sharma, Anupam Pathak, Sushant Prakash, Philip Mansfield, Shwetak Patel, Bradley Green, Ewa Dominowska, Renee Wong, Juraj Gottweis, Dale Webster, Katherine Chou, Christopher Semturs, Joelle Barral, Greg Corrado and Yossi Matias. We also thank Sami Lachgar, Lauren Winer and John Guilyard for their support with narratives and the visuals. Finally, we are grateful to Michael Howell, James Manyika, Jeff Dean, Karen DeSalvo, Zoubin Ghahramani and Demis Hassabis for their support during the course of this project.
