The reality of generative AI in the clinic

The reality of generative AI in the clinic

AI Revolution podcast | Episode 1 - The reality of generative AI in the clinic | outline illustration of Dr. Sara Murray, Peter Lee, Dr. Christopher Longhurst

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series—The AI Revolution in Medicine, Revisited—Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

In this episode, Dr. Christopher Longhurst (opens in new tab) and Dr. Sara Murray (opens in new tab), leading experts in healthcare AI implementation, join Lee to discuss the current state and future of AI in clinical settings. Longhurst, chief clinical and innovation officer at UC San Diego Health and executive director of the Jacobs Center for Health Innovation, details his healthcare system’s collaboration with Epic and Microsoft to integrate GPT into their electronic health record system, offering clinicians support in responding to patient messages. Dr. Murray, chief health AI officer at UC San Francisco Health, discusses AI’s integration into clinical workflows, the promise and risks of AI-driven decision-making, and how generative AI is reshaping patient care and physician workload.


Learn more:

Large Language Models for More Efficient Reporting of Hospital Quality Measures (opens in new tab)
Publication | October 2024 

Generative artificial intelligence responses to patient messages in the electronic health record: early lessons learned (opens in new tab)
Publication | July 2024 

The Chief Health AI Officer — An Emerging Role for an Emerging Technology (opens in new tab)
Publication | June 2024 

AI-Generated Draft Replies Integrated Into Health Records and Physicians’ Electronic Communication (opens in new tab) 
Publication | April 2024 

Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum (opens in new tab)
Publication | April 2023

The AI Revolution in Medicine: GPT-4 and Beyond
Book | April 2023

Transcript 

[MUSIC] 

[BOOK PASSAGE]  

PETER LEE: “The workload on healthcare workers in the United States has increased dramatically over the past 20 years, and in the worst way possible. … Far too much of the practical, day-to-day work of healthcare has evolved into a crushing slog of filling out and handling paperwork. … GPT-4 indeed looks very promising as a foundational technology for relieving doctors of many of the most taxing and burdensome aspects of their daily jobs.” 

[END OF BOOK PASSAGE] 

[THEME MUSIC] 

This is The AI Revolution in Medicine, Revisited. I’m your host, Peter Lee. 

Shortly after OpenAI’s GPT-4 was publicly released, Carey Goldberg, Dr. Zak Kohane, and I published The AI Revolution in Medicine to help educate the world of healthcare and medical research about the transformative impact this new generative AI technology could have. But because we wrote the book when GPT-4 was still a secret, we had to speculate. Now, two years later, what did we get right, and what did we get wrong?  

In this series, we’ll talk to clinicians, patients, hospital administrators, and others to understand the reality of AI in the field and where we go from here. 


[THEME MUSIC FADES]

What I read there at the top is a passage from Chapter 2 of the book, which captures part of what we’re going to cover in this episode. 

In our book, we predicted how AI would be leveraged in the clinic. Some of those predictions, I felt, were slam dunks, for example, AI being used to listen to doctor-patient conversations and write clinical notes. There were already early products coming out in the world not using generative AI that were doing just that. But other predictions we made were bolder, for instance, on the use of generative AI as a second set of eyes, to look over the shoulder of a doctor or a nurse or a patient and spot mistakes.

In this episode, I’m pleased to welcome Dr. Chris Longhurst and Dr. Sara Murray to talk about how clinicians in their respective systems are using AI, their reactions to it, and what’s ahead. Chris is the chief clinical and innovation officer at UC San Diego Health, and he is also the executive director of the Joan & Irwin Jacobs Center for Health Innovation. He’s in charge of UCSD Health’s digital strategy, including the integration of new technologies from bedside to bench and reaching across UCSD Health, the School of Medicine, and the Jacobs School of Engineering. Chris is a board-certified pediatrician and clinical informaticist.

Sara is vice president and chief health AI officer at UC San Francisco Health. Sara is an internal medicine specialist and associate professor of clinical medicine. A doctor, a professor of medicine, and a strategic health system leader, she builds infrastructure and governance processes to ensure that UCSF’s deployment of AI, including both AI procured from companies as well as AI-powered tools developed in-house, are trustworthy and ethical.

I’ve known Chris and Sara for years, and what’s really impressed me about their work—and frankly, the work of all the guests we’ll have on the show—is that they’ve all done something significant to advance the use of AI in healthcare.

[TRANSITION MUSIC]

Here’s my conversation with Dr. Chris Longhurst:  

LEE: Chris, thank you so much for joining us today. 

CHRISTOPHER LONGHURST: Peter, it’s a pleasure to be here. Really appreciate it. 

LEE: We’re going to get into, you know, what’s happening in the clinic with AI. But I think we need to find out a little bit more about you first. I introduced you as a person with a fancy title, chief clinical and innovation officer. What is that exactly, and how do you spend a typical day at work? 

LONGHURST: Well, I have a little bit of a unicorn job because my portfolio includes information technology, and I’m a recovering CIO after spending seven years in that role. It also includes quality patient safety, case management, and the office of our chief medical officer.  

And so I’m really trying to unify our mission to deliver highly reliable care with these new tools in a way that allows us to transform that care. One good analogy, I think, is it’s about the game, right. Our job is not only to play the game and win the game using the existing tools but also to change the game by leveraging these new tools and showing the rest of the country how that can be done. 

LEE: And so as you’re doing that, I can understand, of course, you’re working at a very, kind of, senior executive level. But, you know, when I’ve visited you at UCSD Health, you’re also working with clinicians, doctors, and nurses all the time. In a way, I viewed you as, sort of, connective tissue between these things. Is that accurate? 

LONGHURST: Well, sure. And we’ve got, you know, several physicians who are part of the executive team who are also continuing to practice, and I think that’s one of the ways in which doctors on the executive team can bring value, is being that connective tissue, being the ears on the ground and a little dose of reality. 

LEE: [LAUGHS] Well, in fact, that reality is really what I want to delve into. But I just want to, before getting into that, talk a little bit about AI and your encounters with AI. And I think we have to do it in two stages because there is AI and machine learning and data analytics prior to the rise of generative AI and then, of course, after. And so tell us a little bit about, you know, what got you into health informatics and AI to begin with. 

LONGHURST: Well, Peter, I know that you play video games, and I did too for many years. So I was an early John Carmack id Software, Castle Wolfenstein, and Doom fan.  

LEE: Love it.  

LONGHURST: And that kept me occupied because I lived out in the country on 50 acres of almond trees. And so it was computer gaming that first got me into computers.  

But during medical school, I decided to pursue graduate work in this field called health informatics. And actually my master’s thesis was using machine learning to help identify and distinguish innocent from pathologic heart murmurs in children. And I worked with Dr. Nancy Reed at UC Davis, who had programmed using Lisp, a really fancy tool to do exactly that.  

And I will tell you that if I never see another parentheses in Lisp code again, it’ll be too soon. So I spent a solid year on that. 

LEE: [LAUGHS] No, no, but you should wear that as a badge of honor. And I will guess that no other guest on this podcast series will have programmed in Lisp. So kudos to you. 

LONGHURST: [LAUGHS] Well, it was a lot of work, and I learned a lot, but as you can imagine, it wasn’t highly successful at the time. And fast forward, we’ve had lots of traditional machine learning kind of activities using discrete data for predictive analytics to help predict flow in the hospital and even sepsis, which we can talk about. But as you said, the advent of generative AI in the fall of 2022 was a real game-changer. 

LEE: Well, you have this interest in technology, and, in fact, I do know you as a fairly intensely geeky person. Really, I think maybe that’s one reason why we’ve been attracted to each other. But you also got drawn into medicine. Where did that come from? 

LONGHURST: So my father was a practicing cardiologist and scientist. He was MD, PhD trained, and he really shared with me both a love of medicine but also science. I worked in his lab for three summers, and it was during college I decided I wanted to apply to medical school because the human side of the science really drew me in.  

But my father was the one who really identified it was important to cross-train. And that’s why I decided to take time off to do that master’s degree in health informatics and see if I could figure out how to take two disparate fields and really combine them into one.  

I actually went down to Stanford to become a pediatrician because they have a standalone children’s hospital that’s one of the best in the country. And I still practice pediatrics and see newborns, and it’s a passion for me and part of my identity.

LEE: Well, I’m just endlessly fascinated and impressed with people who can span these two worlds in the way that you’ve done. So now, you know, 2022, in November, ChatGPT gets released to the world, and then, you know, a few months later, GPT-4, and then, of course, in the last two years, so much has happened. But what was your first encounter with what we now know of as generative AI? 

LONGHURST: So I remember when ChatGPT was released, and, you know, some of my computer science-type of nerd friends, we were on text threads, you know, with a lot of mind-blowing emojis. But when it really hit medicine was when I got a call right after Thanksgiving in 2022 from my colleague. He was playing with ChatGPT, and he said to me, Chris, I’ve been feeding it patient questions and you wouldn’t believe the responses. And he emailed some of the examples to me, and my mind was blown.

And so that’s when I became one of the reviewers on the paper that was published in April of 2023 that showed not only could ChatGPT help answer questions from patients in a high-quality way, but it also expressed a tremendous amount of empathy.[1] And in fact, in our review, the clickbait headlines that came out of the paper were that the chatbot was both higher quality and more empathetic than doctors.

But that wasn’t my takeaway at all. In fact, I’ll take my doctors any day and put them against your chatbot if you give them an hour to Google and construct a really long, thoughtful response. To me, part of the takeaway was that this was really an opportunity to improve efficiency and save time. And so I called up our colleagues at Epic. I think it was right around December of 2022. And I said, Sumit, have you seen this? I’d like to share some results with you. And I showed him the data from our paper before we had actually had it published. And he said, “Well, that’s great because we’re working with Peter Lee and the team at Microsoft to integrate GPT into Epic.”  

And so, of course, that’s how we became one of the first two sites in the country to roll out GPT inside our electronic health record to help draft answers to patient questions.  

LEE: And, you know, one thing that’s worth emphasizing in the story that you’ve just told is that there is no other major health system that has been confronting the reality of generative AI longer than UC San Diego Health—and I think largely because of your drive and early adoption.  

And many listeners of this podcast will know what Epic is, but many will not. And so it’s worth saying that Epic is a very important creator of an electronic health records system. And of course, UC San Diego Health uses Epic to store all of the clinical data for its patients.  

And then Sumit is, of course, Sumit Rana, who is president at Epic.  

LONGHURST: So in partnership with Epic, we decided to tackle a really important challenge in healthcare today, which is, particularly since the pandemic and the increase in virtual and telehealth care, our clinicians get more messages than ever from patients. But answering those asynchronous messages is an unreimbursed, noncompensated activity that can often take time after hours—what we call “pajama time”—for our doctors.  

And in truth, you know, health systems that have thought through this, most of the answers are not actually generated by the doctors themselves. Many times, it’s mid-level providers, protocol schedulers, other things, because the questions can be about anything from rescheduling an appointment to a medication refill. They don’t all require doctors.  

When they do, it’s a more complicated question, and sometimes can require a more complicated answer. And in many cases, the clinicians will see a long complex question, and rather than typing an answer, they’ll say, “You know, this is complicated. Why don’t you schedule a visit with me so we can talk about it more?” 

LEE: Yeah, so now you’ve made a decision to contact people at Epic to what … posit the idea that AI might be able to make responding to patient queries easier? Is that the story here?

LONGHURST: That’s exactly right. And Sumit knew well that this is a challenge across many organizations. This is not unique to UC San Diego or Stanford. And there’s been a lot of publications about it. It’s even been in the lay press. So our hypothesis was that using GPT to help draft responses for doctors would save them time, make it easier, and potentially result in higher-quality, more empathetic answers to patients. 

LEE: And so now the thing that I was so impressed with is you actually did a carefully controlled study to try to understand how well does that work. So tell us a little bit first about the results of that study but then how you set it up. 

LONGHURST: Sure. Well, first, I want to acknowledge something you said at the beginning, which is one of my hats is the executive director of the Joan & Irwin Jacobs Center for Health Innovation. And we’re incredibly grateful to the Jacobs for their gift, which has allowed us to not only implement AI as part of hospital operations but also to have resources that other health systems may not have to be able to study outcomes. And so that really enabled what we’re going to talk about here. 

LEE: Right. By the way, one of the things I was personally so fascinated by is, of course, in our book, we speculated that things like after-visit notes to patients, responding to patient queries might be something that happens. And you, at the same time we were writing the book, were actually actively trying to make that real, which is just incredible and for me, and I think my coauthors, pretty affirming. 

LONGHURST: I think you guys were really prescient in your vision. The book is tremendous. I have a signed copy of Peter’s book, and I recommend it for all your listeners. [LAUGHTER]  

LEE: All right, so now what have you found about … 

LONGHURST: Yeah. 

LEE: … generative AI?  

LONGHURST: Yeah. Well, first to understand what we found, you have to understand how we built [the AI inbox response tool]. And so Stanford and UC San Diego really collaborated with Epic on designing what this would look like. So doctor gets that patient message. We feed some information to GPT that’s not only the message but also some information about the patient—their problems and medications and past medical and surgical history and that sort of thing. 

LEE: Is there a privacy concern that patients should be worried about when that happens? 

LONGHURST: Yeah, it’s a really good question. There’s not because we’re operating in partnership with Epic and Microsoft in a HIPAA-compliant cloud. And so that data is not only secure and private, but that’s our top priority, is keeping it that way. 

LEE: Great. 

LONGHURST: So once we feed that into GPT, of course, we very quickly get a draft message that we could send to a patient. But we chose not to just send that message to a patient. So part of our AI governance is keeping a human in the loop. And there’s two buttons that allow that clinician to review the message. One button says Edit draft message, and the other button says Start new blank message. So there’s no button that says just Send now. And that really is illustrative of the approach that we took. The second thing, though, that we chose to do I think is really interesting from a conversation standpoint is that our AI governance, as they were looking at this, said, “You know, AI is new and novel. It can be scary to patients. And if we want to maximize trust with our patients, we should maximize transparency.” And so anytime a clinician uses the button that says Edit draft response, we automatically append something in the message that says, “This message was automatically generated and reviewed and edited by your doctor.” We felt strongly that was the right approach, and we’ve had a lot of positive feedback. 

LEE: And so we’ll want to get into, you know, how good these messages are, whether there are issues with bias or hallucination, but before doing that, you know, on this human in loop, this was another theme in our book. And in fact, we recommended this. But there were other health systems around the country that were also later experimenting with similar ideas. And some have taken different approaches. In fact, as time has gone on, if anything, it seems like it’s become a little bit less clear, this sort of labeling idea. Has your view on this evolved at all over the last two years? 

LONGHURST: First of all, I’m glad that we did it. I think it was the right choice for University of California, and in fact, the other four UC sites are all doing this, as well. There is variability across the organizations that are using this functionality, and as you suggest, there’s tens of thousands of physicians and hundreds of thousands if not millions of patients receiving these messages. And it’s been highlighted a bit in the press.  

I can tell you that talking about our approach to transparency, one of our lawmakers in the state of California heard about this and actually proposed a bill that was signed into legislation by our governor so that effective Jan. 1, any communication with patients that uses AI has to be disclosed with those patients. And so there is some thought that this is perhaps the right approach.  

I don’t think that it’s a perfect approach, though. We’re using AI in more and more ways, and it’s not as if we’re going to be able to disclose every single time that we’re doing it to prioritize, you know, scheduling for the sickest patients or to help operationally on billing or something else. And so I think that there are other ways we need to figure it out. But we have called on national societies and others to try to create some guidelines around this because we should be as transparent as we can with our patients. 

LEE: Obviously, one of the issues—and we highlighted this a lot in our book—is the problem of hallucination. And surely this must be an issue when you’re having AI draft these notes to patients. What have you found? 

LONGHURST: We were worried about that when we rolled it out. And what we found is not only were there very few hallucinations, in some cases, our doctors were learning from the GPT. And I can give you an example. When a patient who had had a visit wrote their doctor afterwards and said, “Doc, I’ve been thinking a lot about what we discussed in quitting smoking marijuana.” And the GPT draft reply said something to the effect of, “That’s great news. Here’s a bunch of evidence on how smoking marijuana can harm your lungs and cause other effects. And by the way, since you live in the state of California, here’s the marijuana quitters helpline.” And the doctor who was sending this called me up to tell me about it. And I said, well, is there a marijuana quitters helpline in the state of California? And he said, “I didn’t know, so I Googled it. And yeah, there is.” And so that’s an example of the GPT actually having more information than, you know, a primary care clinician might have. And so there are cases clearly where the GPT can help us increase the quality. In addition, some of the feedback that we’ve been getting both anecdotally and now measuring is that these draft responses do carry that tone of empathy that Dr. [John] Ayers [2] and I saw in the original manuscript. And we’ve heard from our clinicians that it’s reminding them to be empathetic because you don’t always have that time when you’re hammering out a quick short message, right?  

LEE: You know, I think the thing that we’ve observed, and we’ve discussed this also, is exactly that reminding thing. There might be in the encounter between a doctor and patient, maybe a conversation about, you know, going to a football game for the first time. That could be part of the conversation. But in a busy doctor’s life, when writing a note, you might forget about that. And, of course, an AI has the endless ability to remember that it might be friendly to send well wishes. 

LONGHURST: Exactly right, Peter. In fact, one of the findings in Dr. Ayers’s manuscript that didn’t get as much attention but I think is really important was the difference in length between the responses. So I was one of the putatively blinded reviewers, but as I was looking at the questions and answers, it was really obvious which ones were the chatbot and which ones were the doctors because the chatbot was always, you know, three or four paragraphs and the doctor was three or four sentences, right. It’s about time. And so we saw that in the results of our study.  

LEE: All right, so now let’s get into those results.

LONGHURST: OK. Well, first of all, my hypothesis was that this would help us save time, and I was wrong. It turns out a busy primary care clinician might get about 30 messages a day from patients, and each one of those messages might take about 30 seconds to type a quick response, a two-sentence response, a dot phrase, a macro. Your labs are normal. No need to worry. I’ll call you if anything comes up. After we implemented the AI tool, it still took about 30 seconds per message to respond. But we saw that the responses were two to three times longer on average, and they carried a more empathetic tone. [3] And our physicians told us it decreased cognitive burden, which is not surprising because any of you have written know that it’s much easier to edit somebody else’s copy than it is to face a blank screen, right. That’s why I like to be senior author, not lead author.

And so the tool actually helped quite a bit, but it didn’t help in the ways that we had expected necessarily. There are some other sites that have now found a little bit of time savings, but it’s really nominal overall. The Stanford study (opens in new tab) that was done at the same time—and we actually had some shared coauthors—measured physician burnout using a validated survey, and they saw a decrease in measured physician burnout. And so there are clear advantages to this, and we’re still learning more.

In fact, we’ve now rolled this out not only to all of our physicians, but to all of our nurses who help answer those messages in many different clinics. And one of the things that we’re finding—and Dr. CT Lin at University of Colorado recently published (opens in new tab)—is that this tool might actually help those mid-level providers even more because it’s really good at protocolized responses. I mentioned at the beginning, some of the questions that come to the physicians may be more the edge cases that require a little bit less protocolized kind of answers. And so as we get into academic subspecialties like gynecology oncology, the GPT might not be dishing up a draft message that’s quite as useful. But if you’re a nurse in obstetrics and you’re getting very routine pregnancy questions, it could save a ton of time. And so we’ve rolled this out broadly.  

I want to acknowledge the partnership with Seth Hain and the team at Epic, who’ve just been fantastic. And we’re finding all sorts of new ways to integrate the GPT tools into our electronic health record, as well. 

LEE: Yeah. Certainly the doctors and nurses that I’ve encountered that have access to this feature, they just don’t want to give it up. But it’s so interesting that it actually doesn’t really save time. Is that a problem? Because, of course, you know, there seems to be a workforce shortage in healthcare, a need to lower costs and have greater efficiencies. You know, how do you think about that?

LONGHURST: Great question. There are so many opportunities, as you’ve kind of mentioned. I mean, healthcare is full of waste and inefficiency, and I am super bullish on how these generative AI tools are going to help us reduce some of that inefficiency.  

So everything from revenue cycle to our call centers to operations efficiency, I think, can be positively impacted, and those things make more resources available for clinicians and others. When we think about, you know, saving clinicians time, I don’t think it’s necessarily, sort of, the communicating with patients where you want to save that time actually. I think what we want to do is we want to offload some of those administrative tasks that, you know, take a lot of time for our physicians.  

So we’ve measured “pajama time” in our doctors, and on average, a busy primary care clinician can spend one to two hours after clinic doing things. But only about 15 minutes is answering messages from patients. Actually, the bulk of the time after hours is documenting the notes that are required from those visits, right. And those notes are used for a number of different purposes, not only communicating to the next doctor who sees the patient but also for billing purposes and compliance purposes and medical legal purposes. So another really exciting area is AI scribes. 

LEE: Yeah. And so, you know, we’ll get into scribes and actually other possibilities. I wonder, though, about this empathy issue. Because as computer scientists, we know that you can fall into traps if you anthropomorphize these AI systems or any machine. So in this study, how was that measured, and how real do think that is? 

LONGHURST: So in the study, you’ll see anecdotal or qualitative evidence about empathy. We have a follow-up study that will be published soon where we’ve actually measured empathy using some more quantitative tools, and there is no doubt that the chatbot-generated drafts are coming through with more empathy. And we’ve heard this from a number of our doctors, so it’s not surprising. Here’s one of the more surprising things though. I published a paper last year with Dr. Sally Baxter (opens in new tab), one of our ophthalmologists, and she actually looked at messages with a negative tone. It turns out, not surprisingly, healthcare can be frustrating. And stressed patients can send some pretty nasty messages to their care teams. [LAUGHTER] And you can imagine being a busy, …  

LEE: I’ve done it. [LAUGHS]

LONGHURST: … tired, exhausted clinician, and receiving a bit of a nasty gram from one of your patients can be pretty frustrating. And the GPT is actually really helpful in those instances in helping draft a pretty empathetic response when I think the human instinct would be a pretty nasty one. [LAUGHTER] I should probably use it in my email, Peter. 

LEE: And is the patient experience, the actually lived experience of patients when they receive these notes, are you absolutely convinced and certain that they are also benefiting from this empathetic tone? 

LONGHURST: I am. In fact, in our paper, we also found that the messages going to patients that had been drafted with the AI tool were two to three times longer (opens in new tab) than the messages going to patients that weren’t using the drafts. And so it’s clear there’s more content going and that content is either contributing to a greater sense of empathy and relationship among the patients as well as the clinicians, and/or in some cases, that content may be educating the patients or even reducing the need for follow-up visits.  

LEE: Yeah, so now I think an important thing to share with the audience here is, you know, healthcare, of course, is a very highly regulated industry for good reasons. There are issues of safety and privacy that have to be guarded very, very carefully and thoroughly. And for that reason, clinical studies oftentimes have very carefully developed controls and randomization setups. And so to what extent was that done in this case? Because here, it’s not like you’re testing a new drug. It’s something that’s a little fuzzier, isn’t it?

LONGHURST: Yeah, that’s right, Peter. And credit to the lead author, Dr. Ming Tai-Seale, we actually did randomize. And so that’s unusual in these type of studies. We actually got IRB [institutional review board] exemption to do this as a randomized QI study. And it was a crossover study because all the doctors wanted the functionality. So what we tested was the early adopters versus the late adopters. And we compared at the same time the early adopters to those who weren’t using the functionality and then later the late adopters to the folks that weren’t using the functionality. 

 LEE: And in that type of study, you might also, depending on how the randomization is set up, also have to have doctors some days using it and some days not having access. Did that also happen? 

LONGHURST: We did, but it wasn’t on a day-to-day basis. It was more a month-to-month basis. 

LEE: Uh-huh. And what kind of conversation do you have with a doctor that might be attached to a technology and then be told for the next month you don’t get to use it?  

LONGHURST: [LAUGHS] The good news is because of a doctor’s medical training, they all understood the need for it. And the conversation was sort of, hey, we’re going to need you to stop using that for a month so that we can compare it, but we’ll give it back to you afterwards. 

LEE: [LAUGHS] OK, great. All right. So now we made some other predictions. So we talked about, you know, responding to patients. You briefly mentioned clinical note-taking. We also made guesses about other types of paperwork, you know, filling out prior authorization requests or referral letters, maybe for a doctor to refer to a specialist. We even made some guesses about a second set of eyes on medications, on various treatment options, diagnoses. What of these things have happened and what hasn’t happened, at least in your clinical experience?

LONGHURST: Your guesses were spot on. And I would say almost all of them have already happened and are happening today at UC San Diego and many other health systems. We have a HIPAA-compliant GPT instance that can be used for things like generating patient letters, generating referral letters, even generating patient education with patient-friendly language. And that’s a common use case. The second set of eyes on medications is something that we’re exploring but have not yet rolled out. One of the areas I’m really excited about is reporting. So Johns Hopkins did a study a couple of years ago that showed an average academic medical center our size spends about $5 million annually (opens in new tab) just reporting on quality measures that are regulatory requirements. And that’s about accurate for us. 

We published a paper just last fall showing that large language models could help to pre-populate quality data (opens in new tab) for things like sepsis reporting in a really effective way. It was like 91% accurate. And so that’s a huge time savings and efficiency opportunity. Again, allows us to redeploy those qualities staff. We’re now looking at things like how do we use large language models to review charts for peer review to help ensure ongoing, you know, accuracy and mitigate risk. I’m really passionate about the whole space of using AI to improve quality and patient safety in particular.  

Your readers may be familiar with the famous report in 1999, “To Err is Human (opens in new tab),” that suggests a hundred thousand Americans die on an annual basis from medical errors. And unfortunately the data shows we really haven’t made great progress in 25 years, but these new tools give us the opportunity to impact that in a really meaningful way. This is a turning point in healthcare.

LEE: Yeah, medication errors—actually, all manner of medical errors—I think has been just such a frustrating problem. And, you know, I think this gives us some new hope. Well, let’s look ahead a little bit. And just to be a little bit provocative, you know, one question that I get asked a lot by both patients and clinicians is, you know, “Will AI replace doctors sometime in the future?” What are your thoughts? 

LONGHURST: So the pat response is AI won’t replace doctors, but AI will replace doctors who don’t use AI. And the implication there, of course, is that a doctor using AI will end up being a more effective practitioner than a doctor who doesn’t. And I think that’s absolutely true. From a medical legal standpoint, what is standard of care today and what is standard of care five or 10 years from now will be different. And I think there will be a point where doctors who aren’t using AI regularly, it would almost be unconscionable.  

LEE: Yeah, I think there are already some areas where we’ve seen this happen. My favorite example is with the technology of ultrasound, where if you’re a gynecologist or some part of internal medicine, there are some diagnostic procedures where it would really be malpractice not to use ultrasound. Whereas in the late 1950s, the safety and also the doctor training to read ultrasound images were all called into question. And so let’s look ahead two years from now, five years from now, 10 years from now. And on those three time frames, you know, what do you think—based on the practice of medicine today, what doctors and nurses are doing in clinic every day today—what do you think the biggest differences will be two years from now, five years from now, and 10 years from now? 

LONGHURST: Great question, Peter. So first of all, 10 years from now, I think that patients will be still coming to clinic. Doctors will still be seeing them. Hopefully we’ll have more house calls and care occurring outside the clinic with remote monitoring and things like that. But the most important part of healthcare is the humanism. And so what I’m really excited about is AI helping to restore humanism in medical care. Because we’ve lost some of it over the last 20, 30 years as healthcare has become more corporate.  

So in the next two to five years, some things I expect to see is AI baked into more workflows. So AI scribes are going to become incredibly commonplace. I also think that there are huge opportunities to use those scribes to help reduce errors in diagnosis. So five or seven years from now, I think that when you’re speaking to your physician about your symptoms and other things, the scribe is going to be developing a differential diagnosis and helping recommend not only the right follow-up tests or imaging but even the physical exam findings that the doctor might want to look for in particular to help make a diagnosis.  

Dirty secret in healthcare, Peter, is that 50% of doctors are below average. It’s just math. And I think that the AI can help raise all of our doctors. So it’s like Lake Wobegon. They’re all above average. It has important implications for the workforce as you were saying. Do we need all visits to be with primary care doctors? Will mid-level providers augmented by AI be able to do as great a job as many of our physicians do? I think these are unanswered questions today that need to be explored. And then there was a really stimulating editorial in The New York Times recently by Dr. Eric Topol (opens in new tab), and he was waxing philosophic about a recent study that showed AI could interpret X-rays with 90% accuracy and radiologists actually achieve about 72% accuracy (opens in new tab).

LEE: Right. 

LONGHURST: The study looked at, how did the radiologists do with AI working together? And they got about 74% accuracy. So the doctors didn’t believe the AI. They thought that they were in the right, and the inference that Eric took that I agree with is that rather than always looking for ways to combine the two, we should be thinking about those tasks that are amenable to automation that could be offloaded with AI. So that our physicians are focused on the things that they’re great at, which is not only the humanism in healthcare but a lot of those edge cases we talked about. So let’s take mammogram screening as an example, chest X-ray screening. There’s going to be a point in the next five years where all first reads are being done by AI, and then it’s a subset of those that are positive that need to be reviewed by physicians. And that helps free up radiologists to do a lot of other things that we need them to do. 

LEE: Wow, that is really just such a great vision for the future. And I call some of this the flip, where even patient expectations on the use of technology flips from fear and uncertainty to, you know, you would try to do this without the technology? And I think you just really put a lot of color and detail on that. Well, Chris, thank you so much for this. On that groundbreaking paper from April of 2023, we’ll put a link to it. It’s a really great thing to read. And of course, you’ve published extensively since then. But I can’t thank you enough for just all the great work that you’re doing. It’s really changing medicine. 

[TRANSITION MUSIC]  

LONGHURST: Peter, can’t thank you enough for the opportunity to be here today and the partnership with Microsoft to make this all possible. 

LEE: I always love talking to Chris because he really is a prime example of an important breed of doctor, a doctor who has clinical experience but is also world-class tech geek. [LAUGHS] You know, it’s surprising to me, and pleasantly so, that the traditional gold standard of randomized trials that Chris has employed can be used to assess the viability of generative AI, not just for things like medical diagnosis, but even for seemingly mundane things like writing email notes to patients.  

The other surprise is that the use of AI, at least in the in-basket task, which involves doctors having to respond to emails from patients, doesn’t seem to save much time for doctors, even though the AI is drafting those notes. Doctors seem to love the reduced cognitive burden, and patients seem to appreciate the greater detail and friendliness that AI provides, but it’s not yet a big timesaver. And of course, the biggest surprise out of the conversation with Chris was his celebrated paper back two years ago now on the idea that AI notes are perceived by patients as being more empathetic than notes written by human doctors. Wow.

Let’s go ahead to my conversation with Dr. Sara Murray: 

LEE: Sara, I’m thrilled you’re here. Welcome. 

SARA MURRAY: Thank you so much for having me. 

LEE: You know, you have actually a lot of roles, and I know that’s not so uncommon for people at the leading academic medical institutions. But, you know, I think for our audience, understanding what a chief health AI officer does, an associate professor of clinical medicine—what does it all mean? And so to start, when you talk to someone, say, like your parents, how do you describe your job? You know, how do you spend a typical day at work? 

MURRAY: So first and foremost, I do always introduce myself as a physician because that’s how I identify, that’s how I trained. But in my current role, as the chief health AI officer, I’m really responsible for the vision and strategy for how we use trustworthy AI at scale to solve the biggest problems in our health system. And so I think there’s a couple key important points about that. One is that we have to be very careful that everything we’re doing in healthcare is trustworthy, meaning it’s safe, it’s ethical, it’s doing what we hope it’s doing, and it’s not causing any unexpected harm.  

And then, you know, second, we really want to be doing things that affect, you know, the population at large of the patients we’re taking care of. And so I think if you look historically at what’s happened with AI in healthcare, you’ve seen little studies here and there, but nothing broadly affecting or transforming how we deliver care. And I think now that we’re in this generative AI era, we have the tools to start thinking about how we’re doing that. And so that’s part of my role. 

LEE: And I’m assuming a chief health AI officer is not a role that has been around for a long time. Is this fairly new at UCSF, or has this particular job title been around?

MURRAY: No, it’s a relatively new role, actually. I came into this role about 18 months ago. I am the first chief health AI officer at UCSF, and I actually wrote the paper defining the role (opens in new tab) with Dr. Ashley Beecy, Dr. Chris Longhurst, Dr. Karandeep Singh, and Dr. Bob Wachter where we discuss what is this role in healthcare, why do we actually need it now, and what is this person accountable for. And I think it’s very important that as we roll these technologies out in health systems, we have someone who’s really accountable for thinking about, you know, whether we’re selecting the right tools and whether they’re being used in the right ways to impact our patients.  

LEE: It’s so interesting because I would say in the old days, you know, like five years ago, [LAUGHS] information technology in a hospital or health-system setting might be under the control and responsibility of a chief information officer, a CIO, or an IT, you know, chief. Or if it’s maybe some sort of medical device technology integration, maybe it’s some engineering-type of leader, a chief technology officer. But you’re different, and in fact the role that I think I would credit you with, sort of, making the blueprint for seems different because it’s actually doctors, practicing clinicians, who tend to inhabit these roles. Is there a reason why it’s different that way? Like, a typical CIO is not a clinician.

MURRAY: Yeah, so I report to our CIO. And I think that there’s a recognition that you need a clinician who really understands in practice how the tools can be deployed effectively. So it’s not enough to just understand the technology, but you really have to understand the use cases. And I think when you’re seeing physician chief health AI officers pop up around the country, it’s because they’re people who both understand the technology—not to the level you do obviously—but to some sufficient level and then understand how to use these tools in clinical care and where they can drive value and what the risks are in clinical care and that type of thing. And so I think it’d be hard for it not to be some type of clinician in this role.  

LEE: So I’m going to want to get into, you know, what’s really happening in clinic, but before that, I’ve been asking our guests about their “stages of AI grief,” [LAUGHS] as I like to put it. And for most people, I’ve been talking about the experiences and encounters with machine learning and AI before ChatGPT and then afterwards. And so can you tell us a little bit about, you know, how did you get into AI in the first place and what were your first encounters like? 

MURRAY: Yeah. So I actually started out as a health services researcher, and this was before we had electronic health records [EHR], when we were still writing our notes on carbon copy in the elevators, and a lot of the data we used was actually from claims data. And that was the kind of rich data source at the time, but as you know, that was very limited.  

And so when we went live with our electronic health record, I realized there was this tremendous opportunity to really use rich clinical data for research. And so I initially started collaborating with folks down at Stanford to do machine learning to identify, you know, rare diseases like lupus in the electronic health record but quickly realized there was this real gap in the health system for using data in an actionable way.  

And so I built what was initially our advanced analytics team, grew into our data science team, and is now our health AI team as our ability to use the data in more sophisticated ways evolved. But if we think about, I guess, the pre-generative era and my first encounter with AI or at least AI deployment in healthcare, you know, we initially, gosh, it was probably eight or nine years ago where we got access through our EHR vendor to some initial predictive tools, and these were relatively simple tools, but they were predicting things we care about in healthcare, like who’s not going make it to a clinic visit or how long patients are going stay in the hospital.  

And so there’s a lot of interest in, you know, predicting who might not make it to a clinic visit because we have big access issues with it being difficult for patients to get appointments, and the idea was that if you knew who wouldn’t show, you could actually put someone else in that slot, and it’s called overbooking. And so when we looked at the initial model, it was striking to me how risky it was for vulnerable patient populations because immediately it was obvious that this model was likely to overbook people by race, by body weight, by things that were clearly protected patient characteristics.  

And so we did a lot of work initially with that model and a lot of education around how these tools could be biased. But the risk existed, and as we continued to look at more of these models, we found there were a lot of issues with trustworthiness. You know, there was a length-of-stay prediction model that my team was able to outperform with a pair of dice. And when I talked to other systems about not implementing this model, you know, folks said, but it must be useful a little bit. I was like, actually, you know, if the dice is better, it’s not useful at all. [LAUGHS] 

LEE: Right!  

MURRAY: And so there was very little out there to frame this, but we quickly realized we have to start putting something together because there’s a lot of hype and there’s a lot of hope, but there’s also a lot of risk here. And so that was my pre-generative moment. 

LEE: You know, just before I get to your post-generative moment, you know, this story that you told, I sometimes refer to it as the healthcare IT world’s version of irrational exuberance. Because I think one thing that I’ve learned, and I have to say I’ve been guilty personally as a techie, you look at some of the problems that the world of healthcare faces, and to a techie first encountering this, a lot of it looks like common sense. Of course, we can build a model and predict these things.  

And you sort of don’t understand some of the realities, as you’ve described, that make this complicated. And at the same time, from healthcare professionals, I sometimes think they look at all of this dazzling machine learning magic and also are kind of overly optimistic that it can solve so many problems.  

And it does create this danger, this irrational exuberance, that both sides kind of get into a reinforcing cycle where they’re too quick to adopt technologies without thinking through the implications more carefully. I don’t know if that resonates with you at all. 

MURRAY: Yeah, totally. I think there’s a real educational opportunity here because it’s the “you don’t know what you don’t know” phenomenon. And so I do think there is a lot of work in healthcare to be done around, you know, people understanding the strengths and limitations of these tools because they’re not magic, but they are perceived to be magic.  

And likewise, you know, I think the tech world often doesn’t understand, you know, how healthcare is practiced and doesn’t think through these risks in the same way we do, right. So I know that some of the vulnerable patients who might’ve been overbooked by that algorithm are the people who I most need to see in clinic and are the people who would be, you know, most slighted if that they show up and the other patient shows up and now you have an overworked clinician. But I just think those are stages, you know, further down the pathway of utilization of these algorithms that people don’t think of when they’re initially developing them.  

And so one of the things we actually, you know, require in our AI oversight process is when folks come to the table with a tool, they have to have a plan for how it’s going to be used and operationalized. And a lot of things die right there, honestly, because folks have built a cool tool, but they don’t know who’s going to use it in clinic, who the clinical champions are, how it’ll be acted on, and you can’t really evaluate whether these tools are trustworthy unless you’ve thought through all of that.  

Because you can imagine using the same algorithm in dramatically different ways, right. If you’re using the no-show model to do targeted outreach and send people a free Lyft if they have transportation issues, that’s going to have very different outcomes than overbooking folks.  

LEE: It’s so interesting and I’m going to want to get back to this topic because I think it also speaks to the challenges of how do you integrate technologies into the daily workflow of a clinic. And I know this is something you think about a lot, but let’s get back now to my original question about your AI moments. So now November 2022, ChatGPT happens, and what is your encounter with this new technology? 

MURRAY: Yeah. So I used to be on MedTwitter, or I still am actually; it’s just not as active anymore. But I would say, you know, MedTwitter went crazy after ChatGPT was initially released and it was largely filled with catchy poems and people, you know, having fun …  

LEE: [LAUGHS] Guilty. 

MURRAY: Yeah, exactly. I still use poems. And people having fun trying to make it hallucinate. And so, you know, I went—I was guilty of that, as well—and so one of the things I initially did was I asked it to do something crazy. So I asked it, draft me a letter for a prior authorization request for a drug called Apixaban, which is a blood thinner, to treat insomnia. And if you practice clinical medicine, you know that we would never use a blood thinner to treat insomnia. But it wrote me such a compelling letter that I actually went back to PubMed and I made sure that I wasn’t missing anything, like some unexpected side effect. I wasn’t missing anything and in fact it was hallucination. And so at that moment I said, this is very promising technology, but this is still a party trick.  

LEE: Yeah. 

MURRAY: A few months later, I went and did the exact same prompt, and I got a lecture, instead of a draft, about how it would be unethical [LAUGHTER] and unsafe for me to draft such a request. And so, you know, I realized these tools were rapidly evolving, and the game was just going to be changing very quickly. I think the other thing that, you know, we’ve never seen before is the deployment of a technology at scale like we have with AI scribes.  

So this is a technology that was in its infancy, you know, two years ago, and is now largely a commodity deployed at scale across many health systems. A very short period of time. There’s been no government incentives for people to do this. And so it clearly works well enough to be used in clinics. And I think these tools, you know, like AI scribes, have the opportunity to really undo a lot of the harm that the electronic health record implementations were perceived to have caused.  

LEE: What is a scribe, first off? 

MURRAY: Yeah, so AI scribes or, as we’re now calling them, AI assistants or ambient assistants, are tools that essentially listen to your clinical interaction. We record them with the permission of a patient, with consent, and then they draft a clinical note, and they can also draft other things like the patient instructions. And the idea is those drafts are very helpful to clinicians, and they have to review them and edit them, but it saves a lot of the furious typing that was previously happening during patient encounters. 

LEE: We have been talking also to Chris Longhurst, your colleague at UC San Diego, and, you know, he mentions also the importance of having appropriate billing codes in those notes, which is yet another burden. Of course, when Carey, Zak, and I wrote our book, we predicted that AI scribes would get better and would find wider use because of the improvement in technology. Let me start by asking, do you yourself use an AI scribe? 

MURRAY: So I do not use it yet because I’m an inpatient doctor, and we have deployed them to all ambulatory clinic doctors because that’s where the technology is tried and true. So we’re looking now to deploy it in the inpatient setting, but we’re doing very initial testing. 

LEE: And what are the reasons for not integrating it into the inpatient setting? 

MURRAY: Well, there’s two things actually. Most inpatient documentation work, I would say, is follow-up documentation. And so you’re often taking your prior notes and making small changes to it as you change the care from day to day. And so the tools are just, all of the companies are working on this, but right now they don’t really incorporate your prior documentation or note when they draft your note for today.  

The second reason is that a lot of the decision-making that we do in the inpatient setting is asynchronous with the patient. So we’ll often have a conversation in the morning with the patient in their room, and then I’ll see some labs come back and I’ll make decisions and act on those labs and give the patient a call later and let them know what’s going on. And so it’s not a very succinct encounter, and so the technology is going to have to be a little bit different to work in that case, I think. 

LEE: Right, and so these are distinct workflows from the ambulatory setting, where it is the classic, you’re sitting with a patient in an exam room having an encounter. 

MURRAY: Mm-hmm. Exactly. And all your decisions are made there. And I would say it’s also different from nursing. We’re also looking at deploying these tools to nurses. But a lot of their documentation is in something called flowsheets. They write in columns, you know, specific numbers, and so for them to use these tools, they’d have to start saying to the patient, sounds like your pain is a five. Your blood pressure is 120 over 60. And so those are different workflows they’d have to adopt to use the tools. 

LEE: So you’ve been in the position of having to oversee the integration of AI scribes into UCSF health. From your perspective how were clinical staff actually viewing all of this?  

MURRAY: So I would say clinical staff are largely very excited, receptive, and would like us to move faster. And in fact, I gave a town hall to UCSF, and all of the comments were, when is this coming for APPs [advanced practice providers]? When is this coming for allied health professionals? And so people want this across healthcare. It’s not just doctors. But at the same time, you know, I think there’s a technology adoption curve, and about half of our ambulatory clinicians have signed up and about a third of them are now using the tool. And so we are now doing outreach to figure out who is not using it, why aren’t they using it, and what can we do to increase adoption. Or are there true barriers that we need to help folks overcome? 

LEE: And when you do these things, of course, there are risks. And as you were mentioning several times before, you were really concerned about hallucinations, about trustworthiness. So what were the steps that you took at UCSF to make these integrations happen? 

MURRAY: Yeah, so we have a AI oversight process for all tools that come into our healthcare with AI, regardless of where they’re coming from. So industry tools, internally developed tools, and research tools come through the same process. And we have a committee that is quite multidisciplinary. We have health system leaders, data scientists, bioethicists, researchers, health-equity experts. And through our process, we break down the AI lifecycle to a couple key places where these tools come for committee review. And so for every AI deployment, we expect people to establish performance metrics, fairness metrics, and we help them with figuring out what those things should be.

We were also fortunate to receive a donation to build a AI monitoring platform, which we’re working on now at UCSF. We call it our Impact Monitoring Platform for AI and Clinical Care, IMPACC, and AI scribes is actually our first use case. And so on that platform, we have a metric adjudication process where we’ve established, you know, what do we really care about for our health system executive leaders, what do we really care about for, you know, ensuring safety and trustworthiness, and then, you know, what are our patients going to want to know? Because we want to also be transparent with our patients about the use of these tools. And so we have processes for doing all this work.  

I think the challenge is actually how we scale these processes as more and more tools come through because as you could imagine, a lot of conversation with a lot of stakeholders to figure out what and how we measure things right now. 

LEE: And so there’s so much to get into there, but I actually want to zoom in on the actual experience that doctors, nurses, and patients are having. And, you know, do you find that AI is meeting expectations? Is it making a difference, positive or negative, in people’s lives? And what kinds of potential surprises are people encountering? 

MURRAY: Mm-hmm. So we’re collecting data in a couple of ways. First, we’re surveying clinicians before and after their experience, and we are hearing from folks that they feel like their clinic work is more manageable, that they’re more able to finish their documentation in a timely fashion.  

And then we’re looking at actual metrics that we can extract from the EHR around how long people are spending doing things. And that data is largely aligning with what people are reporting, although the caveat is they’re not saving enough time for us to have them see more patients. And so we’ve been very explicit at UCSF around making it clear that this is a tool to improve experience and not to improve efficiency.  

So we’re not expecting for people to see more patients as a result of using this tool. We want their clinic experience to be more meaningful. But then the other thing that’s interesting that folks share is this tremendous relief of cognitive burden that folks feel when using this tool. So they may have been really efficient before. You know, they could get all their work done. They could type while they were talking to their patients. But they didn’t actually, you know, get to look at their patients eye to eye and have the meaningful conversation that people went into medicine for. And so we’re hearing that, as well.  

And I think one of the things that’s going to be important to us is actually measuring that moving forward. And that is matched by some of the feedback we’re getting from patients. So we have quotes from patients where they’ve said, you know, my doctor is using this new tool and it’s amazing. We’re just having eye-to-eye conversations. Keep using it. So I think that’s really important. 

LEE: I’ve been pushing my own primary care doctor to get into this because I really depend on her. I love her dearly, but we never … I’m always looking at her back as she’s typing at a computer during our encounters. [LAUGHS] 

So, Sara, while we’re talking about efficiency, and at least the early evidence doesn’t show clear efficiency gains, it does actually beg the question about how or why health systems, many of which are financially, you know, not swimming in money, how or why they could adopt these things.  

And then we could also even imagine that there are even more important applications in the future that, you know, might require quite a bit of expense on developers as well as procurers of these things. You know, what’s your point of view on the I guess we would call this the ROI question about AI? 

MURRAY: Mm-hmm. I think this is a really challenging area because return on investment is very important to health systems that are trying to figure out how to spend a limited budget to improve care delivery. And so I think we’ve started to see a lot of small use cases that prove this technology could likely be beneficial.  

So there are use cases that you may have heard of from Dr. Longhurst around drafting responses to patient messages, for example, where we’ve seen that this technology is helpful but doesn’t get us all the way there. And that’s because these technologies are actually quite expensive. And when you want to process large amounts of data, that’s called tokens, and tokens cost money.  

And so I think one of the challenges when we envision the future of healthcare, we’re not really envisioning the expense of querying the entire medical record through a large language model. And we’re going to have to build systems from a technology standpoint that can do that work in a more affordable way for us to be able to deliver really high-value use cases to clinicians that involve processing that.  

And so those are use cases like summarizing large parts of the patient’s medical record, providing really meaningful clinical decision support that takes into account the patient’s entire medical history. We haven’t seen those types of use cases really come into being yet, largely because, you know, they’re technically a bit more complex to do well and they’re expensive, but they’re completely feasible. 

LEE: Yeah. You know, what you’re saying really resonates so strongly from the tech industry’s perspective. You know, one way that that problem manifests itself is shareholders in big tech companies like ours more or less expect … they’re paying a high premium—a high multiple on the share price—because they’re expecting our revenues to grow at very spectacular rates, double digit rates. But that isn’t obviously compatible with how healthcare works and the healthcare business works. It doesn’t grow, you know, at 30% year over year or anything like that.

And so how to make these things financially make sense for all comers. And it’s sort of part and parcel also with the problem that sometimes efficiency gains in healthcare just translate into heavier caseloads for doctors, which isn’t obviously the best outcome either. And so in a way, I think it’s another aspect of the work on impact and trustworthiness when we think about technology, at all, in healthcare.  

MURRAY: Mm-hmm. I think that’s right. And I think, you know, if you look at the difference between the AI scribe market and the rest of the summarization work that’s largely happening within the electronic health record, in the AI scribe market, you have a lot of independent companies, and they all are competing to be the best. And so because of that, we’re seeing the technology get more efficient, cheaper. There’s just a lot of investment in that space.  

Whereas like with the electronic health record providers, they’re also invested in really providing us with these tools, but it’s not their main priority. They’re delivering an entire electronic health record, and they also have to do it in a way that is affordable for, you know, all kinds of health systems, big UCSF health systems, smaller settings, and so there’s a real tension, I think, between delivering good-enough tools and truly transformative tools.  

LEE: So I want to go back for a minute to this idea of cognitive burden that you described. When we talk about cognitive burden, it’s often in the context of paperwork, right. There are maybe referral letters, after-visit notes, all of these things. How do you see these AI tools progressing with respect to that stream of different administrative tasks? 

MURRAY: These tools are going to continue to be optimized to do more and more tasks for us. So with AI scribes, for example, you know, we’re starting to look at whether it can draft the billing and coding information for the clinician, which is a tedious task with many clicks.  

These tools are poised to start pending orders based on the conversation. Again, a tedious task. All of this with clinician oversight. But I think as we move from them being AI scribes to AI assistants, it’s going to be like a helper on the side for clinicians doing more and more work so they can really focus on the conversations, the shared decision-making, and the reason they went into medicine really. 

LEE: Yeah, let me, since you mentioned AI assistants and that’s such an interesting word and it does connect with something that was apparent to us even, you know, as we were writing the book, which is this phenomenon that these AI systems might make mistakes.  

They might be guilty of making biased decisions or showing bias, and yet they at the same time seem incredibly effective at spotting other people’s mistakes or other people’s biased decisions. And so is there a point where these AI scribes do become AI assistants, that they’re sort of looking over a doctor’s shoulder and saying, “Hey, did you think about something else?” or “Hey, you know, maybe you’re wrong about a certain diagnosis?” 

MURRAY: Mm-hmm. I mean, absolutely. You’re just really talking about combining technologies that already exist into a more streamlined clinical care experience, right. So you can—and I already do this when I’m on rounds—I’ll kind of give the case to ChatGPT if it’s a complex case, and I’ll say, “Here’s how I’m thinking about it; are there other things?” And it’ll give me additional ideas that are sometimes useful and sometimes not but often useful, and I’ll integrate them into my conversation about the patient.  

I think all of these companies are thinking about that. You know, how do we integrate more clinical decision-making into the process? I think it’s just, you know, healthcare is always a little bit behind the technology industry in general, to say the least. And so it’s kind of one step at a time, and all of these use cases need a lot of validation. There’s regulatory issues, and so I think it’s going to take time for us to get there. 

LEE: Should I be impressed or concerned that the chief health AI officer at UC San Francisco Health is using ChatGPT off label? 

MURRAY: [LAUGHS] Well, I actually, every time I go on service, I encourage my residents to use it because I think we need to learn how to use these technologies. And, you know, when our medical education leaders start thinking about how do we teach students to use these, we don’t know how to teach students to use them if we’re not using them ourselves, right. And so I’ve learned a lot about what I perceive the strengths and limitations of the tools are.  

And I think … but you know, one of the things that we’ve learned is—and you’ve written about this in your book—but the prompting really matters. And so I had a resident ask it for a differential for abnormal liver tests. But in asking for that differential, there is a key important blood finding, something called eosinophilia. It’s a type of blood cell that was mildly, mildly elevated, and they didn’t know it. So they didn’t give it in the prompt, and as a result, they didn’t get the right differential, but it wasn’t actually ChatGPT’s fault. It just didn’t get the right information because the trainee didn’t recognize the right information. And so I think there’s a lot to learn as we practice using these tools clinically. So I’m not ashamed of it. [LAUGHS] 

LEE: [LAUGHS] Yeah. Well, in fact, I think my coauthor Carey Goldberg would find what you said really validating because in our book, she actually wrote this fictional account of what it might be like in the future. And this medical resident was also using an AI chatbot off label for pretty much the same kinds of purposes. And it’s these kinds of things that, you know, it seems like might be coming next. 

MURRAY: I mean, medicine, the practice of medicine, is a very imperfect science, and so, you know, when we have a difficult case, I might sit in the workroom with my colleagues and run it by people. And everyone has different thoughts and opinions on, you know, things I should check for. And so I think this is just one other resource where you can kind of run cases, obviously, just reviewing all of the outputs yourself. 

LEE: All right, so we’re running short on time and so I want to be a little provocative at the end here. And since we’ve gotten into AI assistants, two questions: First off, do we get to a point in the near future when it would be unthinkable and maybe even bordering on malpractice for a doctor not to use AI assistants in his or her daily work? 

MURRAY: So it’s possible that we see that in the future. We don’t see it right now. And that’s part of the reason we don’t force this on people. So we see AI scribes or AI assistants as a tool we offer to people to improve their daily work because we don’t have sufficient data that the outcomes are markedly better from using these tools.  

I think there is a future where specific, you know, tools do actually improve outcomes. And then their use should be incentivized either through, you know, CMS [Centers for Medicare & Medicaid Services] or other systems to ensure that, you know, we’re delivering standard of care. But we’re not yet at the place where any of these tools are standard of care, which means they should be used to practice good medicine. 

LEE: And I think I would say that it’s the work of people like you that would make it possible for these things to become standard of care. And so now, final provocation. It must have crossed your mind through all of this, the possibility that AI might replace doctors in some ways. What are your thoughts? 

MURRAY: I think we’re a long way from that happening, honestly. And I think even when I talk to my colleagues in radiology about this, where I perceive, as an internist, they might be the most replaceable, there’s a million reasons why that’s not the case. And so I think these tools are going to augment our work. They’re going to help us streamline access for patients. They’re going to maybe change what clinicians have to do, but I don’t think they’re going fully replace doctors. There’s just too much complexity and nuance in providing clinical care for these tools to do that work fully. 

LEE: Yeah, I think you’re right. And actually, you know, I think there’s plenty of evidence because in the history of modern medicine, we actually haven’t seen technology replace human doctors. Maybe you could say that we don’t use barbers for bloodletting anymore because of technology. But I think, as you say, we’re at least a long ways away.  

MURRAY: Yeah. 

LEE: Sara, this has been just a great conversation. And thank you for the great work that you’re doing, you know, and for being so open with us on your personal use of AI, but also how you see the adoption of AI in our health system. 

[TRANSITION MUSIC] 

MURRAY: Thank you, it was really great talking with you. 

LEE: I get so much out of talking to Sara. Every time, she manages to get me refocused on two things: the quality of the user experience and the importance of trust in any new technology that is brought into the clinic. I felt like there were several good takeaways from the conversation. One is that she really validated some predictions that Carey, Zak, and I made in our book, first and foremost, that automated note taking would be a highly desirable and practical reality. The other validation is Sara revealing that even she uses ChatGPT as a daily assistant in her clinical work, something that we guessed would happen in the book, but we weren’t really sure since health systems oftentimes are very locked down when it comes to the use of technological tools.  

And of course, maybe the biggest thing about Sara’s work is her role in defining a new type of job in healthcare, the health AI officer. This is something that Carey, Zak, and I didn’t see coming at all, but in retrospect, makes all the sense in the world. Taken together, these two conversations really showed that we were on the right track in the book. AI has made its way into day-to-day life and work in the clinic, and both doctors and patients seem to be appreciating it.  

[MUSIC TRANSITIONS TO THEME] 

I’d like to extend another big thank you to Chris and Sara for joining me on the show and sharing their insights. And to our listeners, thank you for coming along for the ride. We have some really great conversations planned for the coming episodes. We’ll delve into how patients are using generative AI for their own healthcare, the hype and reality of AI drug discovery, and more. We hope you’ll continue to tune in. Until next time. 

[MUSIC FADES] 

[1] The paper, “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum (opens in new tab),” presented the findings of a cross-sectional study that compared doctor responses to patient questions posted to the public forum r/AskDocs on Reddit to responses to the same questions generated by ChatGPT.

[2] Dr. John Ayers is corresponding author of the April 2023 paper “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.” (opens in new tab)

[3] These findings are detailed in the paper “AI-Generated Draft Replies Integrated Into Health Records and Physicians’ Electronic Communication,” (opens in new tab) published in JAMA Network Open.

The post The reality of generative AI in the clinic appeared first on Microsoft Research.

Read More

Claimify: Extracting high-quality claims from language model outputs

Claimify: Extracting high-quality claims from language model outputs

Gradient background transitioning from blue to pink with two white icons. The left icon depicts a network or molecule structure with interconnected nodes, and the right icon shows a laptop and the outline of a person.

While large language models (LLMs) are capable of synthesizing vast amounts of information, they sometimes produce inaccurate or unsubstantiated content. To mitigate this risk, tools like Azure AI’s Groundedness Detection (opens in new tab) can be used to verify LLM outputs. 

A common strategy for fact-checking LLM-generated texts – especially complex, highly detailed outputs – is claim extraction: instead of evaluating the entire text at once, it’s broken down into simple factual statements that can be verified independently. However, the effectiveness of this approach depends on the quality of the extracted claims: if they are inaccurate or incomplete, the fact-checking results may be compromised. 

Our paper “Towards Effective Extraction and Evaluation of Factual Claims” proposes a framework for evaluating claim extraction and introduces Claimify, a novel LLM-based method that outperforms prior solutions. In this blog post, we discuss common claim extraction issues and explore Claimify’s approach.

Key issues in claim extraction

As a case study, we prompted OpenAI’s GPT-4o model to “provide an overview of challenges in emerging markets” based on a collection of news articles. Here is its output:

# Emerging market challenges
Emerging markets face a myriad of challenges that span economic, social, and environmental dimensions. These challenges are often interconnected, exacerbating the difficulties faced by these regions.

## Economic instability and inflation
Several emerging markets are grappling with severe economic instability. For instance, Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.

## Food insecurity and agricultural challenges
Food insecurity is another critical issue in emerging markets. Nigeria, for example, is striving to become self-sufficient in wheat production but is hindered by climate change and violence, exacerbated by high grain prices due to the suspension of the Black Sea Grain Initiative. Climate change has played a pivotal role in creating food insecurity and economic instability in farming-dependent economies, such as Zambia and Mozambique.

## Natural disasters and public health emergencies
Public health crises, as demonstrated by the experience of Derna, Libya, remain a significant concern in emerging markets. Catastrophic flooding caused by the collapse of two dams has led to significant destruction, complicating access to healthcare. The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management. Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.

## Conclusion
Addressing emerging market challenges will require comprehensive strategies that consider economic stability, food security, and public health. International cooperation and targeted interventions are crucial in fostering sustainable development in emerging markets.

We then applied a simple prompt[1] that extracted the following claims from the answer:

1. Emerging markets face economic challenges.
2. Emerging markets face social challenges.
3. Emerging markets face environmental challenges.
4. Argentina experiences severe economic instability.
5. Argentina has a monthly inflation rate as high as 25.5%.
6. Argentina’s inflation has made many goods unobtainable.
7. Argentina’s currency value has plunged.
8. Some experts estimate Argentina’s annual inflation rate could reach 300%.
9. Food insecurity is a critical issue in emerging markets.
10. Nigeria is striving to become self-sufficient in wheat production.
11. Nigeria’s wheat production is hindered by climate change.
12. Nigeria’s wheat production is hindered by violence.
13. High grain prices affect Nigeria due to the suspension of the Black Sea Grain Initiative.
14. Climate change creates food insecurity in Zambia.
15. Climate change creates food insecurity in Mozambique.
16. Climate change causes economic instability in farming-dependent economies.
17. Public health crises are a concern in emerging markets.
18. Derna, Libya, experienced catastrophic flooding.
19. The collapse of two dams caused flooding in Derna, Libya.
20. Flooding in Derna, Libya, complicated access to healthcare.
21. The U.N. found contaminated water in Derna, Libya.
22. Contaminated water in Derna, Libya, caused many residents to fall ill.
23. Afghanistan has experienced challenges similar to those in Libya.
24. Sudan has experienced challenges similar to those in Libya.
25. Addressing emerging market challenges requires comprehensive strategies.
26. Comprehensive strategies should consider economic stability.
27. Comprehensive strategies should consider food security.
28. Comprehensive strategies should consider public health.
29. International cooperation is crucial for sustainable development in emerging markets.
30. Targeted interventions are crucial for sustainable development in emerging markets.

These claims appear to reflect the answer, but a closer look reveals four critical issues:

# Issue Example
1 Some claims are not factual statements that can be verified as true or false. Excerpt:Addressing emerging market challenges will require comprehensive strategies that consider economic stability, food security, and public health.

Claims:

  • Addressing emerging market challenges requires comprehensive strategies.
  • Comprehensive strategies should consider economic stability.
  • Comprehensive strategies should consider food security.
  • Comprehensive strategies should consider public health.

Explanation: These claims are not verifiable because they are opinions.

2 Some claims are missing or incomplete. Excerpt:Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.

Claims:

  • Argentina has a monthly inflation rate as high as 25.5%.
  • Argentina’s inflation has made many goods unobtainable.
  • Argentina’s currency value has plunged.
  • Some experts estimate Argentina’s annual inflation rate could reach 300%.

Explanation: The phrases “causing severe economic hardship” and “others predict even higher rates” are not reflected in any of the claims. The third claim also omits the fact that inflation caused the currency depreciation.

3 Some claims are inaccurate. Excerpt: The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management.”

Claims:

  • The U.N. found contaminated water in Derna, Libya.
  • Contaminated water in Derna, Libya, caused many residents to fall ill.

Explanation: The first claim is inaccurate because the U.N. found the link between contaminated water and illness, not the contaminated water itself. The second claim also misrepresents the sentence since it shifts the meaning from a viewpoint of a specific entity (the U.N.) to a general assertion about the effects of contaminated water in Derna, Libya.

4 Some claims cannot be understood without additional context. Excerpt: Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.

Claims:

  • Afghanistan has experienced challenges similar to those in Libya.
  • Sudan has experienced challenges similar to those in Libya.

Explanation: These claims cannot be understood on their own because “those” is not defined.

Introducing Claimify

The case study highlights that claim extraction is surprisingly error-prone. Our paper demonstrates that the issues identified above are common across LLM-based claim extraction methods. To minimize these errors, we created a system called Claimify[2].

Core principles

Claimify is an LLM-based claim extraction system built on the following principles:

# Principle Example
1 The claims should capture all verifiable content in the source text and exclude unverifiable content. In the sentence “The partnership between John and Jane illustrates the importance of collaboration,” the only verifiable content is the existence of a partnership between John and Jane. The rest is subjective interpretation.
2 Each claim should be entailed (i.e., fully supported) by the source text. Consider the sentence “Governments are curtailing emissions from cars and trucks, which are the largest source of greenhouse gases from transportation.” The following claims are incorrect:

  • Cars are the largest source of greenhouse gases from transportation.
  • Trucks are the largest source of greenhouse gases from transportation.

The sentence attributes the highest emissions to cars and trucks collectively, not individually.

3 Each claim should be understandable on its own, without additional context. The claim “They will update the policy next year” is not understandable on its own because it’s unclear what “They,” “the policy,” and “next year” refer to.
4 Each claim should minimize the risk of excluding critical context. Suppose the claim “The World Trade Organization has supported trade barriers” was extracted from the sentence “An exception to the World Trade Organization’s open-market philosophy is its history of supporting trade barriers when member countries have failed to comply with their obligations.” A fact-checking system would likely classify the claim as false, since there is extensive evidence that the WTO aims to reduce trade barriers. However, if the claim had specified that the WTO has supported trade barriers “when member countries have failed to comply with their obligations,” it would likely have been classified as true. This example demonstrates that missing context can distort the fact-checking verdict.
5 The system should flag cases where ambiguity cannot be resolved. The sentence “AI has advanced renewable energy and sustainable agriculture at Company A and Company B” has two mutually exclusive interpretations:

  • AI has advanced renewable energy and sustainable agriculture at both Company A and Company B.
  • AI has advanced renewable energy at Company A and sustainable agriculture at Company B.

If the context does not clearly indicate that one of these interpretations is correct, the system should flag the ambiguity instead of picking one interpretation arbitrarily.

Implementation

Claimify accepts a question-answer pair as input and performs claim extraction in four stages, illustrated in Figure 1:

# Stage Description
1 Sentence splitting and context creation The answer is split into sentences, with “context” – a configurable combination of surrounding sentences and metadata (e.g., the header hierarchy in a Markdown-style answer) – created for each sentence.
2 Selection An LLM identifies sentences that do not contain verifiable content. These sentences are labeled “No verifiable claims” and excluded from subsequent stages. When sentences contain verifiable and unverifiable components, the LLM rewrites the sentence, retaining only the verifiable components.
3 Disambiguation For sentences that passed the Selection stage, an LLM detects ambiguity and determines if it can be resolved using the context. If all ambiguity is resolvable, the LLM returns a disambiguated version of the sentence. Otherwise, the sentence is labeled “Cannot be disambiguated” and excluded from the Decomposition stage.
4 Decomposition For sentences that are unambiguous or were disambiguated, an LLM creates standalone claims that preserve critical context. If no claims are extracted, the sentence is labeled “No verifiable claims.”
A flowchart outlining Claimify’s stages for extracting claims from a question-answer pair. The process begins by splitting the answer into sentences and creating context. Next, the Selection stage asks if a sentence contains any verifiable content. If no, the sentence is labeled
Figure 1: Overview of Claimify’s stages

Results

In our paper, we demonstrate that Claimify outperforms existing LLM-based methods[3]. Specifically, we show that: (1) 99% of claims extracted by Claimify are entailed by their source sentence, (2) Claimify strikes the best balance between including verifiable content and excluding unverifiable content, and (3) Claimify is least likely to omit context critical to the fact-checking verdict.

For the above case study on challenges in emerging markets, here are Claimify’s outputs, with source sentences preceded by a letter and claims numbered[4]:

A. Several emerging markets are grappling with severe economic instability.
1. Several emerging markets are grappling with severe economic instability.

B. For instance, Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship.
1. Argentina has rampant inflation.
2. The monthly inflation rates in Argentina have reached as high as 25.5%.
3. Inflation has made many goods unobtainable in Argentina.
4. Inflation has plunged the value of the currency in Argentina.
5. Inflation has caused severe economic hardship in Argentina.

C. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.
1. Some experts estimate that Argentina’s annual inflation rate could double to 300% in the future.
2. Some experts predict that Argentina’s annual inflation rate could be higher than 300% in the future.

D. Nigeria, for example, is striving to become self-sufficient in wheat production but is hindered by climate change and violence, exacerbated by high grain prices due to the suspension of the Black Sea Grain Initiative.
1. Nigeria is striving to become self-sufficient in wheat production.
2. Nigeria is hindered by climate change in becoming self-sufficient in wheat production.
3. Nigeria is hindered by violence in becoming self-sufficient in wheat production.
4. High grain prices exacerbate the hindrance to Nigeria’s efforts to become self-sufficient in wheat production.
5. The suspension of the Black Sea Grain Initiative is a reason for high grain prices.

E. Climate change has played a pivotal role in creating food insecurity and economic instability in farming-dependent economies, such as Zambia and Mozambique.
1. Climate change has played a role in creating food insecurity in farming-dependent economies.
2. Zambia is a farming-dependent economy where climate change has played a role in creating food insecurity.
3. Mozambique is a farming-dependent economy where climate change has played a role in creating food insecurity.
4. Climate change has played a role in creating economic instability in farming-dependent economies.
5. Zambia is a farming-dependent economy where climate change has played a role in creating economic instability.
6. Mozambique is a farming-dependent economy where climate change has played a role in creating economic instability.

F. Public health crises, as demonstrated by the experience of Derna, Libya, remain a significant concern in emerging markets.
1. Public health crises are a concern in emerging markets.
2. Derna, Libya, is an example of a public health crisis in emerging markets.

G. Catastrophic flooding caused by the collapse of two dams has led to significant destruction, complicating access to healthcare.
1. There was catastrophic flooding in Derna, Libya.
2. The flooding in Derna, Libya, was caused by the collapse of two dams.
3. The flooding in Derna, Libya, has led to significant destruction.
4. The flooding in Derna, Libya, has complicated access to healthcare.

H. Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.
1. Afghanistan has experienced challenges related to public health crises.
2. Afghanistan has experienced challenges related to catastrophic flooding.
3. Afghanistan has experienced challenges related to contaminated water.
4. Sudan has experienced challenges related to public health crises.
5. Sudan has experienced challenges related to catastrophic flooding.
6. Sudan has experienced challenges related to contaminated water.

Note that the baseline prompt extracted several claims from the sentence “The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management,” but it ignored the phrase “highlighting the need for improved water management.” It also failed to capture that the contaminated water resulted from flooding, as implied by “resulting” in the original sentence.

Claimify took a different approach. First, it found two instances of ambiguity – “resulting contaminated water” and “many residents” – that it determined could be resolved using the context. Here’s an excerpt from its reasoning: “…the context specifies that the contaminated water is a result of the catastrophic flooding in Derna, Libya, and the residents are those of Derna, Libya.

However, it also found an instance of ambiguity – “highlighting the need for improved water management” – where it concluded that the context does not definitively support a single interpretation: “The sentence could be interpreted as: (1) The U.N. found that the contaminated water caused illness and also highlighted the need for improved water management, (2) The U.N. only found that the contaminated water caused illness, while the need for improved water management is an implication or conclusion drawn by the writer. Readers … would likely fail to reach consensus about the correct interpretation of this ambiguity.” As a result, Claimify labeled the sentence “Cannot be disambiguated” at the Disambiguation stage and did not proceed to the Decomposition stage. 

To the best of our knowledge, Claimify is the first claim extraction system that identifies when the source text has multiple possible interpretations and extracts claims only when there is high confidence in the correct interpretation.

Next steps

We’re currently working on new methods for evaluating LLM-generated texts. We anticipate that the high-quality claims extracted by Claimify will help not only in verifying the veracity of LLM outputs, but also in assessing their overall quality – especially when gold-standard references are difficult to create (e.g., long-form texts where people may disagree on what defines “good” content). For example, we recently used Claimify to evaluate the comprehensiveness and diversity of answers generated by GraphRAG, showing that GraphRAG outperforms traditional Retrieval Augmented Generation (RAG) in these areas.

For an in-depth discussion of Claimify and our evaluation framework, please see our paper “Towards Effective Extraction and Evaluation of Factual Claims.”


[1] (opens in new tab) We used the “proposition chunking” prompt from NirDiamant’s RAG Techniques repository (opens in new tab). We generated multiple responses using GPT-4o, then picked the response that was most representative of the samples.

[2] Claimify is currently used for research purposes only and is not available commercially.

[3] (opens in new tab) We benchmarked Claimify against VeriScore (opens in new tab), DnD (opens in new tab), SAFE (opens in new tab), AFaCTA (opens in new tab), and Factcheck-GPT (opens in new tab).

[4] The outputs were generated using GPT-4o. Sentences not shown were either labeled “No verifiable claims” or “Cannot be disambiguated.”

The post Claimify: Extracting high-quality claims from language model outputs appeared first on Microsoft Research.

Read More

Metasurface: Unlocking the future of wireless sensing and communication

Metasurface: Unlocking the future of wireless sensing and communication

The image features three white icons on a gradient background transitioning from blue on the left to green on the right. The first icon, located on the left, represents a Wi-Fi signal with curved lines radiating from a central point. The middle icon depicts a satellite with solar panels and an antenna emitting waves. The third icon, on the right, shows a bar chart with ascending bars indicating signal strength.

As the demand for faster, more reliable wireless communication continues to grow, traditional systems face limitations in efficiency and adaptability. To keep up with evolving needs, researchers are investigating new ways to manipulate electromagnetic waves to improve wireless performance.

One solution involves metasurfaces—engineered materials that can control wave propagation in unprecedented ways. By dynamically shaping and directing electromagnetic waves, they can overcome the constraints of conventional wireless systems.

Building on these capabilities, we are developing metasurfaces for a wide range of wireless application scenarios. Notably, we have developed metasurfaces for enhancing low earth orbit satellite communication, optimizing acoustic sensing, realizing acoustic and mmWave imaging using commodity devices. More recently, we have designed metasurfaces to enable indoor Global Navigation Satellite System (GNSS), offer good mmWave coverage over a target environment, optimize heat distribution inside a microwave oven, and deliver directional sound to a user without a headphone.

All these works, published at top networking conferences, including MobiCom 2023 & 2024, MobiSys 2024 & 2025, and NSDI 2023, demonstrate the transformative potential of metasurfaces in advancing wireless communication and sensing technologies. This blog post explores some of these technologies in more detail.

Microsoft research podcast

What’s Your Story: Lex Story

Model maker and fabricator Lex Story helps bring research to life through prototyping. He discusses his take on failure; the encouragement and advice that has supported his pursuit of art and science; and the sabbatical that might inspire his next career move.


Metasurfaces optimize GNSS for accurate indoor positioning

While GNSS is widely used for outdoor positioning and navigation, its indoor performance is often hindered by signal blockage, reflection, and attenuation caused by physical obstacles. Additional technologies like Wi-Fi and Bluetooth Low Energy (BLE) are often employed to address these issues. However, these solutions require extra infrastructure, are costly, and are complicated to deploy. Accurate positioning also typically depends on specialized hardware and software on mobile devices. 

Despite these challenges, GNSS signals hold promise for accurate indoor positioning. By leveraging the vast number of available satellites, GNSS-based solutions eliminate the need for base station deployment and maintenance required by Wi-Fi and BLE systems. This approach also allows seamless integration between indoor and outdoor environments, supporting continuous positioning in scenarios like guiding smart vehicles through indoor and outdoor industrial environments. 

To explore this potential, we conducted indoor measurements and found that GNSS satellite signals can penetrate windows at different angles and reflect or diffract from surfaces like floors and ceilings, resulting in uneven signals. Metasurfaces can control structured arrays of electromagnetic signals, allowing them to capture and redirect more GNSS signals. This allows signals to enter buildings in a path parallel to the ground, achieving broader coverage. Using this capability, we developed a GNSS positioning metasurface system (GPMS) based on passive metasurface technology.

One limitation of passive metasurfaces is their lack of programmability. To overcome this and enable them to effectively guide signals from different angles and scatter them in parallel, we designed a two-layer metasurface system. As shown in Figure 1, this design ensures that electromagnetic waves from different angles follow similar emission trajectories.  

A diagram showing the optimization of metasurfaces for enhancing GNSS signals indoors. It includes two GNSS satellites, far-field channels, a near-field channel matrix, a passive metasurface grid, and colorful 3D waveforms. The target radiation matrix is shown with indoor users. The text reads: “Optimization problem: The radiation output of our designed metasurfaces should all be close to the target radiation for GNSS signal input at all incidence angles.”
Figure 1: The GPMS two-layer metasurface structure

To improve positioning accuracy, we developed new algorithms that allow signals to pass through metasurfaces, using them as anchor points. Traditional GPS positioning requires signals from at least four satellites to decode location information. In the GPMS system, illustrated in Figure 2, each deployed metasurface functions as a virtual satellite. By deploying at least three metasurfaces indoors, we achieved high-precision positioning through a triangulation algorithm.

The image depicts a shopping mall indoor environment with three metasurfaces labeled Metasurface 1, Metasurface 2, and Metasurface 3. Each metasurface is associated with a steering and scattering area, labeled Steering and scattering area 1, Steering and scattering area 2, and Steering and scattering area 3 respectively. GNSS satellites are shown outside the building. The image illustrates how GNSS signals interact with metasurfaces within an indoor environment.
Figure 2. Diagram of the GPMS system. Passive metasurfaces guide GNSS signals indoors, while enhanced positioning algorithms provide precise indoor positioning on mobile devices. 

To evaluate the system, we deployed the GPMS with six metasurfaces on a 10×50-meter office floor and a 15×20-meter conference hall. The results show significant improvements in signal quality and availability. C/N₀, a measure of signal-to-noise ratio, increased from 9.1 dB-Hz to 32.2 dB-Hz. The number of visible satellites increased from 3.6 to 21.5. Finally, the absolute positioning error decreased from 30.6 meters to 3.2 meters in the office and from 11.2 meters to 2.7 meters in the conference hall. These findings are promising and highlight the feasibility and advantages of GNSS-based metasurfaces for indoor positioning. 

Metasurfaces extend millimeter-wave coverage

Millimeter waves enable the high-speed, low-latency performance needed for 5G and 6G communication systems. While commercial products like 60 GHz Wi-Fi routers and mobile devices are becoming popular, their limited coverage and susceptibility to signal obstruction restrict their widespread application. 

Traditional solutions include deploying multiple millimeter-wave access points, such as routers or base stations, or placing reflective metal panels in room corners to reflect electromagnetic waves. However, these approaches are both costly and offer limited performance. Metasurfaces offer a promising alternative for improving millimeter-wave applications. Previous research has shown that programmable metasurfaces can enhance signal coverage in blind spots and significantly improve signal quality and efficiency.  

To maximize the benefits of metasurfaces, we developed the AutoMS automation service framework, shown in Figure 3. This proposed framework can optimize millimeter-wave coverage using low-cost passive metasurface design and strategic placement. 

The three main components of AutoMS can address the limitations of traditional solutions: 

  1. Automated joint optimization: AutoMS determines the optimal network deployment configuration by analyzing phase settings, metasurface placement, and access point positioning. It also refines beam-forming configurations to enhance signal coverage. By iteratively identifying and optimizing the number, size, and placement of metasurfaces, AutoMS adjusts the metasurface phase settings and the access point’s configurations to achieve optimal signal coverage. 
A flowchart diagram illustrating the AutoMS framework, which generates optimized passive metasurface and access point deployment plans for a specific 3D model based on environmental scanning results. The process starts with an environment scan, producing a 3D model and reflection coefficients. This information feeds into wireless channel modeling, which along with deployment configurations, is optimized by a hyper-configuration tuner. The output includes phase maps used by the surface and AP optimizer. The optimized deployment configurations are then used for metasurface fabrication and network deployment.
Figure 3. The AutoMS framework generates optimized deployment plans for passive metasurface and access points based on environment scanning results. 
  1. Fast 3D ray tracing simulator: Using hardware and software acceleration, our simulator efficiently calculates channel matrices resulting from metasurfaces with tens of thousands of elements. This simulator, capable of tracing 1.3 billion rays in just three minutes on an A100 GPU, significantly accelerates calculations for complex environments.
  1. Low-cost passive metasurface design: We designed a high-reflectivity passive metasurface with near-2π phase control and broadband compatibility for the millimeter-wave frequency band. This metasurface is compatible with low-precision, cost-effective thermoforming processes. This process enables users to create metasurfaces at minimal cost, significantly reducing deployment expenses.

    Shown in Figure 4, users can capture the environment using existing 3D scanning apps on mobile devices, generate a 3D layout model, and upload it to the cloud. AutoMS then generates metasurface settings and placement guidelines.  

    Users can print metasurface patterns using hot stamping and customize them without affecting functionality, as millimeter waves penetrate paint and paper. 

A step-by-step process for creating low-cost passive metasurfaces. Step 1: Print patterns on paper with a laser printer. Step 2: Hot stamp aluminum foil on paper with a laminator. Step 3: Tear the aluminum foil off to get the metallic patterns. Step 4: Paste patterns on the plastic sheet and aluminum board.
Figure 4: The low-cost passive metasurface creation process 

Evaluation using publicly available 3D layout datasets and real-world tests shows that AutoMS significantly improves millimeter-wave coverage across various scenarios. Compared to a single router setup, AutoMS increased signal strength by 12.1 dB. Onsite tests further confirmed gains of 11 dB in target areas and over 20 dB in blind spots, with signal throughput increasing from 77 Mbps to 373 Mbps. AutoMS adapts to diverse environments, ensuring reliable and flexible deployment in real-world applications. 

Metasurfaces support uniform heating in microwave ovens 

Microwave ovens often heat unevenly, creating cold spots in food. These can allow harmful bacteria and other pathogens to survive, increasing the risk of foodborne illnesses. Uneven heating can cause eggs to burst or create “hot spots” that can scald.

Uneven heating is due to the appliance’s heating mechanism. Microwave ovens generate high-power radio frequency (RF) electromagnetic waves through dielectric heating. These waves create nodes with zero amplitude, which prevents heating. They also create antinodes, where heating occurs more rapidly.  

To address this issue, we developed MicroSurf, a low-cost solution that improves heating by using passive metasurfaces to control electromagnetic energy inside the microwave oven. It uses the resonance effect between the metasurface and electromagnetic waves to modify the standing-wave distribution and achieve more uniform heating. This is shown in Figure 5. 

A diagram illustrating the working principle of MicroSurf in four parts. A shows an uneven electric field distribution inside a microwave oven leading to uneven heating, with images of a microwave and thermal images of food. B depicts accurate modeling of the microwave oven, including geometry refinement, dielectric factor tuning, and frequency tuning. C involves designing and optimizing a metasurface that can function in a high-power environment to change the standing wave distribution, with an image of a high-power phase-tuning metasurface. D demonstrates achieving uniform heating of different foods and selectively heating specific parts of food, with thermal images showing uniform heating results.
Figure 5: MicroSurf’s working principle: Uneven electric field distribution inside the microwave oven leads to uneven heating. B. Modeling the microwave oven. C. Designing and optimizing a metasurface that can function in a high-power environment to change the standing wave distribution. D. Achieving uniform heating of different foods and selectively heating specific parts. 

Tests across four different microwave oven brands demonstrate that MicroSurf effectively optimizes heating for various liquids and solids, uniformly heating water, milk, bread, and meat. It concentrates heat on specific areas and adapts to differently shaped foods. MicroSurf offers a promising solution for even heating in microwave ovens, demonstrating the potential of metasurface technology in everyday applications. This innovation paves the way for smarter, more efficient home appliances.  

Advancing wireless innovation

Wireless sensing and communication technologies are evolving rapidly, driving innovation across a wide range of applications. We are continuing to push the boundaries of these technologies—particularly in metasurface development—while working to create practical solutions for a variety of use cases. 

The post Metasurface: Unlocking the future of wireless sensing and communication appeared first on Microsoft Research.

Read More

Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs

Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs

KBLaM blog | A flowchart illustrating the process of handling a prompt using a language model. The process begins with documents being used to construct and summarize a knowledge base (KB) offline. The summarized KB is then encoded and fed into the main process. A prompt goes through a tokenizer, followed by rectangular attention, and then into the large language model (LLM). The LLM retrieves information from the encoded KB to generate an answer.

Large language models (LLMs) have demonstrated remarkable capabilities in reasoning, language understanding, and even creative tasks. Yet, a key challenge persists: how to efficiently integrate external knowledge.

Traditional methods such as fine-tuning and Retrieval-Augmented Generation (RAG) come with trade-offs—fine-tuning demands costly retraining, while RAG introduces separate retrieval modules that increase complexity and prevent seamless, end-to-end training. In-context learning, on the other hand, becomes increasingly inefficient as knowledge bases grow, facing quadratic computational scaling that hinders its ability to handle large repositories. A comparison of these approaches can be seen in Figure 1.

A new way to integrate knowledge

To address these challenges, we introduce the Knowledge Base-Augmented Language Model (KBLaM) —a novel approach that integrates structured knowledge bases into pre-trained LLMs. Instead of relying on external retrieval modules or costly fine-tuning, KBLaM encodes knowledge into continuous key-value vector pairs, efficiently embedding them within the model’s attention layers using a specialized rectangular attention mechanism, which implicitly performs retrieval in an integrated manner.

We use structured knowledge bases to represent the data, allowing us to consolidate knowledge and leverage structure. This design allows it to scale linearly with the size of the knowledge base while maintaining dynamic updates without retraining, making it far more efficient than existing methods.

Microsoft research podcast

NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

Just after his NeurIPS 2024 keynote on the co-evolution of systems and AI, Microsoft CVP Lidong Zhou joins the podcast to discuss how rapidly advancing AI impacts the systems supporting it and the opportunities to use AI to enhance systems engineering itself.


Scalable, efficient, and future-ready

At its core, KBLaM is designed to integrate structured knowledge into LLMs, making them more efficient and scalable. It achieves this by converting external knowledge bases—collections of facts structured as triples consisting of an entity, a property, and a value—into a format that LLMs can process naturally.  Such knowledge bases allow for consolidated, reliable sources of knowledge.

To create these knowledge bases, we first extract structured data in JSON format using small language models. We then apply Project Alexandria’s probabilistic clustering. Once we have this structured knowledge base, KBLaM follows a three-step pipeline:

  1. Knowledge Encoding: Each knowledge triple is mapped into a key-value vector pair using a pre-trained sentence encoder with lightweight linear adapters. The key vector, derived from the entity name and property, encodes “index information,” while the value vector captures the corresponding property value. This allows us to create continuous, learnable key-value representations.
  2. Integration with LLMs: These key-value pairs, or knowledge tokens, are augmented into the model’s attention layers using a specialized rectangular attention structure. Unlike traditional transformer models that process all tokens equally and come with quadratic cost—such as GPT-4, Phi, and Llama—rectangular attention enables the model to attend over knowledge with linear cost, as illustrated in Figure 2. Compared to standard attention mechanisms in generative language models, where each token attends to all preceding tokens, our approach introduces a more efficient structure. In this setup, language tokens (such as those from a user’s question) attend to all knowledge tokens. However, knowledge tokens do not attend to one another, nor do they attend back to the language tokens. This selective attention pattern significantly reduces computational cost while preserving the model’s ability to incorporate external knowledge effectively.

    This linear cost, which is crucial for the efficiency of KBLaM, effectively amounts to treating each fact independently—an assumption that holds for most facts. For example, the model’s name, KBLaM, and the fact that the research was conducted at Microsoft Research are very weakly correlated. This rectangular attention is implemented as an extension of standard attention. During training, we keep the base model’s weights frozen, ensuring that when no knowledge tokens are provided, the model functions exactly as it did originally.

  3. Efficient Knowledge Retrieval: Through this rectangular attention, the model learns to dynamically retrieve relevant knowledge tokens during inference, eliminating the need for separate retrieval steps.
Figure 1: A diagram comparing KBLaM and existing approaches. With RAG, we take the user’s prompt and use that to retrieve relevant documents from an external corpus using some retriever module, and append a tokenized version of those relevant documents in the context. This is relatively cheap, but requires many components. On the other hand, In Context Learning just puts the entire corpus into the context. This is simple, involving only one component, but is expensive. Our method, KBLaM, makes a structured knowledge base from the documents in an offline process, and includes the entire knowledge base to the context, while using a novel variant of attention, rectangular attention, so that the cost is linear in the size of the knowledge base. This results in a system where the retrieval only requires a single, trainable component, that is also cheap.
Figure 1: KBLaM allows for attention over the entire knowledge base instead of having an external retriever.
Figure 2: A diagram illustrating rectangular attention. Unlike regular attention, the attention matrix is not square, as we remove the parts where the knowledge base would attend over itself. This allows for KBLaM to scale linearly with the number of items in its context.
Figure 2: By having the user’s question attend to the knowledge base, while treating facts in the knowledge base independently, KBLaM scales efficiently and linearly with the size of the knowledge base.

Unlike RAG, which appends retrieved document chunks to prompts, KBLaM allows for direct integration of knowledge into the model. Compared to in-context learning,  KBLaM’s rectangular attention maintains a linear memory footprint, making it vastly more scalable for large knowledge bases. 

Its efficiency is a game-changer. While traditional in-context learning methods struggle with quadratic memory growth due to self-attention overhead, KBLaM’s linear overhead means we can store much more knowledge in the context. In practice, this means KBLaM can store and process over 10,000 knowledge triples, the equivalent of approximately 200,000 text tokens on a single GPU—a feat that would be computationally prohibitive with conventional in-context learning. The results across a wide range of triples and can be seen in Figure 3. Remarkably, it achieves this while extending a base model that has a context length of only 8K tokens. Additionally, KBLaM enables dynamic updates: modifying a single knowledge triple does not require retraining or re-computation of the entire knowledge base. 

Figure 3: Two graphs, showing time to first token, and memory usage for both KBLaM and RAG. KBLaM’s time to first token remains relatively constant across a large range of knowledge base sizes, with the time-to-first-token with 4096 triples in the context being lower than that of conventional RAG with 5 triples in the context. The memory usage is also much lower, with KBLaM with 512 triples having a similar memory usage to RAG at 5 triples.
Figure 3: KBLaM is much faster and uses much less memory than adding the equivalent number of triples in the context using conventional RAG-like approaches. In particular, we have lower time to first token with 4,096 tripes in the context with KBLaM than we would with 5 triples in the context.

Enhancing interpretability and reliability

Another major benefit of KBLaM is its interpretability. Unlike in-context learning, where knowledge injection is opaque, KBLAM’s attention weights provide clear insights into how the model utilizes knowledge tokens. Experiments show that KBLaM assigns high attention scores to relevant knowledge triples, effectively mimicking a soft retrieval process.

Furthermore, KBLaM enhances model reliability by learning through its training examples when not to answer a question if the necessary information is missing from the knowledge base. In particular, with knowledge bases larger than approximately 200 triples, we found that the model refuses to answer questions it has no knowledge about more precisely than a model given the information as text in context. This feature helps reduce hallucinations, a common problem in LLMs that rely on internal knowledge alone, making responses more accurate and trustworthy.

The future of knowledge-augmented AI

KBLaM represents a major step forward in integrating structured knowledge into LLMs. By offering a scalable, efficient, and interpretable alternative to existing techniques, it paves the way for AI systems that can stay up to date and provide reliable, knowledge-driven responses. In fields where accuracy and trust are critical—such as medicine, finance, and scientific research—this approach has the potential to transform how language models interact with real-world information.

As AI systems increasingly rely on dynamic knowledge rather than static model parameters, we hope KBLaM will serve as a bridge between raw computational power and real-world understanding.

However, there is still work to be done before it can be deployed at scale. Our current model has been trained primarily on factual question-answer pairs, and further research is needed to expand its capabilities across more complex reasoning tasks and diverse knowledge domains.

To accelerate progress, we are releasing KBLaM’s code and datasets (opens in new tab) to the research community, and we are planning integrations with the Hugging Face transformers library. By making these resources available, we hope to inspire further research and adoption of scalable, efficient knowledge augmentation for LLMs. The future of AI isn’t just about generating text—it’s about generating knowledge that is accurate, adaptable, and deeply integrated with the evolving world. KBLaM is a step in that direction.

The post Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs appeared first on Microsoft Research.

Read More

Semantic Telemetry: Understanding how users interact with AI systems

Semantic Telemetry: Understanding how users interact with AI systems

Semantic Telemetry blog | diagram showing relationships between chat, LLM prompt, and labeled data

AI tools are proving useful across a range of applications, from helping to drive the new era of business transformation to helping artists craft songs. But which applications are providing the most value to users? We’ll dig into that question in a series of blog posts that introduce the Semantic Telemetry project at Microsoft Research. In this initial post, we will introduce a new data science approach that we will use to analyze topics and task complexity of Copilot in Bing usage.

Human-AI interactions can be iterative and complex, requiring a new data science approach to understand user behavior to build and support increasingly high value use cases. Imagine the following chat:

Example chat between user and AI

Here we see that chats can be complex and span multiple topics, such as event planning, team building, and logistics. Generative AI has ushered in a two-fold paradigm shift. First, LLMs give us a new thing to measure, that is, how people interact with AI systems. Second, they give us a new way to measure those interactions, that is, they give us the capability to understand and make inferences on these interactions, at scale. The Semantic Telemetry project has created new measures to classify human-AI interactions and understand user behavior, contributing to efforts in developing new approaches for measuring generative AI (opens in new tab) across various use cases.

Semantic Telemetry is a rethink of traditional telemetry–in which data is collected for understanding systems–designed for analyzing chat-based AI. We employ an innovative data science methodology that uses a large language model (LLM) to generate meaningful categorical labels, enabling us to gain insights into chat log data.

Flow chart illustrating the LLM classification process starting with chat input, then prompting LLM with chat using generated label taxonomy, and output is the labeled chat.
Figure 1: Prompting an LLM to classify a conversation based on LLM generated label taxonomy

This process begins with developing a set of classifications and definitions. We create these classifications by instructing an LLM to generate a short summary of the conversation, and then iteratively prompting the LLM to generate, update, and review classification labels on a batched set of summaries. This process is outlined in the paper: TnT-LLM: Text Mining at Scale with Large Language Models. We then prompt an LLM with these generated classifiers to label new unstructured (and unlabeled) chat log data.

Description of LLM generated label taxonomy process

With this approach, we have analyzed how people interact with Copilot in Bing. In this blog, we examine insights into how people are using Copilot in Bing, including how that differs from traditional search engines. Note that all analyses were conducted on anonymous Copilot interactions containing no personal information.

Topics

To get a clear picture of how people are using Copilot in Bing, we need to first classify sessions into topical categories. To do this, we developed a topic classifier. We used the LLM classification approach described above to label the primary topic (domain) for the entire content of the chat. Although a single chat can cover multiple topics, for this analysis, we generated a single label for the primary topic of the conversation. We sampled five million anonymized Copilot in Bing chats during August and September 2024, and found that globally, 21% of all chats were about technology, with a high concentration of these chats in programming and scripting and computers and electronics.

Bubble chart showing topics based on percentage of sample. Primary topics shown are Technology (21%), Entertainment (12.8%), Health (11%), Language, Writing, & Editing (11.6%), Lifestyle (9.2%), Money (8.5%), History, Events, & Law (8.5%), Career (7.8%), Science (6.3%)
Figure 2: Top Copilot in Bing topics based on anonymized data (August-September 2024)
Bubble chart of Technology topic showing subtopics: Programming & scripting, Computers & electronics, Engineering & design, Data analysis, and ML & AI.
Figure 3: Frequent topic summaries in Technology
Bubble chart of Entertainment showing subtopics: Entertainment, Sports & fitness, Travel & tourism, Small talk & chatbot, and Gaming
Figure 4: Frequent topic summaries in Entertainment

Diving into the technology category, we find a lot of professional tasks in programming and scripting, where users request problem-specific assistance such as fixing a SQL query syntax error. In computers and electronics, we observe users getting help with tasks like adjusting screen brightness and troubleshooting internet connectivity issues. We can compare this with our second most common topic, entertainment, in which we see users seeking information related to personal activities like hiking and game nights.

We also note that top topics differ by platform. The figure below depicts topic popularity based on mobile and desktop usage. Mobile device users tend to use the chat for more personal-related tasks such as helping to plant a garden or understanding medical symptoms whereas desktop users conduct more professional tasks like revising an email.

Sankey visual showing top topics for Desktop and Mobile users
Figure 5: Top topics for desktop users and mobile users

Microsoft research blog

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

PromptWizard from Microsoft Research is now open source. It is designed to automate and simplify AI prompt optimization, combining iterative LLM feedback with efficient exploration and refinement techniques to create highly effective prompts in minutes.


Search versus Copilot

Beyond analyzing topics, we compared Copilot in Bing usage to that of traditional search. Chat extends beyond traditional online search by enabling users to summarize, generate, compare, and analyze information. Human-AI interactions are conversational and more complex than traditional search (Figure 6).

Venn diagram showing differences between Bing Search and Copilot in Bing, with intersection in information lookup.
Figure 6: Bing Search Query compared to Copilot in Bing Conversation

A major differentiation between search and chat is the ability to ask more complex questions, but how can we measure this? We think of complexity as a scale ranging from simply asking chat to look up information to evaluating several ideas. We aim to understand the difficulty of a task if performed by a human without the assistance of AI. To achieve this, we developed the task complexity classifier, which assesses task difficulty using Anderson and Krathwohl’s Taxonomy of Learning Objectives (opens in new tab). For our analysis, we have grouped the learning objectives into two categories: low complexity and high complexity. Any task more complicated than information lookup is classified as high complexity. Note that this would be very challenging to classify using traditional data science techniques.

Description of task complexity and 6 categories of the Anderson and Krathwohl’s Taxonomy of Learning Objectives

Comparing low versus high complexity tasks, most chat interactions were categorized as high complexity (78.9%), meaning that they were more complex than looking up information. Programming and scripting, marketing and sales, and creative and professional writing are topics in which users engage in higher complexity tasks (Figure 7) such as learning a skill, troubleshooting a problem, or writing an article.

Highest and lowest complexity topics based on percent of high complexity chats
Figure 7: Most and least complex topics based on percentage of high complexity tasks.

Travel and tourism and history and culture scored lowest in complexity, with users looking up information like flights time and latest news updates.

Demo of task complexity and topics on anonymous Copilot interactions

When should you use chat instead of search? A 2024 Microsoft Research study: The Use of Generative Search Engines for Knowledge Work and Complex Tasks, suggests that people are seeing value in technical, complex tasks such as web development and data analysis. Bing Search contained more queries with lower complexity focused on non-professional areas, like gaming and entertainment, travel and tourism, and fashion and beauty, while chat had a greater distribution of complex technical tasks. (Figure 8).

Comparison of Bing Search and Copilot in Bing topics based on complexity and knowledge work. Copilot in Bing trends greater complexity and greater knowledge work than Bing Search.
Figure 8: Comparison of Bing Search and Copilot in Bing for anonymized sample data (May-June 2023)

Conclusion

LLMs have enabled a new era of high-quality human-AI interaction, and with it, the capability to analyze those same interactions with high fidelity, at scale, and near real-time. We are now able to obtain actionable insight from complex data that is not possible with traditional data science pattern-matching methods. LLM-generated classifications are pushing research into new directions that will ultimately improve user experience and satisfaction when using chat and other user-AI interactions tools.

This analysis indicates that Copilot in Bing is enabling users to do more complex work, specifically in areas such as technology. In our next post, we will explore how Copilot in Bing is supporting professional knowledge work and how we can use these measures as indicators for retention and engagement.


FOOTNOTE: This research was conducted at the time the feature Copilot in Bing was available as part of the Bing service; since October 2024 Copilot in Bing has been deprecated in favor of the standalone Microsoft Copilot service.

References:

  1. Krathwohl, D. R. (2002). A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2 (opens in new tab)

The post Semantic Telemetry: Understanding how users interact with AI systems appeared first on Microsoft Research.

Read More

The AI Revolution in Medicine, Revisited: An Introduction

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote a book full of optimism for the potential of advanced AI models to transform the world of healthcare. What has happened since? In this special podcast series, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

In this introduction to the series, Lee talks about his early encounters with GPT-4, when the AI model was still in secret development with OpenAI, and the range of emotions he cycled through as he came to understand the new technology better. The emergence of generative AI has created a “new world,” Lee says, one he is eager to investigate with the aim of discovering the technology’s impact so far and what it means for the future of healthcare and medicine.

Transcript

[MUSIC]

PETER LEE: This is The AI Revolution in Medicine, Revisited. I’m Peter Lee, president of Microsoft Research, and I’m pretty excited to introduce this series of conversations as part of the Microsoft Research Podcast.

About two years ago, with Carey Goldberg and Zak Kohane, we wrote a book, The AI Revolution in Medicine. This was a book that was intended to educate the world of healthcare and the world of medical research about this new thing that was emerging. This idea of generative AI. And we wrote the book in secret. In fact, the whole existence of what we now know of as OpenAI’s GPT-4 AI model hadn’t been publicly disclosed or revealed to the world. And so when we were working on this book, we had to make some guesses. What is this going to mean for healthcare? If you’re a doctor or a nurse, in what ways will AI impact your work? If you’re a patient, in what ways could AI change your experience as you try to navigate a complex healthcare system?

And so now it’s been about two years. Two years hence, what did we get right? What did we get wrong? What things have come along much faster than we ever would have dreamed of? What did we miss? And what things have turned out to be much harder than we ever could have realized? And so this series of conversations is going to talk to people in the real world. We’ll delve into exactly what’s happening in the clinic, the patient experience, how people are thinking about safety and regulatory matters, and what this all means for discovery and advancements of medical science. And even then, we’ll have guests that will allow us to look into the future—the AI advances that are happening now and what is going to happen next.


[MUSIC TRANSITIONS TO SERIES THEME] [MUSIC FADES]

So now, let me just take a step back here to talk about this book project. And I’d like to just read the first couple of sentences in Chapter 1, and Chapter 1 is entitled “First Contact.” And it starts with a quote. Quote, “I think that Zak and his mother deserve better than that,” unquote. “I was being scolded. And while I’ve been scolded plenty in my life, for the first time it wasn’t a person scolding me; it was an artificial intelligence system.” So that’s how we started this book, and I wanted to read that because, at least for me, it takes me back to the kind of awe and wonderment in those early days when in secret development, we had access from OpenAI to what we now know of as GPT-4.

And what was that quote about? Well, after getting access to GPT-4, I became very interested in what this might mean for healthcare. But I, not being a doctor, knew I needed help. So I had reached out to a good colleague of mine who is a doctor, a pediatric endocrinologist, and head of the bioinformatics department at Harvard Medical School, Dr. Isaac “Zak” Kohane. And I sought his help. And in our back-and-forth discussions, one of the things that Zak shared with me was an article that he wrote for a magazine where he talked about his use of machine learning in the care of his 90-year-old mother, his 90-year-old mother, who—like many 90-year-old people—was having some health issues.

And this article was very interesting. It really went into some detail about not only the machine learning technology that Zak had created in order to help manage his mother’s health but also the kind of emotional burden of doing this and in what ways technology was helping Zak cope with that. And so as I read that article, it touched me because at that time, I was struggling in a very similar way with my own father, who was at that time 89 years old and was also suffering from some very significant health issues. And, like Zak, I was feeling some pangs of guilt because my father was living in Southern California; I was way up in the Pacific Northwest, you know, just feeling guilty not being there, present for him, through his struggles. And reading that article a thought that occurred to me was, I wonder if in the future, AI could pretend to be me so that my father could always have a version of me to talk to. And I also had the thought in the other direction. Could AI someday capture enough of my father so that when and if he passes, I always have some memory of my father that I could interact with? A strange and bizarre thought, I admit, but a natural one, I think, for any human being that’s encountering this amazing AI technology for the first time. And so I ran an experiment. I used GPT-4 to read Zak’s article and then posed the question to GPT-4, “Based on this article, could you pretend to be Zak? I’ll pretend to be Zak’s mother, and let’s test whether it’s possible to have a mother-son conversation.”

To my surprise, GPT-4’s response at that time was to scold me, basically saying that this is wrong; that this has a lot of dangers and risks. You know, what if Zak’s mother really needs the real Zak. And in those early days of this encounter with AI, that was incredibly startling. It just really forces you to reexamine yourself, and it kicked off our writing in the book as really not only being about a technology that could help lead to better diagnoses, help reduce medical errors, reduce the amount of paperwork and clerical burden that doctors go through, could help demystify and help patients navigate a healthcare system, but it could actually be a technology that forces people to reexamine their relationships and reexamine what it really means for people to take care of other people.

And since then, of course, I’ve come to learn that many people have had similar experiences in their first encounters with AI. And in fact, I’ve come to think of this as, somewhat tongue in cheek, the nine stages of AI grief. And they actually relate to what we’ll try to address in this new series of conversations.

For me, the first time that Greg Brockman and Sam Altman presented what we now know of as OpenAI’s GPT-4 to me, they made some claims about what it could do. And my first reaction was one of skepticism, and it seemed that the claims that were being made just couldn’t be true. Then that, kind of, passed into, I would say, a period of annoyance because I started to see my colleagues here in Microsoft Research start to show some amazement about the technology. I actually was annoyed because I felt they were being duped by this technology. So that’s the second phase. And then, the third phase was concern and maybe even a little bit of frustration because it became clear that, as a company here at Microsoft, we were on the verge of making a big bet on this new technology. And that was concerning to me because of my fundamental skepticism. But then I got my hands on the technology myself. And that enters into a fourth stage, of amazement. You start to encounter things that just are fundamentally amazing. This leads to a period of intensity because I immediately surmised that, wow, this could really change everything and in very few areas other than healthcare would be more important areas of change. And that is stage five, a period of serious intensity where you’re just losing sleep and working so hard to try to imagine what this all could mean. Running as many experiments as you can; trying to lean on as much real expertise as possible. You then lead from there into a period of what I call chagrin because as amazing as the technology is, actually understanding how to harness it in real life is not easy. You finally get into this stage of what I would call enlightenment. [MUSIC] And I won’t claim to be enlightened. But it is, sort of, a combination of acceptance that we are in a new world today, that things are happening for real, and that there’s, sort of, no turning back. And at that point, I think we can really get down to work. And so as we think about really the ultimate purpose of this series of conversations that we’re about to have, it’s really to help people get to that stage of enlightenment, to really, kind of, roll up our sleeves, to sit down and think through all of the best knowledge and experience that we’ve gathered over the last two years, and chart the future of this AI revolution in medicine.

[MUSIC TRANSITIONS TO SERIES THEME]

Let’s get going.

[MUSIC FADES]

The post The AI Revolution in Medicine, Revisited: An Introduction appeared first on Microsoft Research.

Read More

Advancing biomedical discovery: Overcoming data challenges in precision medicine

Advancing biomedical discovery: Overcoming data challenges in precision medicine

white line icon of a medical paper and of a computer with a person in front of it on a blue and green gradient background

Introduction

Modern biomedical research is driven by the promise of precision medicine—tailored treatments for individual patients through the integration of diverse, large-scale datasets. Yet, the journey from raw data to actionable insights is fraught with challenges. Our team of researchers at Microsoft Research in the Health Futures group, in collaboration with the Perelman School of Medicine at the University of Pennsylvania (opens in new tab), conducted an in-depth exploration of these challenges in a study published in Nature Scientific Reports. The goal of this research was to identify pain points in the biomedical data lifecycle and offer actionable recommendations to enable secure data-sharing, improved interoperability, robust analysis, and foster collaboration across the biomedical research community.

Study at a glance

A deep understanding of the biomedical discovery process is crucial for advancing modern precision medicine initiatives. To explore this, our study involved in-depth, semi-structured interviews with biomedical research professionals spanning various roles including bench scientists, computational biologists, researchers, clinicians, and data curators. Participants provided detailed insights into their workflows, from data acquisition and curation to analysis and result dissemination. We used an inductive-deductive thematic analysis to identify key challenges occurring at each stage of the data lifecycle—from raw data collection to the communication of data-driven findings.

Some key challenges identified include:

  • Data procurement and validation: Researchers struggle to identify and secure the right datasets for their research questions, often battling inconsistent quality and manual data validation.
  • Computational hurdles: The integration of multiomic data requires navigating disparate computational environments and rapidly evolving toolsets, which can hinder reproducible analysis.
  • Data distribution and collaboration: The absence of a unified data workflow and secure sharing infrastructure often leads to bottlenecks when coordinating between stakeholders across university labs, pharmaceutical companies, clinical settings, and third-party vendors.

Main takeaways and recommendations:

  1. Establishing a unified biomedical data lifecycle 

    This study highlights the need for a unified process that spans all phases of the biomedical discovery process—from data-gathering and curation to analysis and dissemination. Such a data jobs-to-be-done framework would streamline standardized quality checks, reduce manual errors such as metadata reformatting, and ensure that the flow of data across different research phases remains secure and consistent. This harmonization is essential to accelerate research and build more robust, reproducible models that propel precision medicine forward.

  2. Empowering stakeholder collaboration and secure data sharing 

    Effective biomedical discovery requires collaboration across multiple disciplines and institutions. A key takeaway from our interviews was the critical importance of collaboration and trust among stakeholders. Secure, user-friendly platforms that enable real-time data sharing and open communication among clinical trial managers, clinicians, computational scientists, and regulators can bridge the gap between isolated research silos. As a possible solution, by implementing centralized cloud-based infrastructures and democratizing data access, organizations can dramatically reduce data handoff issues and accelerate scientific discovery.

  3. Adopting actionable recommendations to address data pain points 

    Based on the insights from this study, the authors propose a list of actionable recommendations such as:

    • Creating user-friendly platforms to transition from manual (bench-side) data collection to electronic systems.
    • Standardizing analysis workflows to facilitate reproducibility, including version control and the seamless integration of notebooks into larger workflows.
    • Leveraging emerging technologies such as generative AI and transformer models for automating data ingestion and processing of unstructured text.

If implemented, the recommendations from this study would help forge a reliable, scalable infrastructure for managing the complexity of biomedical data, ultimately advancing research and clinical outcomes.

Looking ahead

At Microsoft Research, we believe in the power of interdisciplinarity and innovation. This study not only identifies the critical pain points that have slowed biomedical discovery but also illustrates a clear path toward improved data integrity, interoperability, and collaboration. By uniting diverse stakeholders around a common, secure, and scalable data research lifecycle, we edge closer to realizing individualized therapeutics for every patient.

We encourage our colleagues, partners, and the broader research community to review the full study and consider these insights as key steps toward a more integrated biomedical data research infrastructure. The future of precision medicine depends on our ability to break down data silos and create a research data lifecycle that is both robust and responsive to the challenges of big data.

Explore the full paper (opens in new tab) in Nature Scientific Reports to see how these recommendations were derived, and consider how they might integrate into your work. Let’s reimagine biomedical discovery together—where every stakeholder contributes to a secure, interoperable, and innovative data ecosystem that transforms patient care.

We look forward to engaging with the community on these ideas as we continue to push the boundaries of biomedical discovery at Microsoft Research.

The post Advancing biomedical discovery: Overcoming data challenges in precision medicine appeared first on Microsoft Research.

Read More

Magma: A foundation model for multimodal AI agents across digital and physical worlds

Magma: A foundation model for multimodal AI agents across digital and physical worlds

Gradient background transitioning from blue on the left to pink on the right. In the center, a rectangular box with ‘MAGMA’ written in bold white letters. To the left, an icon of a globe representing Earth. To the right, an icon of a computer monitor displaying a globe. Arrows connect these three elements in a circular flow, indicating interaction or data exchange between Earth, MAGMA, and the computer.

Imagine an AI system capable of guiding a robot to manipulate physical objects as effortlessly as it navigates software menus. Such seamless integration of digital and physical tasks has long been the stuff of science fiction.  

Today, Microsoft researchers are bringing that vision closer to reality with Magma (opens in new tab), a multimodal AI foundation model designed to process information and generate action proposals across both digital and physical environments. It is designed to enable AI agents to interpret user interfaces and suggest actions like button clicks, while also orchestrating robotic movements and interactions in the physical world.  

Built on the foundation model paradigm, Magma is pretrained on an expansive and diverse dataset, allowing it to generalize better across tasks and environments than smaller, task-specific models. As illustrated in Figure 1, Magma synthesizes visual and textual inputs to generate meaningful actions—whether executing a command in software or grabbing a tool in the physical world. This new model represents a significant step toward AI agents that can serve as versatile, general-purpose assistants. 

Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence to navigate complex tasks and settings.
Figure 1: Magma is one of the first foundation models that is capable of interpreting and grounding multimodal inputs within both digital and physical environments. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence to navigate complex tasks and settings.

Vision-Language-Action (VLA) models integrate visual perception, language comprehension, and action reasoning to enable AI systems to interpret images, process textual instructions, and propose actions. These models bridge the gap between multimodal understanding and real-world interaction. Typically pretrained on large numbers of VLA datasets, they acquire the ability to understand visual content, process language, and perceive and interact with the spatial world, allowing them to perform a wide range of tasks. However, due to the dramatic difference among various digital and physical environments, separate VLA models are trained and used for different environments. As a result, these models struggle to generalize to new tasks and environments outside of their training data. Moreover, most of these models do not leverage pretrained vision-language (VL) models or diverse VL datasets, which hampers their understanding of VL relations and generalizability.  

Magma, to the best of our knowledge, is one of the first VLA foundation model that can adapt to new tasks in both digital and physical environments, which helps AI-powered assistants or robots understand their surroundings and suggest appropriate actions. For example, it could enable a home assistant robot to learn how to organize a new type of object it has never encountered or help a virtual assistant generate step-by-step user interface navigation instructions for an unfamiliar task. Through Magma, we demonstrate the advantages of pretraining a single VLA model for AI agents across multiple environments while still achieving state-of-the-art results on user interface navigation and robotic manipulation tasks, outperforming previous models that are tailored to these specific domains. On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets. 

Building a foundation model that spans such different modalities has required us to rethink how we train and supervise AI agents. Magma introduces a novel training paradigm centered on two key innovations: Set-of-Mark (SoM) and Trace-of-Mark (ToM) annotations. These techniques developed by Microsoft Research, imbue the model with a structured understanding of tasks in both user interface navigation and robotic manipulation domains. 

  • Set-of-Mark (SoM): SoM is an annotated set of key objects, or interface elements that are relevant to achieving a given goal. For example, if the task is to navigate a web page, the SoM includes all the bounding boxes for clickable user interface elements. In a physical task like setting a table, the SoM could include the plate, the cup, and the position of each item on the table. By providing SoM, we give Magma a high-level hint of “what needs attention”—the essential elements of the task—without yet specifying the order or method.
Set-of-Mark prompting enables effective action grounding in images for both UI screenshot , robot manipulation and human video by having the model predict numeric marks for clickable buttons or robot arms in image space. These marks give Magma a high-level hint of “what needs attention” – the essential elements of the task.
Figure 2: Set-of-Mark (SoM) for Action Grounding. Set-of-Mark prompting enables effective action grounding in images for both UI screenshot (left), robot manipulation (middle) and human video (right) by having the model predict numeric marks for clickable buttons or robot arms in image space. These marks give Magma a high-level hint of “what needs attention” – the essential elements of the task 
  • Trace-of-Mark (ToM): In ToM we extend the strategy of “overlaying marks” from static images to dynamic videos, by incorporating tracing lines following object movements over time. While SoM highlights key objects or interface elements relevant to a task, ToM captures how these elements change or move throughout an interaction. For example, in a physical task like moving an object on a table, ToM might illustrate the motion of a hand placing the object and adjusting its position. By providing these temporal traces, ToM offers Magma a richer understanding of how actions unfold, complementing SoM’s focus on what needs attention.
Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervisions for robot manipulation and human action. It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions
Figure 3: Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervisions for robot manipulation (left) and human action (right). It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions. 

Performance and evaluation

Zero-shot agentic intelligence

Table 1: Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.
Table 1: Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.
Zero-shot evaluation on Google Robots and Bridge with SimplerEnv. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.
Figure 4: Zero-shot evaluation on Google Robots and Bridge with SimplerEnv. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.

Efficient finetuning

Table showing Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.
Table 2: Efficient finetuning on Mind2Web for web UI navigation.
Figure 5: Few-shot finetuning on Widow-X robot (left) and LIBERO (right). Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance.
Figure 5: Few-shot finetuning on Widow-X robot (left) and LIBERO (right). Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance.
Without task-specific data, Magma performs competitively and even outperforms some state-of-the-art approaches such as Video-Llama2 and ShareGPT4Video on most benchmarks, despite using much fewer video instruction tuning data.
Table 3: Without task-specific data, Magma performs competitively and even outperforms some state-of-the-art approaches such as Video-Llama2 and ShareGPT4Video on most benchmarks, despite using much fewer video instruction tuning data.

Relation to broader research

Magma is one component of a much larger vision within Microsoft Research for the future of agentic AI systems. Across various teams and projects at Microsoft, we are collectively exploring how AI systems can detect, analyze, and respond in the world to amplify human capabilities.

Earlier this month, we announced AutoGen v0.4, a fully reimagined open-source library for building advanced agentic AI systems. While AutoGen focuses on the structure and management of AI agents, Magma enhances those agents by empowering them with a new level of capability. Developers can already use AutoGen to set up an AI assistant that leverages an LLM for planning and dialogue using conventional LLMs. Now with MAGMA, if developers want to build agents that execute both physical or user interface/browser tasks, that same assistant would call upon Magma to understand the environment, perform reasoning, and take a sequence of actions to complete the task. 

The reasoning ability of Magma can be further developed by incorporating test-time search and reinforcement learning, as described in ExACT. ExACT shows an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies.

At the application level, we are also exploring new user experience (UX) powered by foundation models for the next generation of agentic AI systems. Data Formulator is a prime example. Announced late last year, Data Formulator, is an AI-driven visualization tool developed by Microsoft Research that translates high-level analytical intents into rich visual representations by handling complex data transformations behind the scenes​.  

Looking ahead, the integration of reasoning, exploration and action capabilities will pave the way for highly capable, robust agentic AI systems.

Magma is available on Azure AI Foundry Labs (opens in new tab) as well as on HuggingFace (opens in new tab) with an MIT license. Please refer to the Magma project page (opens in new tab) for more technical details. We invite you to test and explore these cutting-edge agentic model innovations from Microsoft Research.

The post Magma: A foundation model for multimodal AI agents across digital and physical worlds appeared first on Microsoft Research.

Read More

Exploring the structural changes driving protein function with BioEmu-1

Exploring the structural changes driving protein function with BioEmu-1

The image shows eight different 3D models of protein structures. Each model is color-coded with various segments in blue, green, orange, and other colors to highlight different parts of the protein.

From forming muscle fibers to protecting us from disease, proteins play an essential role in almost all biological processes in humans and other life forms alike. There has been extraordinary progress in recent years toward better understanding protein structures using deep learning, enabling the accurate prediction of protein structures from their amino acid sequences. However, predicting a single protein structure from its amino acid sequence is like looking at a single frame of a movie—it offers only a snapshot of a highly flexible molecule. Biomolecular Emulator-1 (BioEmu-1) is a deep-learning model that provides scientists with a glimpse into the rich world of different structures each protein can adopt, or structural ensembles, bringing us a step closer to understanding how proteins work. A deeper understanding of proteins enables us to design more effective drugs, as many medications work by influencing protein structures to boost their function or prevent them from causing harm.

One way to model different protein structures is through molecular dynamics (MD) simulations. These tools simulate how proteins move and deform over time and are widely used in academia and industry. However, in order to simulate functionally important changes in structure, MD simulations must be run for a long time. This is a computationally demanding task and significant effort has been put into accelerating simulations, going as far as designing custom computer architectures (opens in new tab). Yet, even with these improvements, many proteins remain beyond what is currently possible to simulate and would require simulation times of years or even decades. 

Enter BioEmu-1 (opens in new tab)—a deep learning model that can generate thousands of protein structures per hour on a single graphics processing unit. Today, we are making BioEmu-1 open-source (opens in new tab), following our preprint (opens in new tab) from last December, to empower protein scientists in studying structural ensembles with our model. It provides orders of magnitude greater computational efficiency compared to classical MD simulations, thereby opening the door to insights that have, until now, been out of reach. BioEmu-1 is featured in Azure AI Foundry Labs (opens in new tab), a hub for developers, startups, and enterprises to explore groundbreaking innovations from research at Microsoft.

on-demand event

Microsoft Research Forum Episode 4

Learn about the latest multimodal AI models, advanced benchmarks for AI evaluation and model self-improvement, and an entirely new kind of computer for AI inference and hard optimization.


We have enabled this by training BioEmu-1 on three types of data sets: (1) AlphaFold Database (AFDB) (opens in new tab) structures (2) an extensive MD simulation dataset, and (3) an experimental protein folding stability dataset (opens in new tab). Training BioEmu-1 on the AFDB structures is like mapping distinct islands in a vast ocean of possible structures. When preparing this dataset, we clustered similar protein sequences so that BioEmu-1 can recognize that a protein sequence maps to multiple distinct structures. The MD simulation dataset helps BioEmu-1 predict physically plausible structural changes around these islands, mapping out the plethora of possible structures that a single protein can adopt. Finally, through fine-tuning on the protein folding stability dataset, BioEmu-1 learns to sample folded and unfolded structures with the right probabilities.

Figure 1: BioEmu-1 predicts diverse structures of LapD protein unseen during training. We sampled structures independently and reordered the samples to create a movie connecting two experimentally known structures.

Combining these advances, BioEmu-1 successfully generalizes to unseen protein sequences and predicts multiple structures. In Figure 1, we show that BioEmu-1can predict structures of the LapD protein (opens in new tab) from Vibrio cholerae bacteria, which causes cholera. BioEmu-1 predicts structures of LapD when it is bound and unbound with c-di-GMP molecules, both of which are experimentally known but not in the training set. Furthermore, our model offers a view on intermediate structures, which have never been experimentally observed, providing viable hypotheses about how this protein functions. Insights into how proteins function pave the way for further advancements in areas like drug development.

The figure compares Molecular Dynamics (MD) simulation and BioEmu-1, and shows that BioEmu-1 can emulate the equilibrium distribution 100,000 times faster than running a MD simulation to full convergence. The middle part of the figure shows that the 2D projections of the structure distributions obtained from MD simulation and BioEmu-1 are nearly identical. The bottom part of the figure shows three representative structures from the equilibrium distribution.
Figure 2: BioEmu-1 reproduces the D. E. Shaw research (DESRES) simulation of Protein G accurately with a fraction of the computational cost. On the top, we compare the distributions of structures obtained by extensive MD simulation (left) and independent sampling from BioEmu-1 (right). Three representative sample structures are shown at the bottom.

Moreover, BioEmu-1 reproduces MD equilibrium distributions accurately with a tiny fraction of the computational cost. In Figure 2, we compare 2D projections of the structural distribution of D. E. Shaw research (DESRES) simulation of Protein G (opens in new tab) and samples from BioEmu-1. BioEmu-1 reproduces the MD distribution accurately, while requiring 10,000-100,000 times fewer GPU hours.

The left panel of the figure shows a scatter plot of the experimental folding free energies ΔG against those predicted by BioEmu-1. The plot shows a good correlation between the two. The right panel of the figure shows folded and unfolded structures of a protein.
Figure 3: BioEmu-1 accurately predicts protein stability. On the left, we plot the experimentally measured free energy differences ΔG against those predicted by BioEmu-1. On the right, we show a protein in folded and unfolded structures.

Furthermore, BioEmu-1 accurately predicts protein stability, which we measure by computing the folding free energies—a way to quantify the ratio between the folded and unfolded states of a protein. Protein stability is an important factor when designing proteins, e.g., for therapeutic purposes. Figure 3 shows the folding free energies predicted by BioEmu-1, obtained by sampling protein structures and counting folded versus unfolded protein structures, compared against experimental folding free energy measurements. We see that even on sequences that BioEmu-1 has never seen during training, the predicted free energy values correlate well with experimental values.

Professor Martin Steinegger (opens in new tab) of Seoul National University, who was not part of the study, says “With highly accurate structure prediction, protein dynamics is the next frontier in discovery. BioEmu marks a significant step in this direction by enabling blazing-fast sampling of the free-energy landscape of proteins through generative deep learning.”

We believe that BioEmu-1 is a first step toward generating the full ensemble of structures that a protein can take. In these early days, we are also aware of its limitations. With this open-source release, we hope scientists will start experimenting with BioEmu-1, helping us carve out its potentials and shortcomings so we can improve it in the future. We are looking forward to hearing how it performs on various proteins you care about.

Acknowledgements

BioEmu-1 is the result of highly collaborative team effort at Microsoft Research AI for Science. The full authors: Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Ryota Tomioka, Cecilia Clementi, Frank Noé

The post Exploring the structural changes driving protein function with BioEmu-1 appeared first on Microsoft Research.

Read More

Introducing Muse: Our first generative AI model designed for gameplay ideation

Introducing Muse: Our first generative AI model designed for gameplay ideation

Three white gaming icons on a green and blue gradient background.

Today, the journal Nature (opens in new tab) is publishing our latest research, which introduces the first World and Human Action Model (WHAM). The WHAM, which we’ve named “Muse,” is a generative AI model of a video game that can generate game visuals, controller actions, or both.

The paper in Nature offers a detailed look at Muse, which was developed by the Microsoft Research Game Intelligence (opens in new tab) and Teachable AI Experiences (opens in new tab) (Tai X) teams in collaboration with Xbox Games Studios’ Ninja Theory (opens in new tab). Simultaneously, to help other researchers explore these models and build on our work, we are open sourcing the weights and sample data and making the executable available for the WHAM Demonstrator—a concept prototype that provides a visual interface for interacting with WHAM models and multiple ways of prompting the models. Developers can learn and experiment with the weights, sample data, and WHAM Demonstrator on Azure AI Foundry (opens in new tab)

In our research, we focus on exploring the capabilities that models like Muse need to effectively support human creatives. I’m incredibly proud of our teams and the milestone we have achieved, not only by showing the rich structure of the game world that a model like Muse can learn, as you see in the video demo below, but also, and even more importantly, by demonstrating how to develop research insights to support creative uses of generative AI models.

Generated gameplay examples

10 seconds video generated by Muse. The character Gizmo from the game Bleeding Edge is attacking an enemy player, jumps forward, and then turns around.
10 seconds video generated by Muse. The character Daemon from the game Bleeding Edge destroys a Cannister, and then collects the Power Cell within. Daemon then mounts their hoverboard and moves towards another set of Cannisters to destroy them.
10 seconds video generated by Muse. The character Gizmo from the game Bleeding edge is moving forward on a hoverboard towards a group of enemies.
10 seconds video generated by Muse. The character Zero Cool from the game Bleeding edge is moving forward up a set of stairs towards a group of enemies. They then activate their ability to jump up to a higher platform.
10 seconds video generated by Muse. The character Nidhoggr from the game Bleeding edge is navigating through the game map.
10 seconds video generated by Muse. The character Makuto from the game Bleeding edge is being healed by an ally whilst they dash forwards.
10 seconds video generated by Muse. The character Miko from the game Bleeding edge is on a hoverboard moving towards a group of Cannisters.
10 seconds video generated by Muse. The character Buttercup from the game Bleeding edge is attacking players from the opposing team.
10 seconds video generated by Muse. The character Makuto from the game Bleeding edge is fleeing from a fight with enemy players.
Example gameplay sequences generated by Muse (based on WHAM-1.6B) demonstrate that our model can generate complex gameplay sequences that are consistent over several minutes. All examples shown here were generated by prompting the model with 10 initial frames (1 second) of human gameplay and the controller actions of the whole play sequence. Muse is used in “world model mode” meaning that it is used to predict how the game will evolve from the initial prompt sequence. The more closely the generated gameplay sequence resembles the actual game, the more accurately Muse has captured the dynamics of that game.

What motivated this research?

As we release our research insights and model today, I keep thinking back to how this all started.  There was a key moment back in December 2022 that I remember clearly. I had recently returned from maternity leave, and while I was away the machine learning world had changed in fundamental ways. ChatGPT had been publicly released, and those who had tried it were in awe of OpenAI’s technical achievements and the model’s capabilities. It was a powerful demonstration of what transformer-based generative models could do when trained on large amounts of (text) data. Coming back from leave at that moment, the key question on my mind was, “What are the implications of this achievement for our team’s work at the intersection of artificial intelligence and video games?”

A new research opportunity enabled by data

In our team, we had access to a very different source of data. For years, we had collaborated with Xbox Game Studios’ Ninja Theory (based in Cambridge, UK, just like our research team) to collect gameplay data from Bleeding Edge, their 2020 Xbox game. Bleeding Edge is a 4-versus-4 game where all games are played online, and matches are recorded if the player agrees to the End User License Agreement (EULA). We worked closely with our colleagues at Ninja Theory and with Microsoft compliance teams to ensure that the data was collected ethically and used responsibly for research purposes.

“It’s been amazing to see the variety of ways Microsoft Research has used the Bleeding Edge environment and data to explore novel techniques in a rapidly moving AI industry,” said Gavin Costello, technical director at Ninja Theory. “From the hackathon that started it all, where we first integrated AI into Bleeding Edge, to building AI agents that could behave more like human players, to the World and Human Action Model being able to dream up entirely new sequences of Bleeding Edge gameplay under human guidance, it’s been eye-opening to see the potential this type of technology has.” 

Muse Training Data

Current Muse instances were trained on human gameplay data (visuals and controller actions) from the Xbox game Bleeding Edge – shown here at the 300×180 px resolution at which we train current models. Muse (using WHAM-1.6B) has been trained on more than 1 billion images and controller actions, corresponding to over 7 years of continuous human gameplay.
The Game Intelligence and Teachable AI Experiences teams playing the Bleeding Edge game together.

Until that point in late 2022, we had used Bleeding Edge as a platform for human-like navigation experiments, but we had not yet made meaningful use of the large amount of human player data we now had available. With the powerful demonstration of text-models, the next question was clear: “What could we achieve if we trained a transformer-based model on large amounts of human gameplay data?” 

Scaling up model training

As the team got to work, some of the key challenges included scaling up the model training. We initially used a V100 cluster, where we were able to prove out how to scale up to training on up to 100 GPUs; that eventually paved the way to training at scale on H100s. Key design decisions we made early focused on how to best leverage insights from the large language model (LLM) community and included choices such as how to effectively represent controller actions and especially images.

The first sign that the hard work of scaling up training was paying off came in the form of a demo that thoroughly impressed me. Tim Pearce, at that time a researcher in Game Intelligence, had put together examples of what happened early versus later in training. You can see the demo here – it’s like watching the model learn. This led to our follow-up work showing how scaling laws emerge in these kinds of models.

Muse consistency over the course of training

Ground truth
Human gameplay
Game visuals generated by Muse with 206M parameters
Conditioned on 1 second of real gameplay and 9 seconds of actions
Original 10k training updates 100k training updates 1M training updates
Character recognizable ✔ ✔ ✔
Basic movements and geometry​ ✔ ✔ ✔
No degeneration over time​ ✔ ✔
Correct interaction with power cell​ ✔
Models flying mechanic correctly​ ✔
Comparing ground truth human gameplay (left) to visuals generated using Muse (using WHAM-206M) when prompted with 1 second of human gameplay (visuals and controller actions) and 9 seconds of controller actions from the ground truth. In this setting, if Muse can generate visuals that closely match the ground truth, then it has captured the game dynamics. We see that the quality of generated visuals improves visibly over the course of training. In early training (10k training updates) we see signs of life, but quality deteriorates quickly. After 100k training updates, the model is consistent over time but does not yet capture relatively less frequent aspects of the game dynamics, such as the flying mechanic. Consistency with the ground truth continues to improve with additional training, e.g., the flying mechanic is captured after 1M training updates.

Multidisciplinary collaboration: Involving users from the beginning

We had started to investigate how to evaluate these types of models early on. For example, we wanted to understand the representations learned using linear probing, which was driven by Research Intern Gunshi Gupta and Senior Research Scientist Sergio Valcarcel Macua; to explore online evaluation, driven by Senior Research Scientist Raluca Georgescu; and to generate both visuals and actions, initially termed “full dreaming” and driven by Research Intern Tarun Gupta. But working through how to systematically evaluate Muse required a much broader set of insights. More importantly, we needed to understand how people might use these models in order to know how to evaluate them.  

This was where the opportunity for multidisciplinary research became crucial. We had discussed aspects of this work with Senior Principal Research Manager Cecily Morrison and her Teachable AI Experiences team for several months. And we had already partnered on an engagement with game creatives (driven by Cecily, Design Researcher Linda Wen, and Principal Research Software Development Engineer Martin Grayson) to investigate how game creators would like to use generative AI capabilities in their creative practice.

“It was a great opportunity to join forces at this early stage to shape model capabilities to suit the needs of creatives right from the start, rather than try to retrofit an already developed technology,” Cecily said. 

Linda offered some valuable insights about how we approached the work: “We’ve seen how technology-driven AI innovation has disrupted the creative industry—often catching creators off guard and leaving many feeling excluded,” she said. “This is why we invited game creators to help us shape this technology from the start. Recognizing that most AI innovations are developed in the Global North, we also made it a priority to recruit game creators from underrepresented backgrounds and geographies. Our goal was to create a technology that benefits everyone—not just those already in positions of privilege.” 

Unlocking new creative use cases with the WHAM Demonstrator

Now, with the model’s emerging capabilities and user insights in mind, it was time to put all the pieces together. The teams joined forces during a Microsoft internal hackathon to explore new interaction paradigms and creative uses that Muse could unlock. As a result, we developed a prototype that we call the WHAM Demonstrator, which allows users to directly interface with the model.

“The Global Hackathon was the perfect opportunity for everyone to come together and build our first working prototype,” Martin said. “We wanted to develop an interface for the WHAM model that would allow us to explore its creative potential and start to test ideas and uses we had learned from our interviews with game developers.” 

WHAM Demonstrator

For interacting with World and Human Action Models like Muse, the WHAM Demonstrator provides a visual interface for interacting with a WHAM instance.

In this example, the user is loading a visual as an initial prompt to the model, here a single promotional image for the game Bleeding Edge. They use Muse to generate multiple potential continuations from this starting point.
The user explores the generated sequences and can tweak them, for example using a game controller to direct the character. These features demonstrate how Muse’s capabilities can enable iteration as part of the creative process.

Identifying key capabilities and how to evaluate them

The hands-on experience of exploring Muse capabilities with the WHAM Demonstrator, and drawing on insights we gained from the user study, allowed us to systematically identify capabilities that game creatives would require to use generative models like Muse. This in turn allowed us to establish evaluation protocols for three key capabilities: consistency, diversity, and persistency. Consistency refers to a model’s ability to generate gameplay sequences that respect the dynamics of the game. For example, the character moves consistently with controller actions, does not walk through walls, and generally reflects the physics of the underlying game. Diversity refers to a model’s ability to generate a range of gameplay variants given the same initial prompt, covering a wide range of ways in which gameplay could evolve. Finally, persistency refers to a model’s ability to incorporate (or “persist”) user modifications into generated gameplay sequences, such as a character that is copy-pasted into a game visual. We give an overview of these capabilities below. 

Muse evaluation of consistency, diversity and persistency

Consistency

We evaluate consistency by prompting the model with ground truth gameplay sequences and controller actions, and letting the model generate game visuals. The videos shown here are generated using Muse (based on WHAM-1.6B) and demonstrate the model’s ability to generate consistent gameplay sequences of up to two minutes. In our paper, we also compare the generated visuals to the ground truth visuals using FVD (Fréchet Video Distance), an established metric in the video generation community.

Diversity

Muse (based on WHAM-1.6B) generated examples of behavioral and visual diversity, conditioned on the same initial 10 frames (1 second) of real gameplay. The three examples at the top show behavioral diversity (diverse camera movement, loitering near the spawn location, and navigating various paths to the middle jump pad). The three examples below show visual diversity (different hoverboards for the character). In the paper, we also quantitatively assess diversity using the Wasserstein distance, a measure of distance between two distributions, to compare the model-generated sequences to the diversity reflected in human gameplay recordings. Muse generated examples of behavioral and visual diversity, conditioned on the same 10 frames of real gameplay. Three examples of behavioral diversity show diverse camera movement, loitering near the spawn location, and navigating various paths to the middle jump pad. Three examples of visual diversity show different hoverboards for the character.

With our evaluation framework in place, and access to an H100 compute allocation, the team was able to further improve Muse instances, including higher resolution image encoders (our current models generate visuals at a resolution of 300×180 pixels, up from the 128×128 resolution of our earliest models) and larger models, and expand to all seven Bleeding Edge maps. To show some of the capabilities of the model we are publishing today, we have included videos of 2-minute-long generated gameplay sequences above, which give an impression of the consistency and diversity of gameplay sequences that the model can generate.

According to Senior Researcher Tabish Rashid: “Being handed an allocation of H100s was initially quite daunting, especially in the early stages figuring out how to make best use of it to scale to larger models with the new image encoders. After months of experimentation, it was immensely rewarding to finally see outputs from the model on a different map (not to knock the lovely greenery of Skygarden) and not have to squint so much at smaller images. I’m sure at this point many of us have watched so many videos from Muse that we’ve forgotten what the real game looks like.”

One of my favorite capabilities of the model is how it can be prompted with modifications of gameplay sequences and persist newly introduced elements. For example, in the demo below, we’ve added a character onto the original visual from the game. Prompting the model with the modified visual, we can see how the model “persists” the added character and generates plausible variants of how the gameplay sequence could have evolved from this modified starting point.

Persistency

Demonstrations of how Muse (based on WHAM-1.6B) can persist modifications. A visual is taken from the original gameplay data and an image of an additional character is edited into the image. The generated gameplay sequence shows how the character is adapted into the generated gameplay sequence.

Conclusion

Today, our team is excited to be publishing our work in Nature and simultaneously releasing Muse open weights, the WHAM Demonstrator, and sample data to the community.

I look forward to seeing the many ways in which the community will explore these models and build on our research. I cannot wait to see all the ways that these models and subsequent research will help shape and increase our understanding of how generative AI models of human gameplay may support gameplay ideation and pave the way for future, novel, AI-based game experiences, including the use cases that our colleagues at Xbox (opens in new tab) have already started to explore.

The post Introducing Muse: Our first generative AI model designed for gameplay ideation appeared first on Microsoft Research.

Read More