AIOpsLab: Building AI agents for autonomous clouds

In our increasingly complex digital landscape, enterprises and cloud providers face significant challenges in the development, deployment, and maintenance of sophisticated IT applications. The broad adoption of microservices and cloud-based serverless architecture has streamlined certain aspects of application development while simultaneously introducing a host of operational difficulties, particularly in fault diagnosis and mitigation. These complexities can result in outages, which have the potential to cause major business disruptions, underscoring the critical need for robust solutions that ensure high availability and reliability in cloud services. As the expectation for five-nines availability grows, organizations must navigate the intricate web of operational demands to maintain customer satisfaction and business continuity. 

To tackle these challenges, recent research on using AIOps agents for cloud operations, such as AI agents for incident root cause analysis (RCA) or triaging, has relied on proprietary services and datasets. Other prior work uses frameworks specific to the solutions being built, or ad hoc, static benchmarks and metrics that fail to capture the dynamic nature of real-world cloud services. Furthermore, current approaches do not agree on standard metrics or a standard taxonomy for operational tasks. This calls for a standardized and principled research framework for building, testing, comparing, and improving AIOps agents. The framework should allow agents to interact with realistic service operation tasks in a reproducible manner. It must be flexible enough to extend to new applications, workloads, and faults. Importantly, it should go beyond evaluating AI agents and enable users to improve the agents themselves, for example by providing sufficient observability and even serving as a training environment (“gym”) that generates samples to learn from. Users developing agents for cloud operations tasks with Azure AI Agent Service, for example, can evaluate and improve them using AIOpsLab.

We developed AIOpsLab, a holistic evaluation framework for researchers and developers that enables the design, development, evaluation, and enhancement of AIOps agents and also serves as a reproducible, standardized, interoperable, and scalable benchmark. AIOpsLab is open-sourced on GitHub under the MIT license, so researchers and engineers can use it to evaluate AIOps agents at scale. The AIOpsLab research paper has been accepted at SoCC’24 (the annual ACM Symposium on Cloud Computing). 

Figure 1. System architecture of AIOpsLab. AIOps tasks defined over applications such as SocialNetwork, HotelReservation, and E-Commerce (each with associated data, actions, and metrics) connect to a central orchestrator. The orchestrator receives a problem query (task, workload, fault, solution), invokes the workload and fault generators via the problem cache, deploys workloads and injects faults into the service, and relays the service's observability data (traces, metrics, logs) to the agent, whose actions it executes.

Agent-cloud interface (ACI)

AIOpsLab strictly separates the agent from the application service using an intermediate orchestrator, which exposes interfaces that other parts of the system can integrate with and extend. First, the orchestrator establishes a session with an agent to share information about benchmark problems: (1) the problem description, (2) instructions (e.g., response format), and (3) available APIs to call as actions.

The APIs are a set of documented tools, e.g., get_logs, get_metrics, and exec_shell, designed to help the agent solve a task. There are no restrictions on the agent’s implementation; the orchestrator poses problems and polls the agent for the next action to perform given the previous result. Each action must be a valid API call, which the orchestrator validates and carries out. The orchestrator has privileged access to the deployment and can take arbitrary actions (e.g., scale-up, redeploy) using appropriate tools (e.g., helm, kubectl) to resolve problems on behalf of the agent. Lastly, the orchestrator calls workload and fault generators to create service disruptions, which serve as live benchmark problems. AIOpsLab provides additional APIs for extending to new services and generators. 
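
For illustration, the sketch below shows roughly how such a poll-validate-execute loop could be driven. The helper names (`validate_api_call`, `execute_action`, `is_resolved`) are assumptions for this sketch, not AIOpsLab's actual internals.

```
# Minimal sketch of the orchestrator's action loop (assumed helper names,
# not the actual AIOpsLab implementation).
async def run_episode(orchestrator, agent, problem, max_steps=10):
    state = problem.description  # initial context given to the agent
    for _ in range(max_steps):
        action = await agent.get_action(state)  # e.g., 'get_logs("svc", "ns")'
        if not orchestrator.validate_api_call(action):  # hypothetical validator
            state = "Invalid action: please call one of the documented APIs."
            continue
        state = orchestrator.execute_action(action)  # hypothetical executor
        if problem.is_resolved(state):  # hypothetical success oracle
            break
```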


Example: onboarding an agent to AIOpsLab

import asyncio

from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        # Build the prompt from the problem description, instructions, and
        # available action APIs (set_prompt is defined by the agent author).
        self.prompt = self.set_prompt(prob, instructs, apis)
        self.llm = GPT4()  # placeholder for any LLM client

    async def get_action(self, state: str) -> str:
        # Return the next action given the current state of the environment.
        return self.llm.generate(self.prompt + state)

# Initialize the orchestrator and load a benchmark problem
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

# Register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))

Service

AIOpsLab abstracts a diverse set of services to reflect the variance in production environments. These are live, running services built on different architectural principles, including microservices, serverless functions, and monoliths.

We also leverage open-sourced application suites such as DeathStarBench as they provide artifacts, like source code and commit history, along with run-time telemetry. Adding tools like BluePrint can help AIOpsLab scale to other academic and production services. 

Workload generator

The workload generator in AIOpsLab plays a crucial role by creating simulations of both faulty and normal scenarios. It receives specifications from the orchestrator, such as the task, desired effects, scale, and duration. The generator can use a model trained on real production traces to generate workloads that align with these specifications. Faulty scenarios may simulate conditions like resource exhaustion, exploit edge cases, or trigger cascading failures, inspired by real incidents. Normal scenarios mimic typical production patterns, such as daily activity cycles and multi-user interactions. When various characteristics (e.g., service calls, user distribution, arrival times) can lead to the desired effect, multiple workloads can be stored in the problem cache for use by the orchestrator. In coordination with the fault generator, the workload generator can also create complex fault scenarios with workloads.  
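
As a concrete illustration, a workload request from the orchestrator might look like the following sketch; the field names and values are assumptions for this example rather than AIOpsLab's actual schema.

```
# Hypothetical workload specification passed to the workload generator;
# field names and values are illustrative only.
workload_spec = {
    "application": "HotelReservation",
    "effect": "resource_exhaustion",        # desired faulty condition
    "scale": {"requests_per_second": 500},  # load intensity
    "duration_s": 300,                      # how long to sustain the load
    "pattern": "diurnal",                   # mimic daily activity cycles
}
# Candidate workloads that satisfy the spec can be stored in the problem
# cache for the orchestrator to replay later.
```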

Fault generator

AIOpsLab has a novel push-button fault generator designed for generic applicability across various cloud scenarios. Our approach integrates application and domain knowledge to create adaptable policies and “oracles” compatible with AIOps scenarios. This includes fine-grained fault injection capable of simulating complex failures inspired by production incidents. Additionally, it can inject faults at various system levels, exposing root causes while maintaining semantic integrity and considering interdependencies between cloud microservices. The fault injector’s versatility can enhance the reliability and robustness of cloud systems by enabling thorough testing and evaluation of AIOps capabilities. 
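
To make this concrete, a fault-injection request might look like the sketch below; the function and field names are hypothetical and stand in for whatever interface the fault generator exposes.

```
# Hypothetical fault-injection request; names are illustrative only.
fault_spec = {
    "fault_type": "misconfig",                 # e.g., wrong port in a service config
    "target": "user-service",                  # microservice to perturb
    "level": "application",                    # could also be container, network, ...
    "dependencies": ["compose-post-service"],  # downstream services to consider
}
inject_fault(fault_spec)                       # assumed entry point
```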

Observability

AIOpsLab is equipped with an extensible observability layer designed to provide comprehensive monitoring across system layers for any AIOps tool. It collects a wide array of telemetry data, including (1) traces from Jaeger detailing the end-to-end paths of requests through distributed systems, (2) application logs formatted and recorded by Filebeat and Logstash, and (3) system metrics monitored by Prometheus. AIOpsLab also captures lower-level system information such as syscall logs and cluster information. Potential data overload is handled through flexible APIs that let users tune which telemetry is exposed to the AIOps tools. 
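
For example, a monitoring layer built on Prometheus can be queried over its standard HTTP API, as in the sketch below; the Prometheus address, namespace, and metric labels are assumptions about a particular deployment rather than AIOpsLab defaults.

```
import requests

# Query per-pod CPU usage from Prometheus's HTTP API; the address and
# label values below are assumptions about one particular cluster.
PROM = "http://prometheus.monitoring.svc:9090"
query = 'rate(container_cpu_usage_seconds_total{namespace="test-social-network"}[5m])'
resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("pod"), series["value"])  # value = [timestamp, "cpu"]
```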

AIOpsLab currently supports four key AIOps tasks: incident detection, localization, root cause diagnosis, and mitigation. It also supports several popular agent frameworks, including ReAct, AutoGen, and TaskWeaver. Two key insights from the study emphasize the importance of observability and a well-designed ACI. Observability is crucial for clear root-cause diagnosis; for instance, pinpointing a misconfigured API gateway can be vital in preventing service downtime.

Flexibility is another critical factor; the ability to execute arbitrary shell commands allowed for effective troubleshooting in real-time scenarios. Lastly, robust error handling is essential—providing agents with high-quality feedback on execution barriers, such as a failed database connection, ensures swift resolution and continuous improvement. 


Example: ACI interaction on a diagnosis task

Agent: I need to first look at the logs to understand if there are any errors or warnings that could indicate problems. 

Action:

```
get_logs("compose-post-service", "test-social-network")
```

Service: [2024-Aug-04 23:18:49.365494] <info>:

…

Thrift: Sun Aug 4 23:19:19 2024 TSocket::open() connect()
<Host: user-service Port: 9090>: Connection refused

Agent: … unable to connect to the `user-service` at port 9090, even though the pod is running. This suggests a possible network issue or misconfiguration in service discovery.

Next steps

This research project adopts Microsoft security standards and Responsible AI principles, and we envision this research evolving into a vital resource for organizations aiming to optimize their IT operations. Additionally, we plan to collaborate with various generative AI teams to incorporate AIOpsLab as a benchmark scenario for evaluating state-of-the-art models. By doing so, we aim to foster innovation and encourage the development of more advanced AIOps solutions. This research is essential not only for IT professionals but also for anyone invested in the future of technology, as it has the potential to redefine how organizations manage operations, respond to incidents, and ultimately serve their customers in an increasingly automated world. 

Acknowledgements

We would like to thank Yinfang Chen, Manish Shetty, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, and Suman Nath for contributing to this project.

Ideas: AI and democracy with Madeleine Daepp and Robert Osazuwa Ness

Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets.

In 2024, with advancements in generative AI continuing to reach new levels and the world experiencing its “biggest election year in history,” could there possibly be a better time to examine the technology’s emerging role in global democracies? Inspired by the moment, senior researchers Madeleine Daepp and Robert Osazuwa Ness conducted research in Taiwan, studying the technology’s influence on disinformation, and in India, documenting its impact on digital communications more broadly. In this episode, Daepp and Ness join guest host Ginny Badanes, general manager of the Democracy Forward program at Microsoft. They discuss how leveraging commonly understood language such as fraud can help people understand potential risks associated with generative AI; the varied ways in which Daepp and Ness saw the tech being deployed to promote or discredit candidates; and the opportunities for the technology to be a force for fortifying democracy.

Learn more:  

Video will kill the truth if monitoring doesn’t improve, argue two researchers
The Economist, March 2024

Microsoft Research Special Projects
Group homepage

Democracy Forward
Program homepage, Microsoft Corporate Social Responsibility

As the US election nears, Russia, Iran and China step up influence efforts
Microsoft On the Issues blog, October 2024

Combatting AI Deepfakes: Our Participation in the 2024 Political Conventions
Microsoft On the Issues blog, July 2024

China tests US voter fault lines and ramps AI content to boost its geopolitical interests
Microsoft On the Issues, April 2024

Project Providence
Project homepage

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

MADELEINE DAEPP: Last summer, I was working on all of these like pro-democracy applications, trying to build out, like, a social data collection tool with AI, all this kind of stuff. And I went to the elections workshop that the Democracy Forward team at Microsoft had put on, and Dave Leichtman, who, you know, was the MC of that work, was really talking about how big of a global elections year 2024 was going to be. Over 70 countries around the world. And, you know, we’re coming from Microsoft Research, where we were so excited about this technology. And then, all of a sudden, I was at the elections workshop, and I thought, oh no, [LAUGHS] like, this is not good timing.

ROBERT OSAZUWA NESS: What are we really talking about in the context of deepfakes in the political context, elections context? It’s deception, right. I’m trying to use this technology to, say, create some kind of false record of events in order to convince people that something happened that actually did not happen. And so that goal of deceiving, of creating a false record, that’s kind of how I have been thinking about deepfakes in contrast to the broader category of generative AI.

[TEASER ENDS]

GINNY BADANES: Welcome to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.


[MUSIC FADES]

I’m your guest host, Ginny Badanes, and I lead Microsoft’s Democracy Forward program, where we’ve spent the past year deeply engaged in supporting democratic elections around the world, including the recent US elections. We have been working on everything from raising awareness of nation-state propaganda efforts to helping campaigns and election officials prepare for deepfakes to protecting political campaigns from cyberattacks. Today, I’m joined by two researchers who have also been diving deep into the impact of generative AI on democracy.

Microsoft senior researchers Madeleine Daepp and Robert Osazuwa Ness are studying generative AI’s influence in the political sphere with the goal of making AI systems more robust against misuse while supporting the development of AI tools that can strengthen democratic processes and systems. They spent time in Taiwan and India earlier this year, where both had big democratic elections. Madeleine and Robert, welcome to the podcast!

MADELEINE DAEPP: Thanks for having us.

ROBERT OSAZUWA NESS: Thanks for having us.

BADANES: So I have so many questions for you all—from how you conducted your research to what you’ve learned—and I’m really interested in what you think comes next. But first, let’s talk about how you got involved in this in the first place. Could you both start by telling me a little bit about your backgrounds and just what got you into AI research in the first place?

DAEPP: Sure. So I’m a senior researcher here at Microsoft Research in the Special Projects team. But I did my PhD at MIT in urban studies and planning. And I think a lot of folks hear that field and think, oh, you know, housing, like upzoning housing and figuring out transportation systems. But it really is a field that’s about little “d” democracy, right. About how people make choices about shared public spaces every single day. You know, I joined Microsoft first off to run this, sort of, technology deployment in the city of Chicago, running a low-cost air-quality-sensor network for the city. And when GPT-4 came out, you know, first ChatGPT, and then we, sort of, had this big recognition of, sort of, how well this technology could do in summarizing and in representing opinions and in making sense of big unstructured datasets, right. I got actually very excited. Like, I thought this could be used for town planning processes. [LAUGHS] Like, I thought we could … I had a whole project with a wonderful intern, Eva Maxfield Brown, looking at, can we summarize planning documents using AI? Can we build out policies from conversations that people have in shared public spaces? And so that was very much the impetus for thinking about how to apply and build things with this amazing new technology in these spaces.

BADANES: Robert, I think your background is a little bit different, yet you guys ended up in a similar place. So how did you get there?

NESS: Yeah, so I’m also on Special Projects, Microsoft Research. My work is focusing on large language models, LLMs. And, you know, so I focus on making these models more reliable and controllable in real-world applications. And my PhD is in statistics. And so I focus a lot on using just basic bread-and-butter statistical methods to try and control and understand LLM behavior. So currently, for example, I’m leading a team of engineers and running experiments designed to find ways to enhance a graphical approach to combining information retrieval in large language models. I work on statistical tests for testing significance of adversarial attacks on these models.

BADANES: Wow.

NESS: So, for example, if you find a way to trick one of these models into doing something it’s not supposed to do, I make sure that it’s not, like, a random fluke; that it’s something that’s reproducible. And I also work at this intersection between generative AI and, you know, Bayesian stuff, causal inference stuff. And so I came at looking at this democracy work through an alignment lens. So alignment is this task in AI of making sure these models align with human values and goals. And what I was seeing was a lot of research in the alignment space was viewing it as a technical problem. And, you know, as a statistician, we’re trained to consult, right. Like, to go to the actual stakeholders and say, hey, what are your goals? What are your values? And so this democracy work was an opportunity to do that in Microsoft Research and connected with Madeleine. So she was planning to go to Taiwan, and kind of from a past life, I wanted to become a trade economist and learned Mandarin. And so I speak fluent Mandarin and seemed like a good matchup of our skill sets …

BADANES: Yeah.

NESS: … and interests. And so that’s, kind of, how we got started.

BADANES: So, Madeleine, you brought the two of you together, but what started it for you? This podcast is all about big ideas. What sparked the big idea to bring this work that you’ve been doing on generative AI into the space of democracy and then to go out and find Robert and match up together?

DAEPP: Yeah, well, Ginny, it was you. [LAUGHS] It was actually your team.

BADANES: I didn’t plant that! [LAUGHS]

DAEPP: So, you know, I think last summer, I was working on all of these like pro-democracy applications, trying to build out, like, a social data collection tool with AI, all this kind of stuff. And I went to the elections workshop that the Democracy Forward team at Microsoft had put on, and Dave Leichtman, who, you know, was the MC of that work, was really talking about how big of a global elections year 2024 was going to be, that this—he was calling it “Votorama.” You know, that term didn’t take off. [LAUGHTER] The term that has taken off is biggest election year in history, right. Over 70 countries around the world. And, you know, we’re coming from Microsoft Research, where we were so excited about this technology. Like, when it started to pass theory of mind tests, right, which is like the ability to think about how other people are thinking, like, we were all like, oh, this is amazing; this opens up so many cool application spaces, right. When it was, like, passing benchmarks for multilingual communication, again, like, we were so excited about the prospect of building out multilingual systems. And then, all of a sudden, I was at the elections workshop, and I thought, oh no, [LAUGHS] this is not good timing.

BADANES: Yeah …

DAEPP: And because so much of my work focuses on, you know, building out computer science systems like, um, data science systems or AI systems but with communities in the loop, I really wanted to go to the folks most affected by this problem. And so I proposed a project to go to Taiwan and to study one of the … it was the second election of 2024. And Taiwan is known to be subject to more external disinformation than any other place in the world. So if you were going to see something anywhere, you would see it there. Also, it has amazing civil society response, so really interesting people to talk to. But I do not speak Chinese, right. Like, I don’t have the context; I don’t speak the language. And so part of my process is to hire a half-local team. We had an amazing interpreter, Vickie Wang, and then a wonderful graduate student, Ti-Chung Cheng, who supported this work. But then also my team, Special Projects, happened to have this person who, like, not only is a leading AI researcher publishing in NeurIPS, like building out these systems, but who also spoke Chinese, had worked in technology security, and had a real understanding of international studies and economics as well as AI. And so for me, like, finding Robert as a collaborator was kind of a unicorn moment.

BADANES: So it sounds like it was a match made in heaven of skill sets and abilities. Before we get into what you all found there, which I do want to get into, I first think it’s helpful—I don’t know, when we’re dealing with these, like, complicated issues, particularly things that are moving and changing really quickly, sometimes I found it’s helpful to agree on definitions and sort of say, this is what we mean when we say this word. And that helps lead to understanding. So while I know that this research is about more than deepfakes—and we’ll talk about some of the things that are more than deepfakes—I am curious how you all define that term and how you think of it. Because this is something that I think is constantly moving and changing. So how have you all been thinking about the definition of that term?

NESS: So I’ve been thinking about it in terms of the intention behind it, right. We say deepfake, and I think colloquially that means kind of all of generative AI. That’s a bit unfortunate because there are things that are … you know, you can use generative AI to generate cartoons …

BADANES: Right.

NESS: … or illustrations for a children’s book. And so in thinking about what are we really talking about in the context of deepfakes in the political context, elections context, it’s deception, right. I’m trying to use this technology to, say, create some kind of false record of events, say, for example, something that a politician says, in order to convince people that something happened that actually did not happen.

BADANES: Right.

NESS: And so that goal of deceiving, of creating a false record, that’s kind of how I have been thinking about deepfakes in contrast to the broader category of generative AI and deepfakes in terms of being a malicious use case. There are other malicious use cases that don’t necessarily have to be deceptive, as well, as well as positive use cases.

BADANES: Well, that really, I mean, that resonates with me because what we found was when you use the term deception—or another term we hear a lot that I think works is fraud—that resonates with other people, too. Like, that helps them distinguish between neutral uses or even positive uses of AI in this space and the malicious use cases, though to your point, I suppose there’s probably even deeper definitions of what malicious use could look like. Are you finding that distinction showing up in your work between fraud and deception in these use cases? Is that something that has been coming through?

DAEPP: You know, we didn’t really think about the term fraud until we started prepping for this interview with you. As Robert said, so much of what we were thinking about in our definition was this representation of people or events, you know, done in order to deceive and with malicious intent. But in fact, in all of our conversations, no matter who we were talking to, no matter what political bent, no matter, you know, national security, fact-checking, et cetera, you know, they all agreed that using AI for the purposes of scamming somebody financially was not OK, right. That’s fraud. Using AI for the purposes of nudifying, like removing somebody’s clothes and then sextorting them, right, extorting them for money out of fear that this would be shared, like, that was not OK. And those are such clear lines. And it was clear that there’s a set of uses of generative AI also in the political space, you know, of saying this person said something that they didn’t, …

BADANES: Mm-hmm.

DAEPP: … of voter suppression, that in general, there’s a very clear line that when it gets into that fraudulent place, when it gets into that simultaneously deceptive and malicious space, that’s very clearly a no-go zone.

NESS: Oftentimes during this research, I found myself thinking about this dichotomy in cybersecurity of state actors, or broadly speaking, kind of, political actors, versus criminals.

BADANES: Right.

NESS: And it’s important to understand the distinction because criminals are typically trying to target targets of opportunity and make money, while state-sponsored agents are willing to spend a lot more money and have very specific targets and have a very specific definition of success. And so, like, this fraud versus deception kind of feels like that a little bit in the sense that fraud is typically associated with criminal behavior, while, say, I might put out deceptive political messaging, but it might fall within the bounds of free speech within my country.

BADANES: Right, yeah.

NESS: And so this is not to say I disagree with that, but it just, actually, that it could be a useful contrast in terms of thinking about the criminal versus the political uses, both legitimate and illegitimate.

BADANES: Well, I also think those of us who work in the AI space are dealing in very complicated issues that the majority of the world is still trying to understand. And so any time you can find a word that people understand immediately in order to do the, sort of, storytelling: the reason that we are worried about deepfakes in elections is because we do not want voters to be defrauded. And that, we find really breaks through because people understand that term already. That’s a thing that they already know that they don’t want to be; they do not want to be defrauded in their personal life or in how they vote. And so that really, I found, breaks through. But as much as I have talked about deepfakes, I know that you—and I know there’s a lot of interest in talking about deepfakes when we talk about this subject—but I know your research goes beyond that. So what other forms of generative AI did you include in your research or did you encounter in the effort that you were doing both in Taiwan and India?

DAEPP: Yeah. So let me tell you just, kind of, a big overview of, like, our taxonomy. Because as you said, like, so much of this is just about finding a word, right. Like, so much of it is about building a shared vocabulary so that we can start to have these conversations. And so when we looked at the political space, right, elections, so much of what it means to win an election is kind of two things. It’s building an image of a candidate, right, or changing the image of your opposition and telling a story, right.

BADANES: Mm-hmm.

DAEPP: And so if you think about image creation, of course, there are deepfakes. Like, of course, there are malicious representations of a person. But we also saw a lot of what we’re calling auth fakes, like authorized fakes, right. Candidates who would actually go to a consultancy and, like, get their bodies scanned so that videos could be made of them. They’d get their voices, a bunch of snippets of their voices, recorded so that then there could be personalized phone calls, right. So these are authorized uses of their image and likeness. Then we saw a term I’ve heard in, sort of, the ether is soft fakes. So again, likenesses of a candidate, this time not necessarily authorized but promotional. They weren’t … people on Twitter—I guess, X—on Instagram, they were sharing images of the candidate that they supported that were really flattering or silly or, you know, just really sort of in support of that person. So not with malicious intent, right, with promotional intent. And then the last one, and this, I think, was Robert’s term, but in this image creation category, you know, one thing we talked about was just the way that people were also making fun of candidates. And in this case, this is a bit malicious, right. Like, they’re making fun of people; they’re satirizing them. But it’s not deceptive because, …

BADANES: Right …

DAEPP: … you know, often it has that hyper-saturated meme aesthetic. It’s very clearly AI or just, you know, per like, sort of, US standards for satire, like, a reasonable person would know that it was silly. And so Robert said, you know, oh, these influencers, they’re not trying to deceive people; like, they’re not trying to lie about candidates. They’re trying to roast them. [LAUGHTER] And so we called it a deep roast. So that’s, kind of, the images of candidates. I will say we also looked at narrative building, and there, one really important set of things that we saw was what we call text to b-roll. So, you know, a lot of folks think that you can’t really make AI videos because, like, Sora isn’t out yet[1]. But in fact, what there is a lot of is tooling to, sort of, use AI to pull from stock imagery and b-roll footage and put together a 90-second video. You know, it doesn’t look like AI; it’s a real video. So text to b-roll, AI pasta? So if you know the threat intelligence space, there’s this thing called copy pasta, where people just …

BADANES: Sure.

DAEPP: … it’s just a fun word for copy-paste. People just copy-paste terms in order to get a hashtag trending. And we talked to an ex-influencer who said, you know, we’re using AI to do this. And I asked him why. And he said, well, you know, if you just do copy-paste, the fact-checkers catch it. But if you use AI, they don’t. And so AI pasta. And there’s also some research showing that this is potentially more persuasive than copy-paste …

BADANES: Interesting.

DAEPP:  … because people think there’s a social consensus. And then the last one, this is my last of the big taxonomy, and, Robert, of course, jump in on anything you want to go deeper on, but Fake News 2.0. You know, I’m sure you’ve seen this, as well. Just this, like, creation of news websites, like entire new newspapers that nobody’s ever heard of. AI avatars that are newscasters. And this is something that was happening before. Like, there’s a long tradition of pretending to be a real news pamphlet or pretending to be a real outlet. But there’s some interesting work out of … Patrick Warren at Clemson has looked at some of these and shown the quality and quantity of articles on these things has gotten a lot better and, you know, improves as a step function of, sort of, when new models come out.

NESS: And then on the flip side, you have people using the same technologies but stated clearly that it’s AI generated, right. So we mentioned the AI avatars. In India, there’s this … there’s Bhoomi, which is an AI news anchor for agricultural news, and it states there in clear terms that she’s not real. But of course, somebody who wanted to be deceptive could use the same technology to portray something that looks like a real news broadcast that isn’t. You know, and, kind of, going back, Madeleine mentioned deep roasts, right, so, kind of, using this technology to create satirical depictions of, say, a political opponent. Somebody, a colleague, sent something across my desk. It was a Douyin account—so Douyin is the version of TikTok that’s used inside China; …

BADANES: OK.

NESS: … same company, but it’s the internal version of TikTok—that was posting AI-generated videos of politicians in Taiwan. And these were excellent, real good-quality AI-generated deepfakes of these politicians. But some of them were, first off, on the bottom of all of them, it said, this is AI-generated content.

BADANES: Oh.

NESS: And some of them were, kind of, obviously meant to be funny and were clearly fake, like still images that were animated to make somebody singing a funny song, for example. A very serious politician singing a very silly song. And it’s a still image. It’s not even, it’s not even …

BADANES: a video.

NESS: …like video.

BADANES: Right, right.

NESS: And so I messaged Puma Shen, who is one of the legislators in Taiwan who was targeted by these attacks, and I said, what do you think about this? And, you know, he said, yeah, they got me. [LAUGHTER] And I said, you know, do you think people believe this? I mean, there are people who are trying to debunk it. And he said, no, our supporters don’t believe it, but, you know, people who support the other side or people who are apolitical, they might believe it, or even if it says it’s fake—they know it’s fake—but they might still say that, yeah, but this is something they would do, right. This is …

BADANES: Yeah, it fits the narrative. Yeah.

NESS: … it fits the narrative, right. And that, kind of, that really, you know, I had thought of this myself, but just hearing somebody, you know, who’s, you know, a politician who’s targeted by these attacks just saying that it’s, like, even if they believe it’s … even if they know it’s fake, they still believe it because it’s something that they would do.

BADANES: Sure.

NESS: That’s, you know, as a form of propaganda, even relative to the canonical idea of deepfake that we have, this could be more effective, right. Like, just say it’s AI and then use it to, kind of, paint the picture of the opponent in any way you like.

BADANES: Sure, and this gets into that, sort of, challenging space I think we find ourselves in right now, which is people don’t know necessarily how to tell what’s real or not. And the case you’re describing, it has labeling, so that should tell you. But a lot of the content we come across online does not have labeling. And you cannot tell just based on your eyes whether images were generated by AI or whether they’re real. One of the things that I get asked a lot is, why can’t we just build good AI to detect bad AI, right? Why don’t we have a solution where I just take a picture and I throw it into a machine and it tells me thumbs-up or thumbs-down if this is AI generated or not? And the question around detection is a really tricky one. I’m curious what you all think about, sort of, the question of, can detection solve this problem or not?

NESS: So I’ll mention one thing. So Madeleine mentioned an application of this technology called text to b-roll. And so what this is, technically speaking, what this is doing is you’re taking real footage, you stick it in a database, it’s quote, unquote “vectorized” into these representations that the AI can understand, and then you say, hey, generate a video that illustrates this narrative for me. And you provide it the text narrative, and then it goes and pulls out a whole bunch of real video from a database and curates them into a short video that you could put on TikTok, for example. So this was a fully AI-generated product, but none of the actual content is synthetic.

BADANES: Ah, right.

NESS: So in that case, your quote, unquote “AI detection tool” is not going to work.

DAEPP: Yeah, I mean, something that I find really fascinating any time that you’re dealing with a sociotechnical system, right—a technical system embedded in social context—is folks, you know, think that things are easy that are hard and things are hard that are easy, right. And so with a lot of the detections work, right, like if you put a deepfake detector out, you make that available to anyone, then what they can do is they can run a bunch of stuff by it, …

BADANES: Yeah.

DAEPP: … add a little bit of random noise, and then the deepfake detector doesn’t work anymore. And so that detection, actually, technically becomes an arms race, you know. And we’re seeing now some detectors that, like, you know, work when you’re not looking at a specific image or a specific piece of text but you’re looking at a lot all at once. That seems more promising. But, just, this is a very, very technically difficult problem, and that puts us as researchers in a really tricky place because, you know, you’re talking to folks who say, why can’t you just solve this? If you put this out, then you have to put the detector out. And we’re like, that’s actually not, that’s not a technically feasible long-term solution in this space. And the solutions are going to be social and regulatory and, you know, changes in norms as well as technical solutions that maybe are about everything outside of AI, right.

BADANES: Yeah.

DAEPP: Not about fixing the AI system but fixing the context within which it’s used.

BADANES: It’s not just a technological solution. There’s more to it. Robert?

NESS: So if somebody were to push back there, they could say, well, great; in the long term, maybe it’s an arms race, but in the short term, right, we can have solutions out there that, you know, at least in the next election cycle, we could maybe prevent some of these things from happening. And, again, kind of harkening back to cybersecurity, maybe if you make it hard enough, only the really dedicated, really high-funded people are going to be doing it rather than, you know, everybody who wants to throw a bunch of deepfakes on the internet. But the problem still there is that it focuses really on video and images, right.

BADANES: Yeah. What about audio?

NESS: What about audio? And what about text? So …

BADANES: Yeah. Those are hard. I feel like we’ve talked a lot about definitions and theoretical, but I want to make sure we talk more about what you guys saw and researched and understood on the ground, in particular, your trips to India and Taiwan and even if you want to reflect on how those compare to the US environment. What did you actually uncover? What surprised you? What was different between those countries?

DAEPP: Yeah, I mean, right, so Taiwan … both of these places are young democracies. And that’s really interesting, right. So like in Taiwan, for example, when people vote, they vote on paper. And anybody can go watch. That’s part of their, like, security strategies. Like, anyone around the world can just come and watch. People come from far. They fly in from Canada and Japan and elsewhere just to watch Taiwanese people vote. And then similarly in India, there’s this rule where you have to be walking distance from your polling place, and so the election takes two months. And, like, your polling places move from place to place, and sometimes, it arrives on an elephant. And so these were really interesting places to, like, I as an American, just, like, found it very, very fascinating to and important to be outside of the American context. You know, we just take for granted that how we do democracy is how other people do it. But Taiwan was very much a joint, like, civil society–government everyday response to this challenge of having a lot of efforts to manipulate public opinion happening with, you know, real-world speeches, with AI, with anything that you can imagine. You know, and I think the Microsoft Threat Analysis Center released a report documenting some of the, sort of, video stuff[2]. There’s a use of AI to create videos the night before the election, things like this. But then India is really thinking of … so India, right, it’s the world’s biggest democracy, right. Like, nearly a billion people were eligible to vote.

BADANES: Yeah.

NESS: And arguably the most diverse, right?

DAEPP: Yeah, arguably the most diverse in terms of languages, contexts. And it’s also positioning itself as the AI laboratory for the Global South. And so folks, including folks at the MSR (Microsoft Research) Bangalore lab, are leaders in thinking about representing low-resource languages, right, thinking about cultural representation in AI models. And so there you have all of these technologists who are really trying to innovate and really trying to think about what’s the next clever application, what’s the next clever use. And so that, sort of, that taxonomy that we talked about, like, I think just every week, every interview, we, sort of, had new things to add because folks there were just constantly trying all different kinds of ways of engaging with the public.

NESS: Yeah, I think for me, in India in particular, you know, India is an engineering culture, right. In terms of, like, the professional culture there, they’re very, kind of, engineering skewed. And so I think one of the bigger surprises for me was seeing people who were very experienced and effective campaign operatives, right, people who would go and, you know, hit the pavement; do door knocking; kind of, segment neighborhoods by demographics and voter block, these people were also, you know, graduated in engineering from an IIT (Indian Institute of Technology), …

BADANES: Sure.

NESS: … right, and so … [LAUGHS]  so they were happy to pick up these tools and leverage them to support their expertise in this work, and so some of the, you know, I think a lot of the narrative that we tell ourselves in AI is how it’s going to be, kind of, replacing people in doing their work. But what I saw in India was that people who were very effective had a lot of domain expertise that you couldn’t really automate away and they were the ones who are the early adopters of these tools and were applying it in ways that I think we’re behind on in terms of, you know, ideas in the US.

BADANES: Yeah, I mean, there’s, sort of, this sentiment that AI only augments existing problems and can enhance existing solutions, right. So we’re not great at translation tools, but AI will make us much better at that. But that also can then be weaponized and used as a tool to deceive people, which propaganda is not new, right? We’re only scaling or making existing problems harder, or adversaries are trying to weaponize AI to build on things they’ve already been doing, whether that’s cyberattacks or influence operations. And while the three of us are in different roles, we do work for the same company. And it’s a large technology company that is helping bring AI to the world. At the same time, I think there are some responsibilities when we look at, you know, bad actors who are looking to manipulate our products to create and spread this kind of deceptive media, whether it’s in elections or in other cases like financial fraud or other ways that we see this being leveraged. I’m curious what you all heard from others when you’ve been doing your research and also what you think our responsibilities are as a big tech company when it comes to keeping actors from using our products in those ways.

DAEPP: You know, when I started using GPT-4, one of the things I did was I called my parents, and I said, if you hear me on a phone call, …

BADANES: Yeah.

DAEPP: … like, please double check. Ask me things that only I would know. And when I walk around Building 99, which is, kind of, a storied building in which a lot of Microsoft researchers work, everybody did that call. We all called our parents.

BADANES: Interesting.

DAEPP: Or, you know, we all checked in. So just as, like, we have a responsibility to the folks that we care about, I think as a company, that same, sort of, like, raising literacy around the types of fraud to expect and how to protect yourself from them—I think that gets back to that fraud space that we talked about—and, you know, supporting law enforcement, sharing what needs to be shared, I think that without question is a space that we need to work in. I will say a lot of the folks we talked with, they were using Llama on a local GPU, right.

BADANES: OK.

DAEPP: They were using open-source models. They were sometimes … they were testing out Phi. They would use Phi, Grok, Llama, like anything like that. And so that raises an interesting question about our guardrails and our safety practices. And I think there, we have an, like, our obligation and our opportunity actually is to set the standard, right. To say, OK, like, you know, if you use local Llama and it spouts a bunch of stuff about voter suppression, like, you can get in trouble for that. And so what does it mean to have a safe AI that wins in the marketplace, right? That’s an AI that people can feel confident and comfortable about using and one that’s societally safe but also personally safe. And I think that’s both a challenge and a real opportunity for us.

BADANES: Yeah … oh, go ahead, Robert, yeah …

NESS: Going back to the point about fraud. It was this year, in January, when that British engineering firm Arup, when somebody used a deepfake to defraud that company of about $25 million, …

BADANES: Yeah.

NESS: … their Hong Kong office. And after that happened, some business managers in Microsoft reached out to me regarding a major client who wanted to start red teaming. And by red teaming, I mean intentionally targeting your executives and employees with these types of attacks in order to figure out where your vulnerabilities as an organization are. And I think, yeah, it got me thinking like, wow, I would, you know, can we do this for my dad? [LAUGHS] Because I think that was actually a theme that came out from a lot of this work, which was, like, how can we empower the people who are really on the frontlines of defending democracy in some of these places in terms of the tooling there? So we talked about, say, AI detection tools, but the people who are actually doing fact-checking, they’re looking more than at just the video or the images; they’re actually looking at a, kind of, holistic … taking a holistic view of the news story and doing some proper investigative journalism to see if something is fake or not.

BADANES: Yeah.

NESS: And so I think as a company who creates products, can we take a more of a product mindset to building tools that support that entire workflow in terms of fact-checking or investigative journalism in the context of democratic outcomes …

BADANES: Yeah.

NESS: … where maybe looking at individual deepfake content is just a piece of that.

BADANES: Yeah, you know, I think there’s a lot of parallels here to cybersecurity. That’s also what we’ve found, is this idea that, first of all, the “no silver bullet,” as we were talking about earlier with the detection piece. Like, you can’t expect your system to be secure just because you have a firewall, right. You have to have this, like, defense in-depth approach where you have lots of different layers. And one of those layers has been on the literacy side, right. Training and teaching people not to click on a phishing link, understanding that they should scroll over the URL. Like, these are efforts that have been taken up, sort of, in a broad societal sense. Employers do it. Big tech companies do it. Governments do it through PSAs and other things. So there’s been a concerted effort to get a population who might not have been aware of the fact that they were about to be scammed to now know not to click on that link. I think, you know, you raised the point about literacy. And I think there’s something to be said about media literacy in this space. It’s both AI literacy—understanding what it is—but also understanding that people may try to defraud you. And whether that is in the political sense or in the financial sense, once you have that, sort of, skill set in place, you’re going to be protected. One thing that I’ve heard, though, as I have conversations about this challenge … I’ve heard a couple things back from people specifically in civil society. One is not to put the impetus too much on the end consumer, which I think I’m hearing that we also recognize there’s things that we as technology companies should be focusing on. But the other thing is the concern that in, sort of, the long run, we’re going to all lose trust in everything we see anyway. And I’ve heard some people refer to that as the trust deficit. Have you all seen anything promising in the space to give you a sense around, can we ever trust what we’re looking at again, or are we actually just training everyone to not believe anything they see? Which I hope is not the case. I am an optimist. But I’d love to hear what you all came across. Are there signs of hope here where we might actually have a place where we can trust what we see again? 

DAEPP: Yeah. So two things. There is this phenomenon called the liar’s dividend, right, … 

BADANES: Sure, yeah.

DAEPP: … which is where that if you educate folks about how AI can be used to create fake clips, fake audio clips, fake videos, then if somebody has a real audio clip, a real video, they can claim that it’s AI. And I think we talk, you know, again, this is, like, in a US-centric space, we talk about this with politicians, but the space in which this is really concerning, I think, is war crimes, right …

BADANES: Oh, yeah.

DAEPP: … I think are these real human rights infractions where you can prevent evidence from getting out or being taken seriously. And we do see that right after invasions, for example, these days. But this is actually a space … like, I just told you, like, oh, like, detection is so hard and not technically, like, that’ll be an arms race! But actually, there is this wonderful project, Project Providence, that is a Microsoft collaboration with a company called Truepic that … it’s, like, an app, right. And what happens is when you take a photo using this app, it encrypts the, you know, hashes the GPS coordinates where the photo was taken, the time, the day, and uploads that with the pixels, with the image, to Azure. And then later, when a journalist goes to use that image, they can see that the pixels are exactly the same, and then they can check the location and they can confirm the GPS. And this actually meets evidentiary standards for the UN human rights tribunal, right.

BADANES: Right.

DAEPP: So this is being used in Ukraine to document war crimes. And so, you know, what if everybody had that app on their phone? That means you don’t … you know, most photos you take, you can use an AI tool and immediately play with. But in that particular situation where you need to confirm provenance and you need to confirm that this was a real event that happened, that is a technology that exists, and I think folks like the C2PA coalition (Coalition for Content Provenance and Authenticity) can make that happen across hardware providers.

NESS: And I think the challenge for me is, we can’t separate this problem from some of the other, kind of, fundamental problems that we have in our media environment now, right. So, for example, if I go on to my favorite social media app and I see videos from some conflicts around the world, and these videos could be not AI generated and I still could be, you know, the target of some PR campaign to promote certain content and suppress other ones. The videos could be authentic videos, but not actually be accurate depictions of what they claim to be. And so I think that this is a … the AI presents a complicating factor in an already difficult problem space. And I think, you know, trying to isolate these different variables and targeting them individually is pretty tricky. I do think that despite the liar’s dividend that media literacy is a very positive area to, kind of, focus energy …

BADANES: Yeah.

NESS: … in the sense that, you know, you mentioned earlier, like, using this term fraud, again, going back to this analogy with cybersecurity and cybercrime, that it tends to resonate with people. We saw that, as well, especially in Taiwan, didn’t we, Madeleine? Well, in India, too, with the sextortion fears. But in Taiwan, a lot of just cybercrime in terms of defrauding people of money. And one of the things that we had observed there was that talking about generative AI in the context of elections was difficult to talk to people about it because people, kind of, immediately went into their political camps, right.

BADANES: Yeah.

NESS: And so you had to, kind of, penetrate … you know, people were trying to, kind of, suss out which side you were on when you’re trying to educate them about this topic.

BADANES: Sure.

NESS: But if you talk to—but everybody’s, like, fraud itself is a lot less partisan.

BADANES: Yeah, it’s a neutral term.

NESS: Exactly. And so it becomes a very useful way to, kind of, get these ideas out there.

BADANES: That’s really interesting. And I love the provenance example because it really gets to the question about authenticity. Like, where did something come from? What is the origin of that media? Where has it traveled over time? And if AI is a component of it, then that’s a noted fact. But it doesn’t put us into the space of AI or not AI, which I think is where a lot of the, sort of, labeling has gone so far. And I understand the instinct to do that. But I like the idea of moving more towards how do you know more about an image of which whether there was AI involved or not is a component but does not have judgment. That does not make the picture good or bad. It doesn’t make it true or false. It’s just more information for you to consume. And then, of course, the media literacy piece, people need to know to look for those indicators and want them and ask for them from the technology company. So I think that’s a good, that’s a good silver lining. You gave me the light at the end of the tunnel I think I was looking for on the post-truth world. So, look, here’s the big question. You guys have been spending this time focusing on AI and democracy in this big, massive global election year. There was a lot of hype. [LAUGHS] There was a lot of hype. Lots of articles written about how this was going to be the AI election apocalypse. What say you? Was it? Was it not?

NESS: I think it was, well, we definitely have documented cases where this happened. And I’m wary of this question, particularly again from the cybersecurity standpoint, which is if you were not the victim of a terrible hack that brought down your entire company, would you say, like, well, it didn’t happen, so it’s not going to happen, right. You would never …

BADANES: Yeah.

NESS: That would be a silly attitude to have, right. And also, you don’t know what you don’t know, right. So, like, a lot of the, you know, we mentioned sextortion; we mentioned these cybercrimes. A lot of these are small-dollar crimes, which means they don’t get reported or they don’t get reported for reasons of shame. And so we don’t even have numbers on a lot of that. And we know that the political techniques are going to mirror the criminal techniques.

BADANES: Yeah.

NESS: And also, I worry about, say, down-ballot elections. Like, so much of, kind of, our election this year, a lot of the focus was on the national candidates, but, you know, if local poll workers are being targeted, if disinformation campaigns are being put out about local candidates, it’s not going to get the kind of play in the national media such that you and I might hear about it. And so I’m, you know, so I’ll hand it off to Madeleine, but yeah.

DAEPP: So absolutely agree with Robert’s point, right. If your child was affected by sextortion, if you are a country that had an audio clip go viral, this was the deepfake deluge for you, right. That said, something that happened, you know, in India as in the United States, there were major prosecutions very early on, right.

BADANES: Yeah.

DAEPP: So in India, there was a video. It turned out not to be a deepfake. It turned out to be a “cheap fake,” to your point about, you know, the question isn’t whether there’s AI involved; the question is whether this is an attempt to defraud. And five people were charged for this video.

BADANES: Yeah.

DAEPP: And in the United States, right, those Biden robocalls using Biden’s voice to tell folks not to vote, like, that led to a million-dollar fine, I think, for the telecoms and $6 million for the consultant who created that. And when we talk to people in India, you know, people who work in this space, they said, well, I’m not going to do that; like, I’m going to focus on other things. So internal actors pay attention to these things. That really changes what people do and how they do it. And so that, I do think the work that your team did, right, to educate candidates about looking out for the stuff, the work that the MTAC (Microsoft Threat Analysis Center) did to track usage and report it, all of that, I think, was, actually, those interventions, I think, worked. I think they were really important, and I do think that what we are … this absence of a deluge is actually a huge number of people making a very concerted effort to prevent it from happening.

BADANES: That’s encouraging.

NESS: Madeleine, you made a really important point that this deterrence from prosecution, it’s effective for internal actors, …

BADANES: Yeah.

DAEPP: Yeah, that’s right.

NESS: … right. So for foreign states who are trying to interfere with other people’s elections, the fear of prosecution is not going to be as much of a deterrent.

BADANES: That is true. I will say what we saw in this election cycle, in particular in the US, was a concerted effort by the intelligence community to call out and name nation-state actors who were either doing cyberattacks or influence operations, specific videos that they identified, whether there was AI involved or not. I think that level of communication with the public, while it maybe doesn’t lead to those actors going to jail—maybe someday—does in fact lead to a more aware public and therefore hopefully a less effective campaign. If people on the other end … and it’s a little bit into the literacy space, and it’s something that we’ve seen government again in this last cycle do very effectively, to name and shame, essentially, when they see these things, in part to make sure voters are aware of what’s happening. We’re not quite through this big global election year; we have a couple more elections before we really hit the end of the year, but it’s winding down. What is next for you all? Are you all going to continue this work? Are you going to build on it? What comes next?

DAEPP: So our research in India actually wasn’t focused specifically on elections. It was about AI and digital communications.

BADANES: Ahh.

DAEPP: Because, you know, again, like India is this laboratory.

BADANES: Sure.

DAEPP: And I think what we learned from that work is that, you know, this is going to be a part of our digital communications and our information system going forward without question. And the question is just, like, what are the viable business models, right? What are the applications that work? And again, that comes back to making sure that whatever AI … you know, people when they build AI into their entire, you know, newsletter-writing system, when they build it into their content production, that they can feel confident that it’s safe and that it meets their needs and that they’re protected when they use it. And similarly, like, what are those applications that really work, and how do you empower those lead users while mitigating those harms and supporting civil society and mitigating those harms? I think that’s an incredible, like, that’s—as a researcher—that’s, you know, that’s a career, right.

BADANES: Yeah.

DAEPP: That’s a wonderful research space. And so I think understanding how to support AI that is safe, that enables people globally to have self-determination in how models represent them, and that is usable and powerful, I think that’s broadly …

BADANES: Where this goes.

DAEPP: … what I want to drive.

BADANES: Robert, how about you?

NESS: You know, so I mentioned earlier on these AI alignment issues.

BADANES: Yeah.

NESS: And I was really fascinated by how local and contextual those issues really are. So to give an example from Taiwan, we train these models on training data that we find from the internet. Well, when it comes to, say, Mandarin Chinese, you can imagine the proportion of content, of just the quantity of content, on the internet that comes from China is a lot more than the quantity that comes from Taiwan. And of course, what’s politically correct in China is different from what’s politically correct in Taiwan. And so when we were talking to Taiwanese, a lot of people had these concerns about, you know, having these large language models that reflected Taiwanese values. We heard the same thing in India about just people on different sides of the political spectrum and, kind of, looking at … a YouTuber in India had walked us through this … how, for example, a founding father of India, there was a disparate literature in favor of this person and some more critical of this person, and he had spent time trying to suss out whether GPT-4 was on one side or the other.

BADANES: Oh. Whose side are you on? [LAUGHS]

NESS: Right, and so I think for our alignment research at Microsoft Research, this becomes the beginning of, kind of, a very fruitful way of engaging with local stakeholders and making sure that we can reflect these concerns in the models that we develop and deploy.

BADANES: Yeah. Well, first, I just want to thank you guys for all the work you’ve done. This is amazing. We’ve really enjoyed partnering with you. I’ve loved learning about the research and the efforts, and I’m excited to see what you do next. I always want to end these kinds of conversations on a more positive note, because we’ve talked a lot about the weaponization of AI and, you know, how … ethical areas that are confusing and … but I am sure at some point in your work, you came across really positive use cases of AI when it comes to democracy, or at least I hope you have. [LAUGHS] Do you have any examples or can you leave us with something about where you see either it going or actively being used in a way to really strengthen democratic processes or systems?

DAEPP: Yeah, I mean, there is just a big paper in Science, right, which, as researchers, when something comes out in Science, you know your field is about to change, right, …

BADANES: Yeah.

DAEPP: … showing that an AI model in, like, political deliberations, small groups of UK residents talking about difficult topics like Brexit, you know, climate crisis, difficult topics, that in these conversations, an AI moderator created, like, consensus statements that represented the majority opinion, still showed the minority opinion, but that participants preferred to a human-written statement and in fact preferred to their original opinion.

BADANES: Wow.

DAEPP: And that this, you know, not only works in these randomized controlled trials but actually works in a real citizens deliberation. And so that potential of, like, carefully fine-tuned, like, carefully aligned AI to actually help people find points of agreement, that’s a really exciting space.

BADANES: So next time my kids are in a fight, I’m going to point them to Copilot and say, work with Copilot to mediate. [LAUGHS] No, that’s really, really interesting. Robert, how about you?

NESS: She, kind of, stole my example. [LAUGHTER] But I’ll take it from a different perspective. So, yes, like how these technologies can enable people to collaborate and ideally, I think, from a democratic standpoint, at a local level, right. So, I mean, I think so much of our politics were, kind of, focused at the national-level campaign, but our opportunity to collaborate is much more … we’re much more easily … we can collaborate much more easily with people who are in our local constituencies. And I think to myself about, kind of, like, the decline particularly of local newspapers, local media.

BADANES: Right.

NESS: And so I wonder, you know, can these technologies help address that problem in terms of just, kind of, information about, say, your local community, as well as local politicians. And, yeah, and to Madeleine’s point, so Madeleine started the conversation talking about her background in urban planning and some of the work she did, you know, working on a local level with local officials to bring technology to the level of cities. And I think, like, well, you know, politics are local, right. So, you know, I think that that’s where there’s a lot of opportunity for improvement.

BADANES: Well, Robert, you just queued up a topic for a whole other podcast because our team also does a lot of work around journalism, and I will say we have seen that AI at the local level with local news is really a powerful tool that we’re starting to see a lot of appetite and interest for in order to overcome some of the hurdles they face right now in that industry when it comes to capacity, financing, you know, not able to be in all of the places they want to be at once to make sure that they’re reporting equally across the community. This is, like, a perfect use case for AI, and we’re starting to see folks who are really using it. So maybe we’ll come back and do this again another time on that topic. But I just want to thank you both, Madeleine and Robert, for joining us today and sharing your insights. This was really a fascinating conversation. I know I learned a lot. I hope that our listeners learned a lot, as well.

[MUSIC]

BADANES: And, listeners, I hope that you tune in for more episodes of Ideas, where we continue to explore the technologies shaping our future and the big ideas behind them. Thank you, guys, so much.

DAEPP: Thank you.

NESS: Thank you.

[MUSIC FADES]

[1] The video generation model Sora was released publicly earlier this month (opens in new tab).

[2] For a summary of and link to the report, see the Microsoft On the Issues blog post China tests US voter fault lines and ramps AI content to boost its geopolitical interests (opens in new tab).

The post Ideas: AI and democracy with Madeleine Daepp and Robert Osazuwa Ness appeared first on Microsoft Research.


Research Focus: Week of December 16, 2024

Research Focus: Week of December 16, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

The Compute Express Link (CXL) open standard interconnect enables integration of diverse types of memory into servers via its byte-addressable SerDes links. To fully utilize CXL-based heterogeneous memory systems (which combine different types of memory with varying access speeds), it’s necessary to implement efficient memory tiering—a strategy to manage data placement across memory tiers for optimal performance. Efficiently managing these memory systems is crucial, but has been challenging due to the lack of precise and efficient tools for understanding how memory is accessed.

In a recent paper: NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering, researchers from Microsoft propose a hardware/software co-design to address this problem. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf, which monitors memory accesses and provides the operating system (OS) with crucial page hotness statistics and other system state information. On the OS kernel side, the researchers designed a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on NeoProf statistics. Implemented on a real FPGA-based CXL memory platform and Linux kernel v6.3, NeoMem demonstrated 32% to 67% geomean speedup over several existing memory tiering solutions.
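To make the division of labor concrete, the sketch below simulates the OS-side half of the idea: take per-page access counts (the kind of hotness statistics a device-side profiler like NeoProf would report) and promote pages that cross a threshold into the fast tier. This is a minimal illustrative sketch in Python, not NeoMem’s actual kernel code; the threshold, capacity, and page names are assumptions.

    # Illustrative simulation of hotness-based page promotion (not NeoMem's real implementation).
    from collections import Counter

    HOT_THRESHOLD = 8        # assumed: accesses per profiling window that mark a page as hot
    FAST_TIER_CAPACITY = 2   # assumed: number of pages that fit in the fast (local DRAM) tier

    def promote_hot_pages(access_trace, fast_tier):
        """Count per-page accesses for one window and promote pages that cross the threshold."""
        counts = Counter(access_trace)                 # stand-in for device-reported statistics
        for page, count in counts.most_common():
            if count < HOT_THRESHOLD or page in fast_tier:
                continue
            if len(fast_tier) >= FAST_TIER_CAPACITY:   # evict an arbitrary resident page
                fast_tier.pop()
            fast_tier.add(page)                        # in the kernel this would be a page migration
        return fast_tier

    # One profiling window: pages "A" and "C" are accessed far more often than "B".
    trace = ["A"] * 12 + ["B"] * 3 + ["C"] * 9
    print(promote_hot_pages(trace, fast_tier=set()))   # e.g. {'A', 'C'}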


Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Planning and conducting chemical syntheses is a significant challenge in the discovery of functional small molecules, which limits the potential of generative AI for molecular inverse design. Although early machine learning-based retrosynthesis models have shown the ability to predict reasonable routes, they are less accurate for infrequent, yet important reactions.

In a recent paper: Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases, researchers from Microsoft and external colleagues address this limitation, with a new framework for building highly accurate reaction models. Chimera incorporates two newly developed models, each achieving state-of-the-art performance in their respective categories. Evaluations by PhD-level organic chemists show that Chimera’s predictions are preferred for their higher quality compared to baseline models.

The researchers further validate Chimera’s robustness by applying its largest-scale model to an internal dataset from a major pharmaceutical company, demonstrating its ability to generalize effectively under distribution shifts. This new framework shows the potential to substantially accelerate the development of even more accurate and versatile reaction prediction models.
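For intuition on what ensembling ranked predictions can look like, the sketch below merges candidate routes from two hypothetical retrosynthesis models using reciprocal-rank fusion, a generic rank-combination scheme. It is illustrative only; the route names are made up, and the paper’s actual ensembling strategy may differ.

    # Generic reciprocal-rank fusion of two models' ranked candidate lists (illustrative only).
    from collections import defaultdict

    def fuse_rankings(ranked_lists, k=60):
        """Candidates ranked highly by either model accumulate higher fused scores."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, candidate in enumerate(ranking, start=1):
                scores[candidate] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Hypothetical routes proposed by two models with different inductive biases.
    model_a = ["route_bromide", "route_chloride", "route_iodide"]
    model_b = ["route_bromide", "route_amide", "route_fluoride"]
    print(fuse_rankings([model_a, model_b]))   # routes favored by both models rank first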




The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution

In bioinformatics and computational biology, data analysis often involves chaining command-line programs developed by specialized teams at different institutions. These tools, which vary widely in age, software stacks, and dependencies, lack a common programming interface, which makes integration, workflow management and reproducibility challenging.

A recent article (opens in new tab) emphasizes the development, adoption and implementation of the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API, created in collaboration with researchers at Microsoft and other institutions. The TES API offers a unified schema and interface for submitting and managing tasks, seamlessly bridging gaps between on-premises high-performance and high-throughput computing systems, cloud platforms, and hybrid infrastructures. Its flexibility and extensibility have already made it a critical asset for applications ranging from federated data analysis to load balancing across multi-cloud systems.

Adopted by numerous service providers and integrated into several workflow engines, TES empowers researchers to execute complex computational tasks through a single, abstracted interface. This eliminates compatibility hurdles, accelerates research timelines, reduces costs and enables “compute to data” solutions—essential for tackling the challenges of distributed data analysis.
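As a rough sketch of what that single interface looks like from a client’s point of view, the snippet below submits a minimal task with Python’s requests library. The task shape (a name plus executors with a container image and command) follows the GA4GH TES v1 schema, but the endpoint URL is a placeholder and any authentication a real deployment requires is omitted.

    # Minimal TES task submission (illustrative; the endpoint and auth are deployment-specific).
    import requests

    TES_URL = "https://tes.example.org/ga4gh/tes/v1"   # placeholder endpoint

    task = {
        "name": "hello-tes",
        "executors": [
            {
                "image": "ubuntu:22.04",
                "command": ["echo", "hello from a TES-managed task"],
            }
        ],
    }

    resp = requests.post(f"{TES_URL}/tasks", json=task, timeout=30)
    resp.raise_for_status()
    task_id = resp.json()["id"]                        # the service returns an opaque task ID

    status = requests.get(f"{TES_URL}/tasks/{task_id}", timeout=30).json()
    print(status.get("state"))                         # e.g. QUEUED, RUNNING, or COMPLETE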


RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Increasing use of code agents for AI-assisted coding and software development has brought safety and security concerns, such as generating or executing malicious code, which have become significant barriers to real-world deployment of these agents.

In a recent paper: RedCode: Risky Code Execution and Generation Benchmark for Code Agents, published at NeurIPS 2024, researchers from Microsoft and external colleagues propose comprehensive and practical evaluations on the safety of code agents. RedCode is an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests.

This research evaluated three agents based on various large language models (LLMs), providing insights into code agents’ vulnerabilities. For instance, results showed that agents are more likely to reject executing unsafe operations on the operating system. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additional evaluations revealed that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated harmful software.

These findings highlight the need for stringent safety evaluations for diverse code agents. The underlying dataset and related code are publicly available at https://github.com/AI-secure/RedCode (opens in new tab).
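At its simplest, this kind of safety evaluation amounts to running an agent over risky prompts and measuring how often it refuses. The toy harness below illustrates that idea only; it is not RedCode’s code (the real benchmark executes agent actions inside sandboxed environments), and the agent and refusal check are hypothetical stand-ins.

    # Hypothetical mini-harness for measuring an agent's rejection rate (not RedCode itself).
    REFUSAL_MARKERS = ("i can't", "i cannot", "refuse", "not able to help")

    def rejection_rate(agent_fn, risky_prompts):
        """Fraction of risky prompts the agent declines rather than acts on."""
        rejected = sum(
            any(marker in agent_fn(prompt).lower() for marker in REFUSAL_MARKERS)
            for prompt in risky_prompts
        )
        return rejected / len(risky_prompts)

    # Toy agent that refuses natural-language requests but not code-formatted ones,
    # mirroring the kind of gap the benchmark is designed to expose.
    def toy_agent(prompt):
        return "Executing..." if prompt.strip().startswith("os.") else "I can't help with that."

    prompts = ["Delete every file in /etc", "os.remove('/etc/passwd')"]
    print(f"rejection rate: {rejection_rate(toy_agent, prompts):.0%}")   # 50%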


Towards industrial foundation models: Integrating large language models with industrial data intelligence

Although large language models (LLMs) excel at language-focused tasks like news writing, document summarization, customer service, and supporting virtual assistants, they can face challenges when it comes to learning and inference on numeric and structured industry data, such as tabular and time series data. To address these issues, researchers from Microsoft propose a new approach to building industrial foundation models (IFMs). As outlined in a recent blog post, they have successfully demonstrated the feasibility of cross-domain universal in-context learning on tabular data and its significant potential.

The researchers designed Generative Tabular Learning (opens in new tab) (GTL), a new framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, flexibly responding to diverse data science tasks. This technical paradigm has been open-sourced (opens in new tab) to promote broader use.
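In-context learning on tabular data boils down to serializing labeled rows into a prompt and asking the model to complete the label for a new row. The snippet below builds such a prompt to illustrate the general recipe; the column names and values are hypothetical, and the actual GTL prompt format and model interface may differ.

    # Building a few-shot tabular prompt for an LLM (illustrates the general recipe, not GTL's code).
    def tabular_prompt(columns, labeled_rows, query_row, target):
        lines = [f"Predict '{target}' from the columns: {', '.join(columns)}."]
        for row, label in labeled_rows:
            features = ", ".join(f"{c}={v}" for c, v in zip(columns, row))
            lines.append(f"{features} -> {target}={label}")
        features = ", ".join(f"{c}={v}" for c, v in zip(columns, query_row))
        lines.append(f"{features} -> {target}=")        # the LLM completes this final label
        return "\n".join(lines)

    cols = ["sensor_temp", "vibration", "runtime_hours"]              # hypothetical features
    examples = [((71.2, 0.3, 1200), "ok"), ((93.5, 1.8, 4100), "fault")]
    print(tabular_prompt(cols, examples, query_row=(88.0, 1.1, 3900), target="machine_state"))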

Microsoft Research in the news


Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king 

December 12, 2024

Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors.


Microsoft researcher Ece Kamar discusses the future of AI agents in 2025 

Tech Brew | December 12, 2024

With AI agents widely expected to take off in 2025, the director of Microsoft’s AI Frontiers lab weighs in on the future of this technology, the safeguards needed, and the year ahead in AI research.


A new frontier awaits — computing with light 

December 12, 2024

In the guts of a new type of computer, a bunch of tiny LEDs emit a green glow. Those lights have a job to do. They’re performing calculations. Right now, this math is telling the computer how to identify handwritten images of numbers. The computer is part of a research program at Microsoft.

The post Research Focus: Week of December 16, 2024 appeared first on Microsoft Research.


NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

Illustrated headshots of Lidong Zhou and Eliza Strickland

The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits.

Just after his keynote at the 38th annual Conference on Neural Information Processing Systems (NeurIPS), Microsoft Corporate Vice President Lidong Zhou joins guest host Eliza Strickland of IEEE Spectrum at the conference to further explore the topic of his talk: the co-evolution of systems and AI. Zhou, who is also chief scientist of the Microsoft Asia-Pacific Research and Development Group and managing director of Microsoft Research Asia, discusses how rapidly advancing AI impacts the systems supporting it; AI as a tool for improving systems engineering itself; and how budding computer scientists can prepare for innovating in a world where AI and systems grow together.

Learn more: 

Verus: A Practical Foundation for Systems Verification
Publication, November 2024

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Publication, July 2024

BitNet: Scaling 1-bit Transformers for Large Language Models
Publication, October 2023

Transcript

[MUSIC]

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.

I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.

[MUSIC FADES]

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a Microsoft corporate vice president, chief scientist of the Microsoft Asia-Pacific Research and Development Group, and managing director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on the co-evolution of AI and systems engineering.

Lidong, welcome to the podcast.


LIDONG ZHOU: Thank you, Eliza. It’s such a pleasure to be here.

STRICKLAND: You said in your keynote that progress in AI is now outpacing progress in the systems supporting AI. Can you give me some concrete examples of where the current infrastructure is struggling to keep up?

ZHOU: Yeah. So actually, we have been working on supporting AI from the infrastructure perspective, and I can say, you know, there are at least three dimensions where it’s actually posing a lot of challenges. One dimension is the scale of the AI systems that we have to support. You know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system, you actually have to rethink how to design a system, develop a new methodology, revisit all the assumptions. And it becomes very challenging for the community to keep up. And the other dimension is if you look at AI systems, it’s actually a whole-stack kind of design. You have to understand not only the AI workloads, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are all aligned to deliver the best performance. And the third dimension is the temporal dimension, where you really see accelerated growth and the pace of innovation in AI and not actually only in AI but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate on the systems side because we really have to keep up in that dimension, as well. So all those three dimensions add up. It’s becoming a pretty challenging task for the whole systems community.

STRICKLAND: I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice, and how might it change the way we approach both fields?

ZHOU: Yeah, so I’m actually a big fan of the systems community and AI community working together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI. But now increasingly, we’re seeing opportunities where AI can actually help developers to become more productive and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want to see the two communities work together even more closely going forward. You know, I talk about, sort of, the three pillars, right—the efficiency; there’s trust; there’s also the infusion of the two (AI and systems engineering)—which are three ambitions that we are actually working on. And we see very encouraging early results that make us believe that there’s much more to be achieved going forward with the two communities working together.

STRICKLAND: You mentioned the challenge of scaling. I think everyone at NeurIPS is talking about scaling. And you’ve highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs or new ideas in systems engineering could help AI achieve greater efficiencies?

ZHOU: Yeah, that’s another great question. I think there are a couple of aspects to efficiency. So this morning, I talked about some of the innovations in model architecture. So our researchers have been looking into BitNet, which essentially tries to use one bit or, actually, a ternary representation for the weights in all those AI models rather than FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking the software and even the hardware stack so that, you know, those innovations that you have in the model architecture can actually have the end-to-end benefits. And that’s, you know, one of the dimensions where we see the co-innovation of AI and the underlying system to deliver some efficiency gains for AI models, for example. But there’s another dimension, which I think is also very important. With all the AI infrastructure that we build to support AI, there’s actually huge room for improvement, as well. And this is where AI can actually be utilized to solve some of the very challenging systems problems, for optimization, for reliability, for trustworthiness. And I used some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward.

STRICKLAND: Yeah. It’s interesting to think about how GPUs and large language models are so intertwined at this point. You can’t really have one without the other. And you said in your talk you sort of see the need to decouple the architectures and the hardware. Is that right?

ZHOU: Yes. Yeah, so this is always, you know, like a very systems type of thinking where, you know, you really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities, you know, larger design space, for each field. And what we are observing now, which is actually very typical in relatively mature fields, is that we have GPUs dominating in the hardware land and all the model architecture has to be designed and, you know, proven very efficient on GPUs. And that limits the design space for model architecture. And similarly, you know, if you look at hardware, it’s very hard for hardware innovations to happen because now you have to show that that hardware is actually great for all the models that have been optimized for GPUs. So I think, you know, from a systems perspective, if you design the right abstraction between the AI and the hardware, it’s possible for these two domains to actually evolve separately and have a much larger design space, actually, to find the best solution for both.

STRICKLAND: And when you think about systems engineering, are there ways that AI can be used to optimize your own work?

ZHOU: Yes, I think there are. Two examples that I gave this morning, one is, you know, in systems there’s this what we call a holy grail of systems research because we want to build trustworthy systems that people can depend on. And one of the approaches is called verified systems. And this has been a very active research area in systems because there are a lot of advancements in formal methods and in how we can infuse formal methods into building real systems. But it’s still very hard for the general systems community because, you know, you really have to understand how formal methods work and so on. And so it’s still not within reach. You know, like when we build mission-critical systems, we want them to be completely verified so, you know, you don’t have to do a lot of testing to show that there are no bugs. You’ll never be able to show there are no bugs with testing. But if you …

STRICKLAND: Sorry, can I pause you for one moment? Could you define formal verification for our listeners, just in case they don’t know?

ZHOU: Yeah, that’s a good point. I think the easy way to think about formal verification is that it uses mathematical logic to describe, say, a program and, you know, it can represent some properties in math, essentially, in logic. And then you can use a proof to show that the program has certain properties that you desire, and a simple form, like, a very preliminary form of formal (specification for) verification is, you know, just assertions in the program, right, where it, say, asserts A is not equal to zero. And that’s a very simple form of logic that must hold (or be proven to hold), and then, you know, the proof system gets much more complicated when you talk about more advanced properties of programs, their correctness, and so on.

STRICKLAND: Mm-hm.

ZHOU: So I think that the opportunity that we’re seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems, at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we’re going to see a paradigm shift in systems where some pieces of system components will actually be implemented using AI. [What] is interesting is, you know, systems are generally deterministic, so, you know, when you look at a traditional computer system, you want to know that it’s actually acting as you expected, but AI, you know, it can be stochastic, right. And it might not always give you the same answer. But how you combine these two is another area where I see a lot of opportunities for breakthroughs.

STRICKLAND: Yeah, yeah. I wanted to back up in your career a little bit and talk about the concept of gray failures because you were really instrumental in defining this concept, which for people who don’t know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you’re still thinking about gray failures in the context of your thinking about AI and systems. Are gray failures having an impact on AI today?

ZHOU: Yes, definitely. So when we were looking at cloud systems, we realized the … so in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of the methodology we developed in systems for reliability actually no longer applies. One of the reasons is we have those gray failures. And then we moved to looking at AI infrastructure. The problem is actually even worse because what we realize is there’s a lot of built-in redundancy at every level, like in GPUs, memory, or all the communication channels. And because of those built-in redundancies, sometimes the system is experiencing failures, but they’re being masked because of the redundancies. And that makes it very hard for us to actually maintain the system, debug the system, or troubleshoot it. And for AI infrastructure, what we have developed is a very different approach using proactive validation rather than reactive repair. And this is actually a paper that we published recently at USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply in a new meaning.

STRICKLAND: Mm. I like that. Yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about AI-infused systems. And what would that look like?

ZHOU: Yeah, so I think AI is going to transform almost everything, and that includes systems. That’s why I’m so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and, sort of, the combination of AI and that particular domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity where there are a lot of hard system challenges can be addressed by AI. But we need to define the right interface between the system and the AI so that we can leverage the advantage of both, right. Like, AI is creative. It comes up with solutions that, you know, people might not think of, but it’s also a little bit random sometimes. It could, you know, give you wrong answers. But systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds.

STRICKLAND: Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that?

ZHOU: Yeah. This is a pretty interesting project that is actually done in Microsoft Research Asia jointly with the Azure team where we look at collective communication, which is a major part of AI infrastructure. And it turns out, you know, there’s a lot of room for optimization. It was initially done manually. So an expert had to take a look at the system and look at the different configurations and do all kinds of experiments, and, you know, it takes about two weeks to come up with a solution. This is why I say, you know, productivity is becoming a bottleneck for our AI infrastructure because people are in the loop who have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions. It can actually develop good system insights based on the observations from the system. And so OptiFlow, what it does is it comes up with the, sort of, the algorithm or the schedule of communications for different collective communication primitives. And it turns out to be able to discover algorithms that are much better than the default ones, you know, for different settings. And it’s giving us the benefits of productivity as well as efficiency.

STRICKLAND: And you said that this is in production today, right?

ZHOU: Yes. It is in production.

STRICKLAND: That’s exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists?

ZHOU: Yeah, that’s a very deep question. As I mentioned, I think being fluent in AI is very important. But I also believe that domain expertise is probably undervalued in many ways. And I see a lot of need for this interdisciplinary kind of education, where someone not only understands AI and what AI technology can do but also understands a particular domain very well. And those are the people who will be able to figure out the future for that particular domain with the power of AI. And I think for students, certainly it’s no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together, and so you have to be an expert in multiple domains to see new opportunities for innovation.

STRICKLAND: So what advice would you give to a high school student who’s just starting out and thinks, ah, I want to get into AI?

ZHOU: Yeah, I mean certainly there’s a lot of excitement over AI, and it would be great for high school students to, actually, to have the firsthand experience. And I think it’s their world in the future. Because they probably can imagine a lot of things from scratch. I think they probably have the opportunity to disrupt a lot of the things that we take for granted today. So I think just use their imagination. And I don’t think we have really good advice for the young generation. It’s going to be their creativity and their imagination. And AI is definitely going to empower them to do something that’s going to be amazing.

STRICKLAND: Something that we probably can’t even imagine.

ZHOU: Right.

STRICKLAND: Yeah.

ZHOU: I think so.

STRICKLAND: I like that. So as we close, I’m hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together, but also if you have any concerns, what concerns you most?

ZHOU: Yeah, I think in terms of AI systems, I’m certainly pretty excited about what we can do together, you know, with a combination of AI and systems. There are a lot of low-hanging fruit, and there are also a lot of potential grand challenges that we can actually take on. I mentioned a couple in this morning’s talk. And certainly, you know, we also want to look at the risks that could happen, especially when we have systems and AI start to evolve together. And this is also in an area where having some sort of trust foundation is very important so we can have some assurance of the kind of system or AI system that we are going to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences or unintended issues.

[MUSIC]

STRICKLAND: Well, Lidong Zhou, thank you so much for joining us on the podcast. I really enjoyed the conversation.

ZHOU: It’s such a pleasure, Eliza.

STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at Microsoft.com/research. Until next time.

[MUSIC FADES]

The post NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou appeared first on Microsoft Research.


PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

A diagram illustrating the joint optimization process of instructions and in-context examples in PromptWizard. The figure demonstrates how the framework iteratively refines both components, integrating feedback to enhance the overall prompt effectiveness and adaptability across tasks.

The challenge of effective prompting

AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error. 

This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks? 

To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase (opens in new tab) to foster collaboration and innovation within the research and development community.

Introducing PromptWizard

PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.

PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance. 

Three key insights behind PromptWizard:

  • Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples. 
  • Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively. 
  • Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.
Fig 1: A diagram providing an overview of the PromptWizard process. It illustrates the main components, including iterative prompt generation, feedback-based refinement, and joint optimization of instructions and examples. The workflow emphasizes modularity and adaptability, demonstrating how PromptWizard evolves prompts to improve performance across diverse tasks.
Figure 1. Overview of PromptWizard

How PromptWizard works

PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.

Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses. 

Stage 1: Refinement of prompt instruction

The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.

For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this iterative cycle ensures that the instruction converges to an optimal state. 

Fig 2: A visualization of the refinement process for prompt instructions in PromptWizard. The figure highlights iterative improvements, where initial instructions are critiqued, adjusted based on feedback, and fine-tuned to achieve greater accuracy and alignment with task objectives.
Figure 2. Refinement of prompt instruction
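Conceptually, the Stage 1 loop alternates generation, critique, and synthesis. The sketch below is schematic rather than PromptWizard’s released implementation: llm stands in for any chat-completion call, and score for whatever task metric is used to evaluate candidate instructions.

    # Schematic generate-critique-refine loop (illustrative, not PromptWizard's actual code).
    def refine_instruction(llm, score, seed_instruction, iterations=4, candidates=3):
        best = seed_instruction
        for _ in range(iterations):
            # Explore: ask the LLM for varied rewrites of the current best instruction.
            variants = [llm(f"Rewrite this task instruction differently:\n{best}")
                        for _ in range(candidates)]
            # Critique: have the LLM point out weaknesses of the strongest candidate so far.
            leader = max(variants + [best], key=score)
            critique = llm(f"List weaknesses of this instruction:\n{leader}")
            # Synthesize: fold the critique back into an improved instruction (exploit).
            improved = llm(f"Improve the instruction below by fixing these weaknesses.\n"
                           f"Instruction: {leader}\nWeaknesses: {critique}")
            best = max([leader, improved], key=score)
        return best

    # Usage with stand-ins: any callable LLM client and any evaluation function will do.
    # best_prompt = refine_instruction(my_llm, my_validation_accuracy, "Solve the math problem.")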

Stage 2: Joint optimization of instructions and examples

The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.

This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content. 

Fig 3: A diagram illustrating the joint optimization process of instructions and in-context examples in PromptWizard. The figure demonstrates how the framework iteratively refines both components, integrating feedback to enhance the overall prompt effectiveness and adaptability across tasks.
Figure 3. Joint optimization of instructions and examples
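Continuing the sketch above, the joint stage treats the instruction and its in-context examples as one bundle: the LLM synthesizes additional task-aware examples, and candidate bundles are scored as they would actually be used. Again, this is an illustrative outline with hypothetical helpers, not the released implementation.

    # Schematic joint optimization of instruction and few-shot examples (illustrative only).
    import itertools

    def optimize_jointly(llm, score, instruction, example_pool, shots=3, synthesize=2):
        # Let the LLM synthesize a few extra task-aware examples to enrich the pool.
        pool = list(example_pool)
        for _ in range(synthesize):
            pool.append(llm(f"Write one new worked example for this task:\n{instruction}"))
        # Score instruction + example combinations as a unit and keep the best bundle.
        best_bundle, best_score = None, float("-inf")
        for combo in itertools.combinations(pool, shots):
            s = score(instruction, list(combo))    # evaluate the prompt exactly as it would be used
            if s > best_score:
                best_bundle, best_score = (instruction, list(combo)), s
        return best_bundle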



Results

PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency. 

Comprehensive evaluation across tasks

PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see the detailed results in our paper.

  • Accuracy: PW consistently outperformed other methods, maintaining performance close to the best across all tasks. Figure 4 shows the performance profile curve that highlights PW’s reliability, demonstrating how frequently it achieves near-best accuracy compared to other approaches on the BigBench Instruction Induction (BBII) dataset.
  • Efficiency: Beyond accuracy, PW demonstrates its computational efficiency. Unlike many baseline methods that require extensive API calls and computational resources, PW achieves superior results with minimal overhead by striking an effective balance between exploration and exploitation. Table 1 demonstrates PW’s cost-effectiveness, with significantly reduced token usage for input and output while optimizing prompts effectively.
Fig 4: A performance profile curve illustrating PromptWizard's reliability on the BigBench Instruction Induction (BBII) dataset. The curve demonstrates how often PromptWizard achieves accuracy levels close to the best performance when compared to other approaches, highlighting its consistency and effectiveness.
Figure 4. Performance Profile curve on BBII dataset
Methods         API calls   Total tokens
Instinct        1730        115k
PromptBreeder   18600       1488k
EvoPrompt       5000        400k
PW              69          24k
Table 1. Cost analysis on BBII dataset

We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.

Resilience with limited data

Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2). 

Dataset     5 Examples   25 Examples
MMLU        80.4         89.5
GSM8k       94           95.4
Ethos       86.4         89.4
PubMedQA    68           78.2
MedQA       80.4         82.9
Average     81.9         87
Table 2. PW’s performance with varying number of examples

Leveraging smaller models for optimization

PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).

Dataset    Prompt Gen: Llama-70B   Prompt Gen: GPT-4
GSM8k      94.6                    95.4
Ethos      89.2                    89.4
Average    91.9                    92.4
Table 3. Performance with smaller LLMs for prompt generation

PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.

 Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.

The post PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts appeared first on Microsoft Research.

Read More

Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users

Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users

GraphRAG blog hero - cluster of small circular nodes on a blue/green gradient background

Introducing GraphRAG 1.0

Microsoft debuted (opens in new tab) the pre-release version of GraphRAG (opens in new tab) in July 2024 to advance AI use in complex domains. Since that time, we’ve seen incredible adoption and community engagement (over 20k stars and 2k forks on GitHub as of this writing), with numerous fixes and improvements by the core team and community contributors. We’re deeply grateful for the contributions and feedback we’ve received and are excited to share a number of major ergonomic and structural improvements that culminate in the official release of GraphRAG 1.0. 

Ergonomic refactors

Easier setup for new projects

When we first launched GraphRAG, most config was done using environment variables, which could be daunting, given the many options available. We’ve reduced the friction on setup by adding an init command (opens in new tab) that generates a simplified starter settings.yml file with all core required config already set. We recommend developers start here to ensure they get the clearest initial config. With this update, a minimal starting config does not require the user to have expertise with GraphRAG for a quick setup, only an OpenAI API key in their environment. 

New and expanded command line interface

We expanded the functionality and ease of use of the command line interface (opens in new tab) (CLI) and adopted Typer (opens in new tab) to provide better inline documentation and a richer CLI experience. The original CLI was intended as a starter demo for users to try GraphRAG on a sample dataset. We’ve since learned from the community that most people actually want to use this as their primary interaction mode for GraphRAG, so as part of this milestone release, we’ve incorporated enhancements that result in a more streamlined experience. From this work, CLI startup times dropped from an average of 148 seconds to 2 seconds. 

Consolidated API layer

In August 2024 we introduced a standalone API layer to simplify developer usage. The original CLI contained all the code required to instantiate and execute basic indexing and query commands, which users often needed to replicate. The API layer is still considered provisional as we gather feedback, but is intended to be the primary entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep pipeline or query class customization. In fact, the CLI and Accelerator (opens in new tab) are built entirely on top of the API layer, acting as a documented example of how to interact with the API. We have also added examples of how to use this API to our notebook collection (opens in new tab) that we will continue to update as we iterate in future releases. 
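For orientation, an integration through the API layer tends to follow an index-then-query shape like the sketch below. The names here (graphrag_api, build_index, global_search) are hypothetical placeholders rather than the library’s documented entry points; the notebook collection linked above shows the real calls.

    # Hypothetical shape of an API-layer integration (placeholder names, not GraphRAG's real API;
    # see the project's notebook collection for the documented entry points).
    import asyncio

    async def build_and_query(graphrag_api, config, question):
        await graphrag_api.build_index(config=config)             # index the input documents
        answer = await graphrag_api.global_search(config=config,  # then query the knowledge model
                                                  query=question)
        return answer

    # asyncio.run(build_and_query(my_api, my_config, "What themes recur across the corpus?"))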

Simplified data model

GraphRAG creates several output artifacts to store the indexed knowledge model. The initial model contained a large number of files, fields, and cross-references based on experimental ideas during the early research, which can be overwhelming for both new and routine users. We performed a comprehensive review of the data model and incorporated fixes to add clarity and consistency, remove redundant or unused fields, improve storage space, and simplify the data models. Previously, the output lacked standardization, and relevant outputs could easily be confused with non-critical intermediary output files. Now with GraphRAG 1.0, the output will only include relevant outputs that are easily readable and traceable. 



Streamlined vector stores

Embeddings and their vector stores are some of the primary drivers of  GraphRAG’s storage needs. Our original data model stored all embeddings within the parquet output files after data ingestion and indexing. This made the files portable, which was convenient for early research, but for many users it became unnecessary as they configured their own vector stores and the scale of data ingestion grew. We have updated the GraphRAG pipeline to create a default vector store during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use. The benefit of this change is that those vectors (which can be quite large) no longer need to be loaded when the output files are read from disk, saving read time and memory during every query. Coupled with the simplified data model, this resulted in output parquet disk savings of 80%, and total disk space (including embeddings in the vector store) reduction of 43%. GraphRAG supports LanceDB and Azure AI Search out-of-the-box for vector stores. For simple startup, LanceDB is used as the default, and is written to a local database alongside the knowledge model artifacts. 
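Because LanceDB is the default and sits on disk next to the other artifacts, the embeddings can be inspected directly with the lancedb Python client. In the sketch below, the database path, table name, and embedding dimension are assumptions about a typical local setup, not guarantees of GraphRAG’s layout.

    # Querying a local LanceDB vector store with the lancedb client
    # (the path, table name, and vector size below are assumptions about a default local setup).
    import lancedb

    db = lancedb.connect("output/lancedb")               # assumed on-disk location of the store
    table = db.open_table("default-entity-description")  # assumed table name
    query_vector = [0.0] * 1536                          # replace with a real query embedding
    hits = table.search(query_vector).limit(5).to_pandas()
    print(hits.head())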

Flatter, clearer code structure

A key initiative on the road to version 1.0 has been to simplify the codebase so it is easier to maintain and more approachable for third-party users. We’ve removed much of the code depth from the organization to make it easier to browse, and co-located more code that our own usage patterns indicate was not required to be in separate functional areas. 

We have also found that very few users need the declarative configuration that the underlying DataShaper (opens in new tab) engine provides, so we collapsed these 88 verbose workflow definitions into a smaller set of 11 workflows that operate in a functional versus composed manner. This makes the pipeline easier to understand and is a step toward an architecture that is better suited for our future research plans and improves performance across the board. By collapsing workflows, we now have fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. This streamlining has also reduced the in-memory footprint of the pipeline, enabling users to index and analyze larger datasets with GraphRAG.

Incremental ingest

Until now, an evolving dataset needed complete re-indexing every time new information was acquired in order to re-generate the knowledge model. In GraphRAG 1.0 we are including a new update command in the CLI that computes the deltas between an existing index and newly added content and intelligently merges the updates to minimize re-indexing. GraphRAG uses an LLM caching mechanism to save as much cost as possible when re-indexing, so re-runs over a dataset are often significantly faster and cheaper than an initial run. Adding brand new content can alter the community structure such that much of an index needs to be re-computed – the update command (opens in new tab) resolves this while also improving answer quality. 

Availability

GraphRAG version 1.0 is now available on GitHub (opens in new tab), and published to PyPI (opens in new tab). Check out the Getting Started (opens in new tab) guide to use GraphRAG 1.0 today.

Migrating

We recommend users migrate to GraphRAG 1.0, which offers a streamlined experience including multiple improvements for both users and developers. However, because of the breadth of its updates, version 1.0 is not backwards compatible. If you’ve used GraphRAG prior to version 1.0 and have existing indexes, there are a handful of breaking changes that need to be addressed, but this should be a straightforward process. To support the community in this migration, we’ve created a migration guide (opens in new tab) in the repository with more information. 

Future directions

We recently posted about a brand-new approach to GraphRAG called LazyGraphRAG, which performs minimal up-front indexing to avoid LLM usage until user queries are executed. This avoids LLM-based summarization of large volumes of content that may not be interesting to users – and therefore never explored even after expensive processing. This approach shows strong performance at a fraction of the cost of GraphRAG, and will be added to the core GraphRAG codebase in the near future as a new option for users. 

Additionally, Microsoft has been active in exploring how GraphRAG can advance the rate of scientific progress, and is in the process of building relevant GraphRAG capabilities to align with our broader work in AI-enabled scientific discovery (opens in new tab).

We continue to refine the codebase and investigate architectural changes that will enable users to use their own language model APIs, storage providers, and vector stores. We’re excited about this major milestone, and the foundation that this refactoring lays for our continued research in the GraphRAG space.

The post Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users appeared first on Microsoft Research.

Read More

NeurIPS 2024: AI for Science with Chris Bishop

NeurIPS 2024: AI for Science with Chris Bishop

Illustrated headshots of Chris Bishop and Eliza Strickland.

The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits. 

In this special edition of the podcast, Technical Fellow and Microsoft Research AI for Science Director Chris Bishop joins guest host Eliza Strickland of IEEE Spectrum at the 38th annual Conference on Neural Information Processing Systems (NeurIPS) to talk about deep learning’s potential to improve the speed and scale at which scientific advancements can be made. Bishop discusses the factors considered when choosing which scientific challenges to tackle with AI; the impact foundation models are having right now in areas such as drug discovery and weather forecasting; and the work at NeurIPS that he’s excited about.

Learn more:

From forecasting storms to designing molecules: How new AI foundation models can speed up scientific discovery (opens in new tab)
Microsoft Source blog, October 2024 

Introducing Aurora: The first large-scale foundation model of the atmosphere
Microsoft Research blog, June 2024 

GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases 
Microsoft Research blog, January 2024 

AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop 
Microsoft Research Podcast, December 2023 

AI4Science to empower the fifth paradigm of scientific discovery 
Microsoft Research blog, July 2022 

Novartis empowers scientists with AI to speed the discovery and development of breakthrough medicines (opens in new tab) 
Microsoft Source, November 2021 

Bringing together deep bioscience and AI to help patients worldwide: Novartis and Microsoft work to reinvent treatment discovery and development (opens in new tab) 
Official Microsoft Blog, October 2019 

Transcript

[MUSIC] 

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.  

I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.  

[MUSIC FADES] 

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Chris Bishop. Chris is a Microsoft technical fellow and the director of Microsoft Research AI for Science. Chris is with me for one of our two on-site conversations that we’re having here at the conference.  

Chris, welcome to the podcast.


CHRIS BISHOP: Thanks, Eliza. Really great to join you. 

STRICKLAND: How did your long career in machine learning lead you to this focus on AI for Science, and were there any pivotal moments when you started to think that, hey, this deep learning thing, it’s going to change the way scientific discovery happens? 

BISHOP: Oh, that’s such a great question. I think this is like my career coming full circle, really. I started out studying physics at Oxford, and then I did a PhD in quantum field theory. And then I moved into the fusion program. I wanted to do something of practical value, [LAUGHTER] so I worked on nuclear fusion for about seven or eight years doing theoretical physics, and then that was about the time that Geoff Hinton published his backprop paper. And it really caught my imagination as an exciting approach to artificial intelligence that might actually yield some progress. So that was, kind of, 35 years ago, and I moved into the field of machine learning. And, actually, the way I made that transition was by applying neural networks to fusion. I was working at the JET experiment, which was the world’s largest fusion experiment. It was sort of big data in its day. And so I had to, first of all, teach myself to program.  

STRICKLAND: [LAUGHS] Right.  

BISHOP: I was a pencil-and-paper theoretician up to that point. Persuade my boss to buy me a workstation and then started to play with these neural nets. So right from the get-go, I was applying machine learning 35 years ago to data from science experiments. And that was a great on-ramp for me. And then, eventually, I just got so distracted, I decided I wanted to build my career in machine learning. Spent a few years as a research professor and then joined Microsoft 27 years ago, when Microsoft opened its first research lab outside the US in Cambridge, UK, and have been there very happily ever since. Went on to become lab director. But about three or four years ago, I realized that not only was deep learning transforming so many different things, but I felt it was especially relevant to scientific discovery. And so I had an opportunity to pitch to our chief technology officer to go start a new team. And he was very excited by this. So just over two and a half years ago now, we set up Microsoft Research AI for Science, and it’s a global team, and it, sort of, does what it says on the tin. 

STRICKLAND: So you’ve said that AI could usher in a fifth paradigm of scientific discovery, which builds upon the ideas of Turing Award–winner Jim Gray, who described four stages in the evolution of science. Can you briefly explain the four prior paradigms and then tell us about what makes this stage different? 

BISHOP: Yeah, sure. So it was a nice insight by Jim. He said, well, of course, the first paradigm of scientific discovery was really the empirical one. I tend to think of some cave dweller picking up a big rock and a small rock and letting go of them at the same time and thinking the big rock will hit the ground first … 

STRICKLAND: [LAUGHS] Right … 

BISHOP: … discovering they land together. And this is interesting. They’ve discovered a, sort of, pattern irregularity in nature, and even today, the first paradigm is in a sense the prime paradigm. It’s the most important one because at the end of the day, it’s experimental results that determine the truth, if you like. So that’s the first paradigm. And it continues to be of critical importance today. And then the second paradigm really emerged in the 17th century. When Newton discovered the laws of motion and the law of gravity, and not only did he discover the equations but this, sort of, remarkable fact that nature can even be described by equations, right. It’s not obvious that this would be true, but it turns out that, you know, the world around us can be described by very simple equations that you can write on a T-shirt. And so in the 19th century, James Clerk Maxwell discovered some simple equations that describe the whole of electricity and magnetism, electromagnetic waves, and so on. And then very importantly, the beginning of the 20th century, we had this remarkable breakthrough in quantum physics. So again down at the molecular—the atomic—level, the world is described with exquisite precision by Schrödinger’s equation. And so this was the second paradigm, the theoretical. That the world is described with incredible precision of a huge range of length and time by very simple equations.  

But of course, there’s a catch, which is those equations are very hard to solve. And so the third paradigm really began, I guess, sort of, in the ’50s and ’60s, the development of digital computers. And, actually, the very first use of digital computers was to simulate physics, and it’s been at the core of digital computing right up to the present day. And so what you’re doing there is using a computer to go with a numerical algorithm to solve those very simple equations but solve them in a practical setting. And so that’s, I’ll refer to that as simulation. That’s the third paradigm. And that’s proven to be tremendously powerful. If you look up the weather forecast on your phone today, it’s done by numerical weather forecasting, solving in those case Navier-Stokes equations using big numerical simulators. What Jim Gray observed, though, really emerging at the beginning of the 21st century was what he called the fourth paradigm, or data-intensive scientific discovery. So this is the era of big data. Think of particle physics at the CERN accelerator, for example, generating colossal amounts of data in real time. And that data can then be processed and filtered. We can do statistics on it. But of course, we can do machine learning on that data. And so machine learning feeds off large data. And so the fourth paradigm really is dominated today by machine learning. And again that remains tremendously important.  

What I noticed, though, is that there’s again another framework. We call it the fifth paradigm. Again, it goes back to those fundamental equations. But again, it’s driven by computation, and it’s the idea that we can train machine learning systems not using the empirical data of the fourth paradigm but instead using the results of simulation. So the output of the third paradigm. So think of it this way. You want to predict the property of some molecule, let’s say. You could in principle solve Schrödinger’s equation on a digital computer; it’d be very expensive. And let’s say you want to screen hundreds of millions of molecules. That’s going to get far too costly. So instead, what you can do is have a mindset shift. You can think of that simulator not as a tool to predict the molecule’s properties directly but instead as a way of generating synthetic training data. And then you use that training data to train a deep learning system to give what I like to call an emulator, an emulator of the simulator. Once it’s trained, that emulator is fast. It’s usually three to four orders of magnitude faster than the simulator. So if you’re going to do something over and over again, that three-to-four-order-of-magnitude acceleration is tremendously disruptive. And what’s really interesting is we see that fifth paradigm occur in many, many different places. The idea goes back a long way. The, actually, the last project that I worked on before I left the fusion program was to do what was the world’s first-ever real-time control of a tokamak fusion plasma using a neural net and the computers of the day. But the processors were just far too slow, long before GPUs, and so on. And so it wasn’t possible to solve the equations. In that case, it was called the Grad-Shafranov equation. Again, a simple differential equation you could write on a T-shirt, but solving it was expensive on a computer. We were about a million times too slow to solve it directly in real time. And so instead, we generated lots and lots of solutions. We used those solutions to train a very simple neural network, not a deep network, just a simple two-layer network back in the day, and then we implemented that in special hardware and did real-time feedback control. So that was an example of the fifth paradigm from, you know, a quarter of a century ago. But of course, deep learning just tremendously expands the range of applicability. So today we’re using the fifth paradigm in many, many different scenarios. And time and time again, we see these four-orders-of-magnitude acceleration. So I think it’s worthy of thinking of that as a new paradigm because it’s so pervasive and so ubiquitous. 
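As a toy illustration of the emulator idea described above (entirely hypothetical: the “simulator” here is just a cheap stand-in function, and scikit-learn’s small neural network stands in for a deep learning emulator), the sketch below generates training pairs by running the simulator, fits the network once offline, and then reuses the fast surrogate for repeated predictions.

```python
# Toy sketch of the "fifth paradigm": use an expensive simulator only to produce
# training data, then train a fast neural-network emulator on that data.
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_simulator(x):
    # Stand-in for numerically solving an equation; here just a smooth
    # nonlinear function of the two input parameters.
    return np.sin(3 * x[:, 0]) * np.exp(-x[:, 1] ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(5000, 2))     # sampled simulator inputs
y_train = expensive_simulator(X_train)           # synthetic training data from the simulator

emulator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
emulator.fit(X_train, y_train)                   # train the emulator once, offline

X_new = rng.uniform(-1, 1, size=(5, 2))
print(emulator.predict(X_new))                   # fast surrogate predictions
print(expensive_simulator(X_new))                # ground truth for comparison
```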

STRICKLAND: So how do you identify fields of science and particular problems that are amenable to this kind of AI assistance? Is it all about availability of data or the need for that kind of speed up? 

BISHOP: So there are lots of factors that go into this. And when I think about AI for Science actually, the space of opportunity is colossal because science is, science is really just understanding more about the world around us. And so the range of possibilities is daunting really. So in choosing what to work on, I think there are several factors. Yes, of course, data is important, but very interestingly, we can use experimental data or we can generate synthetic data by running simulators. So we’re a big fan of the fifth paradigm. But I think another factor—and this is particularly at Microsoft—is thinking about, how can we have real-world impact at scale? Because that’s our job, is to make the world a better place and to do so at a planetary scale. And so we’ve settled on, for the most part, working at the molecular level. So if you think about the number of different ways of combining atoms together to make new stable configurations of atoms, it’s gargantuan. I mean, the number of just small molecules, small organic molecules, that are potential drug candidates is about 10^60. It’s about the same as the number of atoms in the solar system. The number of proteins, maybe the fourth power of the number of atoms in the universe, or something crazy. So you’ve got this gargantuan space to search, and within that space, for sure, there’ll be all sorts of interesting molecules, materials, new drugs, new therapies, new materials for carbon capture, new kinds of batteries, new photovoltaics. The list is endless because everything around us is made of atoms, including our own bodies. So the potential just in the molecular space is gargantuan. And so that’s why we focus there. 

STRICKLAND: It’s a big focus. [LAUGHTER] 

BISHOP: It’s a broad focus, still, yes. 

STRICKLAND: So let’s take one of these case studies then. In a project on drug discovery, you worked with the Global Health Drug Discovery Institute on molecules that would interact with tuberculosis and coronaviruses, I think. And you found, I think, candidate molecules in five months instead of several years. Can you talk about what models you used in this work and how they helped you get this vastly sped up process? 

BISHOP: Sure. Yes. We’re very proud of this project. We’re working with the Gates Foundation and the Global Health Drug Discovery Institute to look at particularly diseases that affect low-income countries like tuberculosis. And in terms of the models we use, I think we’re all familiar with a large language model. We train it on a sequence of words or sequence of word tokens, and it’s trained to predict the next token. We can do a similar thing, but instead of learning the language of humans, we can learn the language of nature. So in particular, what we’re looking for here is a small organic molecule that we could synthesize in a laboratory that will bind with a particular target protein. It’s called ClpP. And by interfering with that protein, we can arrest the process of tuberculosis. So the goal is to search that space of 10^60 molecules and find a new one that has the right properties. Now, the way we do this is to train something that’s essentially a transformer. So it looks like a language model, but the language it’s trained on is a thing called SMILES strings. It’s an idea that’s been around in chemistry for a long time. It’s just a way of taking a three-dimensional molecule and representing it as a one-dimensional sequence of characters. So this is perfect for feeding into a language model. So we take a transformer and we train it on a large database of small organic molecules that are, sort of, typical of the kinds of things you might see in the space of drug molecules. Once that’s been trained, we can now run it generatively. And it will output new molecules. Now, we don’t just want to generate molecules at random because that doesn’t help. We want to generate molecules that bind to this particular binding site on this particular protein. So the next step is we have to tell the model about the protein and the protein binding site. And we do that by giving it information about not actually—well, we do tell it about the whole protein, but we especially give it information about the three-dimensional geometry of the binding site. So we tell about the locations of the atoms that are in the binding site. And we do this in a way that satisfies certain physics constraints, sort of, equivariance properties, it’s called. So if you think about a molecule, if I rotate the molecule in space, the positions of all the atoms change in a complicated way. But it’s the same molecule; it has the same energy and other properties and so on. So we need the right kind of representation. That’s then fed into this transformer using a technique called cross-attention. So internally, the transformer uses self-attention to look at the history of tokens, but it can now use cross-attention to look at another model that understands the proteins. But even that’s not enough. Because in discovering drugs and exploring this gargantuan space and looking for these needles in a haystack, what typically happens [is] you find a hit, a molecule that binds, but now you want to optimize it. You want to make lots of small variations of that molecule in order to make it better and better at binding. So the third piece of the architecture is another module, a thing called a variational autoencoder, that again uses deep learning. But this time, it can take as input an organic molecule that is already known, a hit that’s already known to bind to the site, and that again is fed in through cross-attention. 
And now the SMILES autoregressive model can now generate a molecule that’s an improvement on the starting molecule and knows about the protein binding. And so what we do is, we start off with the state-of-the-art molecule. And the best example we found is one that’s more than two orders of magnitude stronger binding affinity to the binding pocket, which is a tremendous advance; it’s the state of the art in addressing tuberculosis. And of course, the exciting thing is that this is tested in the laboratory. So this is not just a computer experiment in some sort of benchmark or whatever. We sent a description of the molecule to the laboratories at GHDDI. They synthesized a molecule, characterized it, measured its binding property, and said, well, hey, this is a new state of the art for this target protein. So we’re continuing to work with them to further refine this. There are obviously quite a few more steps. If you know about the drug discovery process, there’s a lot of hurdles you have to get through, including, of course, very important clinical trials, before you have something that can actually be used in humans. But we’re already hugely excited about the fact that we were able to make such a big advance so quickly, in such a short amount of time, compared to the usual drug discovery process. 
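For readers unfamiliar with SMILES, the short example below shows how a molecule becomes a one-dimensional character sequence that a transformer can consume. It uses the open-source RDKit toolkit and aspirin as an arbitrary example molecule, and it illustrates only the representation, not the generative model described above.

```python
# SMILES turns a molecule into a linear string, which is what makes it a natural
# "language" for an autoregressive transformer. Requires the open-source RDKit
# package; aspirin is used here only as an example molecule.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, written as a SMILES string
mol = Chem.MolFromSmiles(smiles)    # parse the 1-D string into a molecule object

print(Chem.MolToSmiles(mol))        # canonical SMILES for the same molecule
print(mol.GetNumAtoms())            # heavy-atom count recovered from the string
print(list(smiles))                 # the character sequence a language model would tokenize
```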

STRICKLAND: And while you were looking for that molecule that had the proper characteristics, were you also determining whether it could be manufactured easily, like trying to think about practical realities of bringing this thing out of the computer and into the lab? 

BISHOP: Great question. I mean, you’re hinting there at the fact the discovery process, of course, is a long pipeline. You start with the protein. You have to find a molecule that binds. You then refine the molecule. Now you have to look at ADMET, you know, the absorption, metabolism, and excretion and so on of the molecule. Also make sure that it’s not toxic. But then you need to be able to synthesize it. It’s no good if nobody can make this molecule. So you have to look at that. So, actually, in the AI for Science team, we look at all of these aspects of that drug discovery process. And we find particular areas, especially where there’s, sort of, low-hanging fruit where we can see that deep learning can make a big impact. It doesn’t necessarily help much to take a very easy, fast piece of the pipeline and go work on that. You want to understand, what are the bottlenecks, and can we really unlock those with deep learning? So we’re very interested in that whole process. It’s a fascinating problem. You’ve got a gargantuan search space, and yet you have so many different constraints that need to be met. And deep learning just feels like the perfect tool to go after this problem. 

STRICKLAND: When you talk to the scientists that you collaborate with, is AI changing the kinds of questions that they are able to ask? That they want to ask? 

BISHOP: Oh, for sure. And it’s really empowering. It’s enabling those working in the drug discovery space to, I think, to think in a much more expansive way. If you think about just the kind of acceleration that I talked about from the fifth paradigm, if you go to four-order-of-magnitude acceleration, OK, it may not sound like much of a dent onto the 1060 space, but now when you’re exploring variants of molecules and so on, the ability to explore that space orders of magnitude faster allows you to think much more creatively, allows you to think in a more expansive way about how much of that space you can explore and how efficiently you can explore it. So I think it really is opening up new horizons, and certainly, we have an exciting partnership with Novartis. We’ve been working with them for the last five years, and they’ve been deploying some of our techniques and models in practice for their drug discovery pipeline. We get a lot of great feedback from them about how exciting they’re finding these techniques to use in practice because it is changing the way they go about doing the drug discovery process. 

STRICKLAND: To jump to one other case study, we don’t have to go into great detail on it, but I’m very curious about your Project Aurora, this foundation model for state-of-the-art weather forecasting that, I believe, is 5,000 times faster than traditional physics-based methods. Can you talk a little bit about how that project is evolving, how you imagine these AI forecasting models working with traditional forecasting models, perhaps, or replacing them? 

BISHOP: Yes. So I said most of what we do is down at the molecular level. So this is one of the exceptions. So this is really at the global level, the planetary level. Again, it’s a beautiful example of the fifth paradigm because the way forecasting has been done for a number of decades now and the way most forecasting is done at the moment is through what’s called numerical weather prediction. So again, you have these simple equations. It’s no longer Schrödinger’s equation of atomic physics. It’s now Navier–Stokes equations of fluid flows and a whole bunch of other equations that describe moisture in the atmosphere and the weather and so on. And those equations are solved on a supercomputer. And again, we can think of that numerical simulator now not just as the way you’re going to do the forecasting but actually as the way to generate training data for a deep learning emulator. So several groups have been exploring this over the last couple of years. And again, we see this very robust three-to-four-order-of-magnitude acceleration. But what’s really interesting about Aurora, it’s the world’s first foundation model, so instead of just building an emulator of a particular numerical weather simulator, which is already very interesting, we trained Aurora on a much more diverse set of data and really trying to force it not just to emulate a particular simulator but really, as it were, understand or model the fundamental equations of fluid flows in the Earth’s atmosphere. And then the reason we want to do this is because we now want to take that foundation model and fine-tune it to other downstream applications where there’s much less data. So one example would be pollution flow. So obviously the flow of pollution around the atmosphere is extremely important. But the data is far more sparse. There are far fewer sensors for pollution than there are for, sort of, wind and rain and temperature and so on. And so we were able to achieve state-of-the-art performance in modeling the flow of pollution by leveraging huge data and building this foundation model and then using relatively little data, our pollution monitoring, to build that downstream fine-tuned model. So beautiful example of a foundation model. 

STRICKLAND: That is a cool example. And finally, just to wrap up, what have you seen or heard at NeurIPS that’s gotten you excited? What kind of trends are in the air? What’s the buzz? 

BISHOP: Oh, that’s a great question. I mean, it’s such a huge conference. There’s something like 17,000 people or so here this year, I’ve heard. I think, you know, one of the things that’s happened so far that’s actually given me an enormous amount of energy wasn’t just a technical talk. It was actually an event we had on the first day called Women in Machine Learning. And I was a mentor on one of the mentorship tables, and I found it very energizing just to meet so many people, early-career-stage people, who were very excited about AI for Science and realizing that, you know, it’s not just that I think AI for Science is important. A lot of people are moving into this field now. It is a big frontier for AI. I’m a little biased, perhaps. I think that it’s the most important application area. Intellectually, it’s very exciting because we get to deal with science as well as machine learning. But also if you think about [it], science is really about learning more about the world. And once we learn more about the world, we can then develop aquaculture; we can develop the steam engine; we can develop silicon chips; we can change the world. We can save lives and make the world a better place. And so I think it’s the most fundamental undertaking we have in AI for Science and the thing I loved about the Women in Machine Learning event is that the AI for Science table was just completely swamped with all of these people at early stages of their career, either already working in this field and doing PhDs or wanting to get into it. That was very exciting. 

STRICKLAND: That is really exciting and inspiring, and it gives me a lot of hope. Well, Chris Bishop, thank you so much for joining us today and thanks for a great conversation. 

BISHOP: Thank you. I really appreciate it. 

[MUSIC] 

STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at microsoft.com/research. Until next time.  

[MUSIC FADES]

The post NeurIPS 2024: AI for Science with Chris Bishop appeared first on Microsoft Research.

Read More

Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang

Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang

Illustrated image of Jindong Wang and Steven Euijong Whang

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang, a tenured associate professor at Korea Advanced Institute of Science and Technology (KAIST), join host Gretchen Huizinga to discuss the paper “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” a spotlight session at this year’s Conference on Neural Information Processing Systems (NeurIPS). ERBench leverages the integrity constraints of relational databases to create LLM benchmarks that can verify model rationale via keywords as well as check for answer correctness.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are coauthors of a paper called “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” and this paper is a spotlight at this year’s conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC, this week. Jindong and Steven, thanks for joining us on Abstracts!


JINDONG WANG: Thank you. Nice to be here.

STEVEN EUIJONG WHANG: It’s great to be here.

HUIZINGA: So, Jindong, I’ll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.

JINDONG WANG: OK, everybody knows that with the widespread usage of large language models, hallucination has become a crucial factor of concern. Hallucination occurs when models generate false or nonexistent information. In particular, factual hallucination greatly undermines the reliability of the large language models. To correctly evaluate the hallucination, evaluating the model’s rationale is also important. Up to date, when the paper, you know, was submitted, there were no works dealing with automatic rationale evaluation systematically because, you know, most of them focused on manual evaluation or just using GPT-judge. ERBench is the first one to generate a large language model evaluation benchmark utilizing relational databases. Relational databases are based on the relational data model assuming a fixed schema. The fixed schema enables relational databases to have data integrity that are based on database design theories, so that integrity constraints in relational databases allows better evaluation of the large language models. Functional dependencies allow automatic rationale evaluation using the functional dependency inferred keywords, and foreign key constraints also allow for easy generation of the multi-hop questions, which are usually very complicated to generate with other techniques. So that’s basically what we want to do. So in one sentence, we try to build an automatic evaluation benchmark for evaluation of the hallucination.

HUIZINGA: Steven, give us a quick overview of your research methodology and findings. How did you conduct your research, and what were your major takeaways?

STEVEN EUIJONG WHANG: Sure. So this was a collaboration between our group at KAIST, and Dr. Xing Xie’s group at MSRA (Microsoft Research Asia). KAIST is Korea Advanced Institute of Science and Technology. So we had the privilege to closely work with our LLM expert, Dr. Jindong Wang, here. We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for using Azure quota for our experiments. So we had some biweekly meetings for maybe over a year, and at some point, we figured that relational databases could be really important for LLM evaluation. I personally have a background in databases, which I studied at Stanford University as a PhD student. So relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers. So the first ingredient is functional dependencies. So these are constraints where, given a few attributes, you can determine another attribute. So I’ll just give an example because I think that helps the understanding. So suppose that you have, like, a movie table, and in a movie, you have the title of the movie, the year of production, and the director of the movie, and the length of the movie, and so on and so forth. So if you know the title and year of the movie, that pretty much identifies the movie, and you can actually determine the director of the movie, as well. So, for example, if you know that there’s a movie called Star Wars, which is a very popular movie produced in 1977, that determines the director. We know it’s George Lucas, right. So, basically, it’s like a function. It receives the Star Wars 1977 and determines, gives the output, George Lucas. So that’s the first ingredient. Now, the reason this is important is that we can use these functional dependencies to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values. For example, we may ask the LLM, is there a director of a movie called Star Wars produced in 1977? And the LLM can say yes. And it is the right answer, but we’d like to know if the LLM is knowing what it’s saying, right. And so we look at the rationale. That’s why looking at the rationale is important. We just can’t say it’s doing the correct thing. So if the LLM mentions George Lucas, bingo, that’s a great answer. However, if the LLM mentions some other director, like Steven Spielberg, that’s not a correct rationale. So that’s exactly what we’re trying to evaluate. Functional dependency is key to being able to do that kind of verification.

The second ingredient is foreign key constraints. So foreign key constraint is where one of the attributes in one table can intuitively link to another attribute of another table. So in our movie table, we had the director attribute. Now we may also have a separate table called the director table, and maybe we might have some more information about the director in that table, like the director name, the director’s age, all sorts of information about the director. So foreign key constraint basically requires that if there is some director mentioned in the movie table, it has to be one of the directors in the director table. So this basically links a table to another table. It’s very useful. So using this, what we can do is we can join the two tables, right. So now we can join the movie and director table and generate a bigger table. The reason this is useful is that we can also chain together functional dependencies that I just mentioned into longer functional dependencies. So what this enables is us to construct more complex questions, arbitrarily, that are multi-hop. So using these integrity constraints, we can basically convert any relational database into an LLM benchmark, and this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various prompt engineering techniques.
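A minimal sketch of how these two ingredients can be used mechanically is shown below. The toy tables, question templates, and helper names are hypothetical stand-ins rather than the ERBench code: the functional dependency supplies the keywords a correct rationale must mention, and the foreign-key join chains dependencies into a multi-hop question.

```python
# Toy sketch of ERBench's two ingredients: functional dependencies for rationale
# keywords and foreign keys for multi-hop questions. Tables and templates are
# hypothetical stand-ins, not the actual benchmark data or code.

movies = {("Star Wars", 1977): {"director": "George Lucas"}}   # FD: (title, year) -> director
directors = {"George Lucas": {"birth_year": 1944}}             # FK: movie.director -> directors table

def single_hop_question(title, year):
    question = f"Is there a director of the movie {title} produced in {year}?"
    keywords = [movies[(title, year)]["director"]]             # the rationale must mention this
    return question, keywords

def multi_hop_question(title, year):
    director = movies[(title, year)]["director"]               # hop 1 via the functional dependency
    birth_year = directors[director]["birth_year"]             # hop 2 via the foreign-key join
    question = f"Was the director of {title} ({year}) born in {birth_year}?"
    return question, [director, str(birth_year)]               # keywords to verify in the rationale

def verify(rationale, keywords):
    # Answer correctness alone is not enough: check the model's stated reasoning
    # for the keywords implied by the integrity constraints.
    return all(k.lower() in rationale.lower() for k in keywords)

q, kws = multi_hop_question("Star Wars", 1977)
print(q)
print(verify("Star Wars (1977) was directed by George Lucas, who was born in 1944.", kws))  # True
```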

HUIZINGA: Well, I would ask you to, kind of, drill in on what you found in how ERBench compares to other benchmark tests.

STEVEN EUIJONG WHANG: So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single, multi-hop, and multimodal questions and also performed prompt engineering and fine-tuning. And what we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs, like Gemini, tend to be a bit more conservative and do not answer as many questions but instead hallucinate less as a result. So the key conclusion is that no LLM, like, totally subsumes the other in all aspects, which is the reason why we use multiple measures. And the key message we want to make is that overall, ERBench is effective in evaluating any LLM’s thought process by pinpointing critical keywords within the rationale.

HUIZINGA: Well, Jindong, back to you. Research settings are one thing, but tell us how your work is significant in real-world settings, and who does this impact most and how?

JINDONG WANG: Relational databases, you know, they are everywhere across various domains. Anyone can easily get access from Google or from Kaggle or even create them targeting the domain or subject that one wants to test the model on. So taking into account that ERBench is the first work to utilize the relational database for generating large language model hallucination benchmarks … so this work will lead a new research direction of integrating database design theories and techniques, a long-studied field—you know, database is very traditional, old, and classic, but, you know, they’re still operating right now—into the large language model field, a recently emerging area.

HUIZINGA: Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field. What do you propose to do about those, and what’s next on your research agenda?

STEVEN EUIJONG WHANG: Sure, so the big picture is that we basically proposed the first work to properly evaluate the rationale of LLMs, right. This is very important because LLMs are being used in our everyday lives, and everyone has the question, is the LLM suitable for my task? Can I benefit from the LLM? So it’s very important to verify if the LLM knows what it’s saying. So I just mentioned that we use functional dependencies to pinpoint critical keywords in the rationale. And we believe that’s just the first step. It’s very effective, by the way. So you may have the question, is it enough to just look at, like, the George Lucas within the long rationale? And it turns out 95% of the cases, it is actually effective, so we did human studies and also used GPT-judge to verify that. But these are factual questions and there could be various other questions that require long answers, right. Long rationales. And so the important question is, can we also verify all the rest of the rationales, the complicated rationales, as well? And so in order to properly do that, we need a lot of technology. So first we need to understand the rationales using NLP techniques, and we need to know if it’s properly answering the question, and so on and so forth. And so we believe that there’s a lot of opportunity to expand from that. So we basically, you know, proposed an initial work towards this direction, but we believe that there are many more interesting challenges that remain.

HUIZINGA: Well, Jindong Wang and Steven Whang, thanks for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts.

[MUSIC]

You can also find it on arXiv and on the NeurIPS website. And if you’re at the NeurIPS conference this week, go to the poster session and talk to the authors! See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang appeared first on Microsoft Research.

Read More

Abstracts: NeurIPS 2024 with Weizhu Chen

Abstracts: NeurIPS 2024 with Weizhu Chen

Illustrated image of Weizhu Chen.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Weizhu Chen, vice president of Microsoft GenAI, joins host Amber Tingle to discuss the paper “Not All Tokens Are What You Need for Pretraining,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). Based on an examination of model training at the token level, Chen and his coauthors present an alternate approach to model pretraining: instead of training language models to predict all tokens, they make a distinction between useful and “noisy” tokens. Doing so, the work shows, improves token efficiency and model performance.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

Our guest today is Weizhu Chen. He is vice president of Microsoft GenAI and coauthor of a paper called “Not All Tokens Are What You Need for Pretraining.” This paper is an oral presentation during the 38th annual Conference on Neural Information Processing Systems, also known as NeurIPS, which is happening this week in Vancouver. Weizhu, thank you for joining us today on Abstracts. 


WEIZHU CHEN: Thank you for having me, Amber. 

TINGLE: So let’s start with a brief overview of your paper. In a couple sentences, tell us about the problem your research addresses and, more importantly, why the research community and beyond should know about this work. 

CHEN: So my team basically in Microsoft GenAI, we are working on model training. So one of the things actually we do in the pretraining, we realize the importance of the data. And we found that actually when we do this kind of data for each of the tokens, some token is more important than the other. That’s one. The other one actually is some token actually is very, very hard to be predicted during the pretraining. So, for example, just like if someone see the text of “Weizhu,” and what’s the next token? It can be “Chen”; it can be any of the last name. So it’s very hard to be predicted. And if we try to enforce a language model to focus on this, kind of, the hard-to-predict token, just like actually it’s going to confuse the language model. There are so many different kinds of the example like this. Just like, for example, the serial number in your UPS. So the focus of this paper is try to identify which token actually is more important for the language model to learn. And actually the other token maybe is just the noise. And how can we try to discriminate the token—which is good token, which is noise token. Basically, you try to understand this kind of dynamic of the tokens. 

TINGLE: How did you conduct this research? 

CHEN: Actually we do a lot of work in the model training, including the pretraining and the post-training. So for the pretraining side, actually the most important thing to us is the data. We also try to understand, how can we leverage the existing data, and how can we create much more data, as well? And data basically is one of the most important thing to build a better foundation model. So we try to understand how much more we can get from the data. And the important thing for the data is about data filtering. So you think about actually in the previous literature, we do the data filtering, for example, just like we build a classifier to classify, OK, this page is more important than the other. And this page actually is a noise because there’s so much noise data in the web. So we just keep the best data to get into the pretraining corpus. And further away, we think about, OK, yeah, so this is … maybe it’s not fine grain enough, so can we try to understand even for the same page we want to keep? So some token is more important than the other. Maybe some token just some noise token. Actually you put this data into the pretraining, it’s going to hurt the model quality. So there is the motivation actually we try to think about.

TINGLE: And what were your major findings? 

CHEN: Our major finding is about basically, definitely this works so well. And it’s so important that actually we are able to get the best token from the corpus and then make it available and try to ask the model during the pretraining to ignore the token we don’t want to get into the model itself. So that is one. The second thing definitely data is the other very important thing. If you’re able to figure out the better way to build a better data is most likely you’re able to build a much better foundation model. The third thing actually is also connected to a lot of other existing work, just like data synthesis, just like distillation, just like data filtering, and so a lot of things are really connected together. And actually, this work, basically, you can associate with also a lot of other work we are working on, just like distillation. You can think about, for example, for this work, we also try to build a model, a reference model—we call as the reference model—to try to identify actually this data, this token, is more important than the other and try to understand the discrepancy between the reference model and the running model, their prediction on each tokens. So you can think about also it’s some kind of the try to distill from the reference model to the existing model, as well. 
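A rough sketch of the token-selection idea as described in this conversation appears below. It is a simplified illustration rather than the paper’s exact recipe: per-token losses from a reference model and the training model are compared, and only the tokens the training model can still usefully learn from are kept in the objective.

```python
# Simplified sketch of selective token training: compare per-token losses from a
# reference model and the model being trained, and keep only the most useful
# tokens. Illustrative only; not the paper's exact procedure.
import torch
import torch.nn.functional as F

def selected_token_loss(train_logits, ref_logits, targets, keep_ratio=0.6):
    # Shapes: logits are (batch, seq, vocab), targets are (batch, seq).
    train_loss = F.cross_entropy(train_logits.transpose(1, 2), targets, reduction="none")
    ref_loss = F.cross_entropy(ref_logits.transpose(1, 2), targets, reduction="none")

    excess = train_loss - ref_loss                 # high where the training model lags the reference
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = excess.flatten().topk(k).values.min()
    mask = (excess >= threshold).float()           # drop low-value / "noisy" tokens from the objective

    return (train_loss * mask).sum() / mask.sum()  # average loss over selected tokens only

# Tiny synthetic example with random logits, just to show the shapes involved.
batch, seq, vocab = 2, 8, 100
targets = torch.randint(vocab, (batch, seq))
loss = selected_token_loss(torch.randn(batch, seq, vocab), torch.randn(batch, seq, vocab), targets)
print(loss)
```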

TINGLE: Let’s talk a little bit about real-world impact. Who benefits most from this work? And how significant is this within your discipline and even downstream for people using applications? 

CHEN: This actually is very, very fundamental work because just like I share a little bit before, actually we build the data and this data is—build the data much better—is able to build a much better foundation model. If we’re able to build a better model actually is able to benefit so many different kinds of application. This also is going to help us to build a much better small language model. And we can also serve this model even in the edge side, in the client side, in the coding scenario. So we are going to see actually huge impact from this kind of the foundation model if you are able to benefit from building much better training data. 

TINGLE: Are there any unanswered questions or unsolved problems in this area? What’s next on your research agenda? 

CHEN: Yeah, I think that is a very good questions. And definitely there’s a lot of things about how to build a better data [that] is unsolved yet in the literature. And especially because when you do the pretraining, the most important part is the data, but the data is very limited. And how can we make better use from the existing limited data is a big challenge. Because we can increase the model by 10x, but it’s super hard to increase the data by 10x, especially when we want to deal with the high quality of data. The other way, even given the data, how can you identify, especially for this work, the importance of each token to build a much better model? I think all these things are very connected together. To me, actually, data is the oxygen. So there are still so many things we are able to do in the data, including building for even the small language model or the large model. 

TINGLE: Data is oxygen—I love that! So other than that being a key takeaway, is there any other one thing that you’d like our listeners to walk away from this conversation knowing? 

CHEN: I would love to say actually focus more on this kind of data and focus more about how can I get more from the data actually; it is the very important thing. And the other thing actually, we are working on something that’s very exciting. You can feel free to come to join us if you are very interested in this area. 

[MUSIC] 

TINGLE: Well, Weizhu Chen, thank you for joining us today. We really appreciate it. 

CHEN: Thank you. Thank you for having me. 

TINGLE: And thanks to our listeners for tuning in. If you’d like to read the full paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts! 

[MUSIC FADES] 

The post Abstracts: NeurIPS 2024 with Weizhu Chen appeared first on Microsoft Research.

Read More

Abstracts: NeurIPS 2024 with Dylan Foster

Abstracts: NeurIPS 2024 with Dylan Foster

Illustrated image of Dylan Foster for the Abstracts series on the Microsoft Research Podcast.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Principal Researcher Dylan Foster joins host Amber Tingle to discuss the paper “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). In the paper, Foster and his coauthors explore whether well-studied RL algorithms for simple problems can be leveraged to solve RL problems with high-dimensional observations and latent dynamics, part of larger efforts to identify algorithm design principles that can enable agents to learn quickly via trial and error in unfamiliar environments.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Our guest today is Dylan Foster. He is a principal researcher at Microsoft Research and coauthor of a paper called “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity.” The work is among the oral presentations at this year’s Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver. Dylan, welcome and thank you for joining us on the podcast!


DYLAN FOSTER: Thanks for having me.

TINGLE: Let’s start with a brief overview of this paper. Tell us about the problem this work addresses and why the research community should know about it.

FOSTER: So this is a, kind of, a theoretical work on reinforcement learning, or RL. When I say reinforcement learning, broadly speaking, this is talking about the question of how can we design AI agents that are capable of, like, interacting with unknown environments and learning how to solve problems through trial and error. So this is part of some broader agenda we’ve been doing on, kind of, theoretical foundations of RL. And the key questions we’re looking at here are what are called, like, exploration and sample efficiency. So this just means we’re trying to understand, like, what are the algorithm design principles that can allow you to explore an unknown environment and learn as quickly as possible? What we’re doing in this paper is we’re, kind of, looking at, how can you most efficiently solve reinforcement learning problems where you’re faced with very high-dimensional observations, but the underlying dynamics of the system you’re interacting with are simple? So this is a setting that occurs in a lot of natural reinforcement learning and control problems, especially in the context of, like, say, embodied decision-making. So if you think about, say, games like Pong, you know, the state of the game, like, the state of, like, Pong, is extremely simple. It’s just, you know, what is the position and velocity of the ball, and, like, where are the paddles? But what we’d like to be able to do is learn to, you know, like, control or, like, solve games like this from raw pixels or, like, images kind of in the same way that a human would, like, just solve them from vision. So if you look at these types of problems, you know, we call these, like, RL with rich observations or RL with latent dynamics. You know, these are interesting because they, kind of, require you to explore the system, but they also require, you know, representation learning. Like, you want to be able to use neural nets to learn a mapping from, say, the images you see to the latent state of the system. This is a pretty interesting and nontrivial algorithmic problem. And, kind of, what we do in this work is we take a first step towards something like a unified understanding for how to solve these sorts of, like, rich-observation, or latent dynamics, RL problems.
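As a concrete, entirely hypothetical illustration of the problem class: the toy environment below has a trivial latent state (a position on a line), but an agent only ever receives a high-dimensional, noisy rendering of it, so it must implicitly learn the mapping from observations back to latent state while exploring. This is intuition for the setting only, not the paper’s algorithms or results.

```python
# Toy illustration of "RL with latent dynamics": a simple latent state that the
# agent never sees directly -- it only gets a high-dimensional, noisy observation.
# Hypothetical example for intuition, not the setting studied in the paper.
import numpy as np

class NoisyLineWorld:
    def __init__(self, length=10, obs_dim=256, seed=0):
        self.length, self.rng = length, np.random.default_rng(seed)
        # Fixed random "rendering" of each latent state into a high-dimensional observation.
        self.render = self.rng.normal(size=(length, obs_dim))
        self.state = 0

    def reset(self):
        self.state = 0
        return self._obs()

    def step(self, action):                        # action in {-1, +1}
        self.state = int(np.clip(self.state + action, 0, self.length - 1))
        reward = 1.0 if self.state == self.length - 1 else 0.0
        return self._obs(), reward

    def _obs(self):
        # The agent sees only this noisy, high-dimensional vector, never self.state.
        return self.render[self.state] + 0.1 * self.rng.normal(size=self.render.shape[1])

env = NoisyLineWorld()
obs = env.reset()
for _ in range(9):
    obs, reward = env.step(+1)                     # a lucky policy that always moves right
print(obs.shape, reward)                           # (256,) 1.0
```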

TINGLE: So how did you go about developing this theoretical framework?

FOSTER: Yeah, so if you look at these sort of RL problems with latent dynamics, this is something that’s actually received a lot of investigation in theory. And a lot of this goes back to, kind of, early work from our lab from, like, 2016, 2017 or so. There’s some really interesting results here, but progress was largely on a, like, case-by-case basis, meaning, you know, there are many different ways that you can try to model the latent dynamics of your problem, and, you know, each of these somehow leads to a different algorithm, right. So, like, you know, you think very hard about this modeling assumption. You think about, what would an optimal algorithm look like? And you end up, you know, writing an entire paper about it. And there’s nothing wrong with that per se, but if you want to be able to iterate quickly and, kind of, try different modeling assumptions and see what works in practice, you know, this is not really tenable. It’s just too slow. And so the starting point for this work was to, kind of, try to take a different and more modular approach. So the idea is, you know, there are many, many different types of, sort of, systems or modeling assumptions for the dynamics that have been already studied extensively and have entire papers about them for the simpler setting in which you can directly see the state of the system. And so what we wanted to ask here is, is it possible to use these existing results in more of, like, a modular fashion? Like, if someone has already written a paper on how to optimally solve a particular type of MDP, or Markov decision process, can we just take their algorithm as is and perhaps plug it into some kind of meta-algorithm that can directly, kind of, combine this with representation learning and use it to solve the corresponding rich-observation, or latent dynamics, RL problem?

TINGLE: What were your major findings? What did you learn during this process?

FOSTER: We started by asking the question sort of exactly the way that I just posed it, right. Like, can we take existing algorithms and use them to solve rich-observation RL problems in a modular fashion? And this turned out to be really tricky. Like, there’s a lot of natural algorithms you might try that seem promising at first but don’t exactly work out. And what this, kind of, led us to and, sort of, the first main result in this paper is actually a negative result. So what we actually showed is most, sort of, well-studied types of systems or, like, MDPs that have been studied in, like, the prior literature on RL, even if they’re tractable when you’re able to directly see the state of the system, they can become statistically intractable once you add, sort of, high-dimensional observations to the picture. And statistically tractable here means the amount of interaction that you need, like the amount of, sort of, attempts to explore the system that you need, in order to learn a good decision-making policy becomes, like, very, very large, like much, much larger than the corresponding, sort of, complexity if you were able to directly see the states of the system. You know, you could look at this and say, I guess we’re out of luck. You know, maybe there’s just no hope of solving these sorts of problems. But that’s perhaps a little too pessimistic. You know, really the way you should interpret this result is just that you need more assumptions. And that’s precisely what the, sort of, second result we have in this paper is. So our second result shows that you can, sort of, bypass this impossibility result and, you know, achieve truly modular algorithms under a couple different types of additional assumptions.

TINGLE: Dylan, I’d like to know—and I’m sure our audience would, too—what this work means when it comes to real-world application. What impact will this have on the research community?

FOSTER: Yeah, so maybe I’ll answer that, um, with two different points. The first one is a broader point, which is, why is it important to understand this problem of exploration and sample efficiency in reinforcement learning? If you look at the, sort of, setting we study in this paper—you know, this, like, RL or decision-making with high-dimensional observations—on the empirical side, people have made a huge amount of progress on this problem through deep reinforcement learning. This was what kind of led to these amazing breakthroughs in solving games like Atari in the last decade. But if you look at these results, the gains are somehow more coming from the, like, inductive bias or the, like, generalization abilities of deep learning and not necessarily from the specific algorithms. So, like, current algorithms do not actually explore very deliberately, and so their sample efficiency is very high. Like, it’s hard to draw a one-to-one comparison, but you can argue they need, like, far more experience than a human would to solve these sorts of problems. So it’s not clear that we’re really anywhere near the ceiling of what can be achieved in terms of, like, how efficiently can you have, you know, an agent learn to solve new problems from trial and error. And I think better algorithms here could potentially be, like, transformative in a lot of different domains. To get into this specific work, I think there’s a couple of important takeaways for researchers. One is that by giving this impossibility result that shows that RL with latent dynamics is impossible without further assumptions, we’re kind of narrowing down the search space where other researchers can look for efficient algorithms. The second takeaway is, you know, we are showing that this problem becomes tractable when you make additional assumptions. But I view these more as, like, a proof of concept. Like, we’re kind of, showing for the first time that it is possible to do something nontrivial, but I think a lot more work and research will be required in order to like, you know, build on this and take this to something that can lead to, like, practical algorithms.

TINGLE: Well, Dylan Foster, thank you for joining us today to discuss your paper on reinforcement learning under latent dynamics. We certainly appreciate it.

FOSTER: Thanks a lot. Thanks for having me.

[MUSIC]

TINGLE: And to our listeners, thank you all for tuning in. If you’d like to read Dylan’s paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Dylan Foster appeared first on Microsoft Research.

Read More