NeurIPS 2024: AI for Science with Chris Bishop

Illustrated headshots of Chris Bishop and Eliza Strickland.

The Microsoft Research Podcast offers its audience a unique view into the technical advances being pursued at Microsoft through the insights and personal experiences of the people committed to those pursuits. 

In this special edition of the podcast, Technical Fellow and Microsoft Research AI for Science Director Chris Bishop joins guest host Eliza Strickland of IEEE Spectrum at the 38th annual Conference on Neural Information Processing Systems (NeurIPS) to talk about deep learning’s potential to improve the speed and scale at which scientific advancements can be made. Bishop discusses the factors considered when choosing which scientific challenges to tackle with AI; the impact foundation models are having right now in areas such as drug discovery and weather forecasting; and the work at NeurIPS that he’s excited about.

Learn more:

From forecasting storms to designing molecules: How new AI foundation models can speed up scientific discovery
Microsoft Source blog, October 2024 

Introducing Aurora: The first large-scale foundation model of the atmosphere
Microsoft Research blog, June 2024 

GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases 
Microsoft Research blog, January 2024 

AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop 
Microsoft Research Podcast, December 2023 

AI4Science to empower the fifth paradigm of scientific discovery 
Microsoft Research blog, July 2022 

Novartis empowers scientists with AI to speed the discovery and development of breakthrough medicines 
Microsoft Source, November 2021 

Bringing together deep bioscience and AI to help patients worldwide: Novartis and Microsoft work to reinvent treatment discovery and development 
Official Microsoft Blog, October 2019 

Transcript

[MUSIC] 

ELIZA STRICKLAND: Welcome to the Microsoft Research Podcast, where Microsoft’s leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them.  

I’m Eliza Strickland, a senior editor at IEEE Spectrum and your guest host for a special edition of the podcast.  

[MUSIC FADES] 

Joining me today in the Microsoft Booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Chris Bishop. Chris is a Microsoft technical fellow and the director of Microsoft Research AI for Science. Chris is with me for one of our two on-site conversations that we’re having here at the conference.  

Chris, welcome to the podcast.


CHRIS BISHOP: Thanks, Eliza. Really great to join you. 

STRICKLAND: How did your long career in machine learning lead you to this focus on AI for Science, and were there any pivotal moments when you started to think that, hey, this deep learning thing, it’s going to change the way scientific discovery happens? 

BISHOP: Oh, that’s such a great question. I think this is like my career coming full circle, really. I started out studying physics at Oxford, and then I did a PhD in quantum field theory. And then I moved into the fusion program. I wanted to do something of practical value, [LAUGHTER] so I worked on nuclear fusion for about seven or eight years doing theoretical physics, and then that was about the time that Geoff Hinton published his backprop paper. And it really caught my imagination as an exciting approach to artificial intelligence that might actually yield some progress. So that was, kind of, 35 years ago, and I moved into the field of machine learning. And, actually, the way I made that transition was by applying neural networks to fusion. I was working at the JET experiment, which was the world’s largest fusion experiment. It was sort of big data in its day. And so I had to, first of all, teach myself to program.  

STRICKLAND: [LAUGHS] Right.  

BISHOP: I was a pencil-and-paper theoretician up to that point. I had to persuade my boss to buy me a workstation, and then I started to play with these neural nets. So right from the get-go, I was applying machine learning 35 years ago to data from science experiments. And that was a great on-ramp for me. And then, eventually, I just got so distracted, I decided I wanted to build my career in machine learning. Spent a few years as a research professor and then joined Microsoft 27 years ago, when Microsoft opened its first research lab outside the US in Cambridge, UK, and have been there very happily ever since. Went on to become lab director. But about three or four years ago, I realized that not only was deep learning transforming so many different things, but I felt it was especially relevant to scientific discovery. And so I had an opportunity to pitch to our chief technology officer to go start a new team. And he was very excited by this. So just over two and a half years ago now, we set up Microsoft Research AI for Science, and it’s a global team, and it, sort of, does what it says on the tin. 

STRICKLAND: So you’ve said that AI could usher in a fifth paradigm of scientific discovery, which builds upon the ideas of Turing Award–winner Jim Gray, who described four stages in the evolution of science. Can you briefly explain the four prior paradigms and then tell us about what makes this stage different? 

BISHOP: Yeah, sure. So it was a nice insight by Jim. He said, well, of course, the first paradigm of scientific discovery was really the empirical one. I tend to think of some cave dweller picking up a big rock and a small rock and letting go of them at the same time and thinking the big rock will hit the ground first … 

STRICKLAND: [LAUGHS] Right … 

BISHOP: … discovering they land together. And this is interesting. They’ve discovered a, sort of, pattern, a regularity in nature, and even today, the first paradigm is in a sense the prime paradigm. It’s the most important one because at the end of the day, it’s experimental results that determine the truth, if you like. So that’s the first paradigm. And it continues to be of critical importance today. And then the second paradigm really emerged in the 17th century, when Newton discovered the laws of motion and the law of gravity. Not only did he discover the equations but also this, sort of, remarkable fact that nature can even be described by equations, right. It’s not obvious that this would be true, but it turns out that, you know, the world around us can be described by very simple equations that you can write on a T-shirt. And so in the 19th century, James Clerk Maxwell discovered some simple equations that describe the whole of electricity and magnetism, electromagnetic waves, and so on. And then very importantly, at the beginning of the 20th century, we had this remarkable breakthrough in quantum physics. So again down at the molecular—the atomic—level, the world is described with exquisite precision by Schrödinger’s equation. And so this was the second paradigm, the theoretical one: that the world is described with incredible precision, over a huge range of length and time scales, by very simple equations.  

But of course, there’s a catch, which is those equations are very hard to solve. And so the third paradigm really began, I guess, sort of, in the ’50s and ’60s, with the development of digital computers. And, actually, the very first use of digital computers was to simulate physics, and it’s been at the core of digital computing right up to the present day. And so what you’re doing there is using a computer, with a numerical algorithm, to solve those very simple equations but solve them in a practical setting. And so that’s what I’ll refer to as simulation. That’s the third paradigm. And that’s proven to be tremendously powerful. If you look up the weather forecast on your phone today, it’s done by numerical weather forecasting, solving in that case the Navier-Stokes equations using big numerical simulators. What Jim Gray observed, though, really emerging at the beginning of the 21st century was what he called the fourth paradigm, or data-intensive scientific discovery. So this is the era of big data. Think of particle physics at the CERN accelerator, for example, generating colossal amounts of data in real time. And that data can then be processed and filtered. We can do statistics on it. But of course, we can do machine learning on that data. And so machine learning feeds off large data. And so the fourth paradigm really is dominated today by machine learning. And again that remains tremendously important.  

What I noticed, though, is that there’s again another framework. We call it the fifth paradigm. Again, it goes back to those fundamental equations. But again, it’s driven by computation, and it’s the idea that we can train machine learning systems not using the empirical data of the fourth paradigm but instead using the results of simulation. So the output of the third paradigm. So think of it this way. You want to predict the property of some molecule, let’s say. You could in principle solve Schrödinger’s equation on a digital computer; it’d be very expensive. And let’s say you want to screen hundreds of millions of molecules. That’s going to get far too costly. So instead, what you can do is have a mindset shift. You can think of that simulator not as a tool to predict the molecule’s properties directly but instead as a way of generating synthetic training data. And then you use that training data to train a deep learning system to give what I like to call an emulator, an emulator of the simulator. Once it’s trained, that emulator is fast. It’s usually three to four orders of magnitude faster than the simulator. So if you’re going to do something over and over again, that three-to-four-order-of-magnitude acceleration is tremendously disruptive. And what’s really interesting is we see that fifth paradigm occur in many, many different places. The idea goes back a long way. Actually, the last project that I worked on before I left the fusion program was to do what was the world’s first-ever real-time control of a tokamak fusion plasma using a neural net and the computers of the day. But the processors were just far too slow, long before GPUs, and so on. And so it wasn’t possible to solve the equations. In that case, it was called the Grad-Shafranov equation. Again, a simple differential equation you could write on a T-shirt, but solving it was expensive on a computer. We were about a million times too slow to solve it directly in real time. And so instead, we generated lots and lots of solutions. We used those solutions to train a very simple neural network, not a deep network, just a simple two-layer network back in the day, and then we implemented that in special hardware and did real-time feedback control. So that was an example of the fifth paradigm from, you know, a quarter of a century ago. But of course, deep learning just tremendously expands the range of applicability. So today we’re using the fifth paradigm in many, many different scenarios. And time and time again, we see these three-to-four-order-of-magnitude accelerations. So I think it’s worth thinking of that as a new paradigm because it’s so pervasive and so ubiquitous. 
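
To make the fifth-paradigm idea concrete, here is a minimal Python sketch of an "emulator of the simulator." The toy simulate function, the data sizes, and the network are invented stand-ins rather than any of the solvers Bishop mentions; the pattern is simply to run the expensive solver offline to produce synthetic training data, fit a cheap learned model to it, and then query the emulator in the inner loop.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def simulate(params):
    """Stand-in for an expensive numerical solver (e.g., a PDE integrator).
    Here it is just a cheap analytic function so the example runs quickly."""
    x, y = params
    return np.sin(3.0 * x) * np.cos(2.0 * y) + 0.1 * x * y

# Third paradigm: run the simulator many times to generate synthetic training data.
rng = np.random.default_rng(0)
inputs = rng.uniform(-1.0, 1.0, size=(5000, 2))
targets = np.array([simulate(p) for p in inputs])

# Fifth paradigm: fit a neural-network emulator to the simulator's outputs.
emulator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
emulator.fit(inputs, targets)

# Once trained, the emulator stands in for the simulator inside inner loops
# (screening, real-time control, ...) at a fraction of the cost per query.
test = rng.uniform(-1.0, 1.0, size=(5, 2))
print("simulator:", [round(simulate(p), 3) for p in test])
print("emulator :", [round(float(v), 3) for v in emulator.predict(test)])
```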

STRICKLAND: So how do you identify fields of science and particular problems that are amenable to this kind of AI assistance? Is it all about availability of data or the need for that kind of speed up? 

BISHOP: So there are lots of factors that go into this. And when I think about AI for Science actually, the space of opportunity is colossal because science is, science is really just understanding more about the world around us. And so the range of possibilities is daunting really. So in choosing what to work on, I think there are several factors. Yes, of course, data is important, but very interestingly, we can use experimental data or we can generate synthetic data by running simulators. So we’re a big fan of the fifth paradigm. But I think another factor—and this is particularly at Microsoft—is thinking about, how can we have real-world impact at scale? Because that’s our job, is to make the world a better place and to do so at a planetary scale. And so we’ve settled on, for the most part, working at the molecular level. So if you think about the number of different ways of combining atoms together to make new stable configurations of atoms, it’s gargantuan. I mean, the number of just small molecules, small organic molecules, that are potential drug candidates is about 10⁶⁰. It’s about the same as the number of atoms in the solar system. The number of proteins, maybe the fourth power of the number of atoms in the universe, or something crazy. So you’ve got this gargantuan space to search, and within that space, for sure, there’ll be all sorts of interesting molecules, materials, new drugs, new therapies, new materials for carbon capture, new kinds of batteries, new photovoltaics. The list is endless because everything around us is made of atoms, including our own bodies. So the potential just in the molecular space is gargantuan. And so that’s why we focus there. 

STRICKLAND: It’s a big focus. [LAUGHTER] 

BISHOP: It’s a broad focus, still, yes. 

STRICKLAND: So let’s take one of these case studies then. In a project on drug discovery, you worked with the Global Health Drug Discovery Institute on molecules that would interact with tuberculosis and coronaviruses, I think. And you found, I think, candidate molecules in five months instead of several years. Can you talk about what models you used in this work and how they helped you get this vastly sped up process? 

BISHOP: Sure. Yes. We’re very proud of this project. We’re working with the Gates Foundation and the Global Health Drug Discovery Institute to look at particularly diseases that affect low-income countries like tuberculosis. And in terms of the models we use, I think we’re all familiar with a large language model. We train it on a sequence of words or sequence of word tokens, and it’s trained to predict the next token. We can do a similar thing, but instead of learning the language of humans, we can learn the language of nature. So in particular, what we’re looking for here is a small organic molecule that we could synthesize in a laboratory that will bind with a particular target protein. It’s called ClpP. And by interfering with that protein, we can arrest the process of tuberculosis. So the goal is to search that space of 10⁶⁰ molecules and find a new one that has the right properties. Now, the way we do this is to train something that’s essentially a transformer. So it looks like a language model, but the language it’s trained on is a thing called SMILES strings. It’s an idea that’s been around in chemistry for a long time. It’s just a way of taking a three-dimensional molecule and representing it as a one-dimensional sequence of characters. So this is perfect for feeding into a language model. So we take a transformer and we train it on a large database of small organic molecules that are, sort of, typical of the kinds of things you might see in the space of drug molecules. Once that’s been trained, we can now run it generatively. And it will output new molecules. Now, we don’t just want to generate molecules at random because that doesn’t help. We want to generate molecules that bind to this particular binding site on this particular protein. So the next step is we have to tell the model about the protein and the protein binding site. And we do that by giving it information about not actually—well, we do tell it about the whole protein, but we especially give it information about the three-dimensional geometry of the binding site. So we tell it about the locations of the atoms that are in the binding site. And we do this in a way that satisfies certain physics constraints, sort of, equivariance properties, it’s called. So if you think about a molecule, if I rotate the molecule in space, the positions of all the atoms change in a complicated way. But it’s the same molecule; it has the same energy and other properties and so on. So we need the right kind of representation. That’s then fed into this transformer using a technique called cross-attention. So internally, the transformer uses self-attention to look at the history of tokens, but it can now use cross-attention to look at another model that understands the proteins. But even that’s not enough. Because in discovering drugs and exploring this gargantuan space and looking for these needles in a haystack, what typically happens [is] you find a hit, a molecule that binds, but now you want to optimize it. You want to make lots of small variations of that molecule in order to make it better and better at binding. So the third piece of the architecture is another module, a thing called a variational autoencoder, that again uses deep learning. But this time, it can take as input an organic molecule that is already known, a hit that’s already known to bind to the site, and that again is fed in through cross-attention. 
And now the SMILES autoregressive model can generate a molecule that’s an improvement on the starting molecule and knows about the protein binding. And so what we do is, we start off with the state-of-the-art molecule. And the best example we found is one with more than two orders of magnitude stronger binding affinity to the binding pocket, which is a tremendous advance; it’s the state of the art in addressing tuberculosis. And of course, the exciting thing is that this is tested in the laboratory. So this is not just a computer experiment in some sort of benchmark or whatever. We sent a description of the molecule to the laboratories at GHDDI. They synthesized the molecule, characterized it, measured its binding property, and said, well, hey, this is a new state of the art for this target protein. So we’re continuing to work with them to further refine this. There are obviously quite a few more steps. If you know about the drug discovery process, there are a lot of hurdles you have to get through, including, of course, very important clinical trials, before you have something that can actually be used in humans. But we’re already hugely excited about the fact that we were able to make such a big advance in such a short amount of time compared to the usual drug discovery process. 
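
The architecture Bishop sketches, an autoregressive SMILES model that cross-attends to a representation of the protein binding pocket, can be illustrated with a minimal PyTorch sketch. The class name, vocabulary size, dimensions, and the random pocket embeddings below are hypothetical placeholders, not the actual model used with GHDDI; in particular, the real system uses equivariant pocket features and a further cross-attention path for a known hit molecule, both of which are omitted here.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real vocabulary, pocket featurization, and model
# dimensions are not described in the conversation.
VOCAB_SIZE = 64        # SMILES character/token vocabulary
D_MODEL = 128
MAX_LEN = 80

class PocketConditionedSmilesDecoder(nn.Module):
    """Autoregressive SMILES generator that cross-attends to an embedding of
    a protein binding pocket."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerDecoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=256, batch_first=True
        )
        # Each decoder layer applies self-attention over the SMILES prefix and
        # cross-attention over the conditioning "memory" (pocket-atom embeddings).
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, smiles_tokens, pocket_embeddings):
        b, t = smiles_tokens.shape
        positions = torch.arange(t, device=smiles_tokens.device).unsqueeze(0)
        x = self.token_emb(smiles_tokens) + self.pos_emb(positions)
        # Causal mask keeps generation autoregressive (each token sees only its prefix).
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(tgt=x, memory=pocket_embeddings, tgt_mask=causal_mask)
        return self.lm_head(h)          # next-token logits over the SMILES vocabulary

# Toy usage with random stand-ins for real featurized inputs.
model = PocketConditionedSmilesDecoder()
smiles = torch.randint(0, VOCAB_SIZE, (2, 20))   # tokenized SMILES prefixes
pocket = torch.randn(2, 30, D_MODEL)             # 30 pocket-atom embeddings per example
logits = model(smiles, pocket)
print(logits.shape)                              # (2, 20, VOCAB_SIZE)
```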

STRICKLAND: And while you were looking for that molecule that had the proper characteristics, were you also determining whether it could be manufactured easily, like trying to think about practical realities of bringing this thing out of the computer and into the lab? 

BISHOP: Great question. I mean, you’re hinting there at the fact the discovery process, of course, is a long pipeline. You start with the protein. You have to find a molecule that binds. You then refine the molecule. Now you have to look at ADMET, you know, the absorption, metabolism, and excretion and so on of the molecule. Also make sure that it’s not toxic. But then you need to be able to synthesize it. It’s no good if nobody can make this molecule. So you have to look at that. So, actually, in the AI for Science team, we look at all of these aspects of that drug discovery process. And we find particular areas, especially where there’s, sort of, low-hanging fruit where we can see that deep learning can make a big impact. It doesn’t necessarily help much to take a very easy, fast piece of the pipeline and go work on that. You want to understand, what are the bottlenecks, and can we really unlock those with deep learning? So we’re very interested in that whole process. It’s a fascinating problem. You’ve got a gargantuan search space, and yet you have so many different constraints that need to be met. And deep learning just feels like the perfect tool to go after this problem. 

STRICKLAND: When you talk to the scientists that you collaborate with, is AI changing the kinds of questions that they are able to ask? That they want to ask? 

BISHOP: Oh, for sure. And it’s really empowering. It’s enabling those working in the drug discovery space, I think, to think in a much more expansive way. If you think about just the kind of acceleration that I talked about from the fifth paradigm, if you get a four-order-of-magnitude acceleration, OK, it may not sound like much of a dent in the 10⁶⁰ space, but now when you’re exploring variants of molecules and so on, the ability to explore that space orders of magnitude faster allows you to think much more creatively, allows you to think in a more expansive way about how much of that space you can explore and how efficiently you can explore it. So I think it really is opening up new horizons, and certainly, we have an exciting partnership with Novartis. We’ve been working with them for the last five years, and they’ve been deploying some of our techniques and models in practice for their drug discovery pipeline. We get a lot of great feedback from them about how exciting they’re finding these techniques to use in practice because it is changing the way they go about doing the drug discovery process. 

STRICKLAND: To jump to one other case study, we don’t have to go into great detail on it, but I’m very curious about your Project Aurora, this foundation model for state-of-the-art weather forecasting that, I believe, is 5,000 times faster than traditional physics-based methods. Can you talk a little bit about how that project is evolving, how you imagine these AI forecasting models working with traditional forecasting models, perhaps, or replacing them? 

BISHOP: Yes. So I said most of what we do is down at the molecular level. So this is one of the exceptions. So this is really at the global level, the planetary level. Again, it’s a beautiful example of the fifth paradigm because the way forecasting has been done for a number of decades now and the way most forecasting is done at the moment is through what’s called numerical weather prediction. So again, you have these simple equations. It’s no longer Schrödinger’s equation of atomic physics. It’s now the Navier–Stokes equations of fluid flow and a whole bunch of other equations that describe moisture in the atmosphere and the weather and so on. And those equations are solved on a supercomputer. And again, we can think of that numerical simulator now not just as the way you’re going to do the forecasting but actually as the way to generate training data for a deep learning emulator. So several groups have been exploring this over the last couple of years. And again, we see this very robust three-to-four-order-of-magnitude acceleration. But what’s really interesting about Aurora is that it’s the world’s first large-scale foundation model of the atmosphere. So instead of just building an emulator of a particular numerical weather simulator, which is already very interesting, we trained Aurora on a much more diverse set of data, really trying to force it not just to emulate a particular simulator but really, as it were, to understand or model the fundamental equations of fluid flows in the Earth’s atmosphere. And then the reason we want to do this is because we now want to take that foundation model and fine-tune it to other downstream applications where there’s much less data. So one example would be pollution flow. So obviously the flow of pollution around the atmosphere is extremely important. But the data is far more sparse. There are far fewer sensors for pollution than there are for, sort of, wind and rain and temperature and so on. And so we were able to achieve state-of-the-art performance in modeling the flow of pollution by leveraging huge data and building this foundation model and then using relatively little data from pollution monitoring to build that downstream fine-tuned model. So it’s a beautiful example of a foundation model. 
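
A hedged sketch of the fine-tuning pattern described here: freeze a large pretrained backbone and train only a small task head on the scarce downstream data. The PretrainedAtmosphereBackbone class, the shapes, and the synthetic "pollution" measurements are illustrative assumptions only and bear no relation to Aurora's actual architecture or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainedAtmosphereBackbone(nn.Module):
    """Placeholder for a large pretrained foundation model of the atmosphere."""
    def __init__(self, d_state=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(64, d_state), nn.GELU(),
                                     nn.Linear(d_state, d_state))
    def forward(self, x):
        return self.encoder(x)   # shared representation of the atmospheric state

backbone = PretrainedAtmosphereBackbone()
# backbone.load_state_dict(torch.load("pretrained_weights.pt"))  # hypothetical checkpoint

# Freeze the pretrained weights; only the small task head sees the scarce data.
for p in backbone.parameters():
    p.requires_grad = False

pollution_head = nn.Linear(256, 1)   # predicts, e.g., a pollutant concentration
optimizer = torch.optim.Adam(pollution_head.parameters(), lr=1e-3)

# Toy fine-tuning loop on a small synthetic "pollution" dataset.
x = torch.randn(128, 64)             # coarse atmospheric-state features (made up)
y = torch.randn(128, 1)              # sparse pollution measurements (made up)
for _ in range(50):
    pred = pollution_head(backbone(x))
    loss = F.mse_loss(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final loss:", loss.item())
```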

STRICKLAND: That is a cool example. And finally, just to wrap up, what have you seen or heard at NeurIPS that’s gotten you excited? What kind of trends are in the air? What’s the buzz? 

BISHOP: Oh, that’s a great question. I mean, it’s such a huge conference. There’s something like 17,000 people or so here this year, I’ve heard. I think, you know, one of the things that’s happened so far that’s actually given me an enormous amount of energy wasn’t just a technical talk. It was actually an event we had on the first day called Women in Machine Learning. And I was a mentor on one of the mentorship tables, and I found it very energizing just to meet so many people, early-career-stage people, who were very excited about AI for Science and realizing that, you know, it’s not just that I think AI for Science is important. A lot of people are moving into this field now. It is a big frontier for AI. I’m a little biased, perhaps. I think that it’s the most important application area. Intellectually, it’s very exciting because we get to deal with science as well as machine learning. But also if you think about [it], science is really about learning more about the world. And once we learn more about the world, we can then develop agriculture; we can develop the steam engine; we can develop silicon chips; we can change the world. We can save lives and make the world a better place. And so I think it’s the most fundamental undertaking we have in AI for Science, and the thing I loved about the Women in Machine Learning event is that the AI for Science table was just completely swamped with all of these people at early stages of their career, either already working in this field and doing PhDs or wanting to get into it. That was very exciting. 

STRICKLAND: That is really exciting and inspiring, and it gives me a lot of hope. Well, Chris Bishop, thank you so much for joining us today and thanks for a great conversation. 

BISHOP: Thank you. I really appreciate it. 

[MUSIC] 

STRICKLAND: And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft, you can check out the Microsoft Research website at microsoft.com/research. Until next time.  

[MUSIC FADES]

Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang

Illustrated image of Jindong Wang and Steven Euijong Whang

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang, a tenured associate professor at Korea Advanced Institute of Science and Technology (KAIST), join host Gretchen Huizinga to discuss the paper “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” a spotlight session at this year’s Conference on Neural Information Processing Systems (NeurIPS). ERBench leverages the integrity constraints of relational databases to create LLM benchmarks that can verify model rationale via keywords as well as check for answer correctness.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are coauthors of a paper called “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” and this paper is a spotlight at this year’s conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC, this week. Jindong and Steven, thanks for joining us on Abstracts!


JINDONG WANG: Thank you. Nice to be here.

STEVEN EUIJONG WHANG: It’s great to be here.

HUIZINGA: So, Jindong, I’ll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.

JINDONG WANG: OK, everybody knows that with the widespread usage of large language models, hallucination has become a crucial concern. Hallucination occurs when models generate false or nonexistent information. In particular, factual hallucination greatly undermines the reliability of large language models. To correctly evaluate hallucination, evaluating the model’s rationale is also important. As of when the paper, you know, was submitted, there were no works dealing with automatic rationale evaluation systematically because, you know, most of them focused on manual evaluation or just using GPT-judge. ERBench is the first one to generate a large language model evaluation benchmark utilizing relational databases. Relational databases are based on the relational data model assuming a fixed schema. The fixed schema gives relational databases data integrity grounded in database design theories, so the integrity constraints in relational databases allow better evaluation of large language models. Functional dependencies allow automatic rationale evaluation using keywords inferred from the functional dependencies, and foreign key constraints also allow for easy generation of multi-hop questions, which are usually very complicated to generate with other techniques. So that’s basically what we want to do. So in one sentence, we try to build an automatic benchmark for evaluating hallucination.

HUIZINGA: Steven, give us a quick overview of your research methodology and findings. How did you conduct your research, and what were your major takeaways?

STEVEN EUIJONG WHANG: Sure. So this was a collaboration between our group at KAIST, and Dr. Xing Xie’s group at MSRA (Microsoft Research Asia). KAIST is Korea Advanced Institute of Science and Technology. So we had the privilege to closely work with our LLM expert, Dr. Jindong Wang, here. We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for using Azure quota for our experiments. So we had some biweekly meetings for maybe over a year, and at some point, we figured that relational databases could be really important for LLM evaluation. I personally have a background in databases, which I studied at Stanford University as a PhD student. So relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers. So the first ingredient is functional dependencies. So these are constraints where, given a few attributes, you can determine another attribute. So I’ll just give an example because I think that helps the understanding. So suppose that you have, like, a movie table, and in a movie, you have the title of the movie, the year of production, and the director of the movie, and the length of the movie, and so on and so forth. So if you know the title and year of the movie, that pretty much identifies the movie, and you can actually determine the director of the movie, as well. So, for example, if you know that there’s a movie called Star Wars, which is a very popular movie produced in 1977, that determines the director. We know it’s George Lucas, right. So, basically, it’s like a function. It receives the Star Wars 1977 and determines, gives the output, George Lucas. So that’s the first ingredient. Now, the reason this is important is that we can use these functional dependencies to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values. For example, we may ask the LLM, is there a director of a movie called Star Wars produced in 1977? And the LLM can say yes. And it is the right answer, but we’d like to know if the LLM is knowing what it’s saying, right. And so we look at the rationale. That’s why looking at the rationale is important. We just can’t say it’s doing the correct thing. So if the LLM mentions George Lucas, bingo, that’s a great answer. However, if the LLM mentions some other director, like Steven Spielberg, that’s not a correct rationale. So that’s exactly what we’re trying to evaluate. Functional dependency is key to being able to do that kind of verification.
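
As a rough illustration of how a functional dependency drives both question generation and rationale checking, here is a small Python sketch. The movie row, the question template, and the simple keyword match are invented for this example; ERBench's full pipeline is described in the paper.

```python
# A tiny "movie" relation. The functional dependency (title, year) -> director
# lets us generate a question and know which keyword must appear in the rationale.
movies = [
    {"title": "Star Wars", "year": 1977, "director": "George Lucas"},
]

def make_question(row):
    question = (f"Is there a director of a movie called {row['title']} "
                f"produced in {row['year']}?")
    expected_answer = "yes"
    rationale_keyword = row["director"]     # determined by the functional dependency
    return question, expected_answer, rationale_keyword

def grade(llm_answer, llm_rationale, expected_answer, rationale_keyword):
    answer_ok = expected_answer in llm_answer.lower()
    rationale_ok = rationale_keyword.lower() in llm_rationale.lower()
    return answer_ok, rationale_ok

q, ans, kw = make_question(movies[0])
# Hypothetical model output: correct answer, but an incorrect rationale.
answer_ok, rationale_ok = grade("Yes.", "Yes, it was directed by Steven Spielberg.", ans, kw)
print(q)
print("answer correct:", answer_ok, "| rationale correct:", rationale_ok)
```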

The second ingredient is foreign key constraints. So foreign key constraint is where one of the attributes in one table can intuitively link to another attribute of another table. So in our movie table, we had the director attribute. Now we may also have a separate table called the director table, and maybe we might have some more information about the director in that table, like the director name, the director’s age, all sorts of information about the director. So foreign key constraint basically requires that if there is some director mentioned in the movie table, it has to be one of the directors in the director table. So this basically links a table to another table. It’s very useful. So using this, what we can do is we can join the two tables, right. So now we can join the movie and director table and generate a bigger table. The reason this is useful is that we can also chain together functional dependencies that I just mentioned into longer functional dependencies. So what this enables is us to construct more complex questions, arbitrarily, that are multi-hop. So using these integrity constraints, we can basically convert any relational database into an LLM benchmark, and this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various prompt engineering techniques.
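
And a similarly hedged sketch of the foreign-key idea: joining the movie table to a director table chains (title, year) → director with director → birth year into a single multi-hop question. The director table and its attributes are invented for illustration (George Lucas was indeed born in 1944).

```python
movies = [{"title": "Star Wars", "year": 1977, "director": "George Lucas"}]
directors = [{"name": "George Lucas", "birth_year": 1944}]

def join_movie_director(movie_rows, director_rows):
    """Foreign key: movie.director must reference director.name."""
    by_name = {d["name"]: d for d in director_rows}
    return [{**m, **by_name[m["director"]]}
            for m in movie_rows if m["director"] in by_name]

def make_multihop_question(row):
    # Chained dependency: (title, year) -> director -> birth_year.
    question = (f"In what year was the director of the movie {row['title']} "
                f"({row['year']}) born?")
    expected_answer = str(row["birth_year"])
    rationale_keyword = row["name"]   # the intermediate hop the model should mention
    return question, expected_answer, rationale_keyword

joined = join_movie_director(movies, directors)
print(make_multihop_question(joined[0]))
```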

HUIZINGA: Well, I would ask you to, kind of, drill in on what you found in how ERBench compares to other benchmark tests.

STEVEN EUIJONG WHANG: So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single-hop, multi-hop, and multimodal questions and also performed prompt engineering and fine-tuning. And what we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs, like Gemini, tend to be a bit more conservative and do not answer as many questions but instead hallucinate less as a result. So the key conclusion is that no LLM, like, totally subsumes the others in all aspects, which is the reason why we use multiple measures. And the key message we want to convey is that overall, ERBench is effective in evaluating any LLM’s thought process by pinpointing critical keywords within the rationale.

HUIZINGA: Well, Jindong, back to you. Research settings are one thing, but tell us how your work is significant in real-world settings, and who does this impact most and how?

JINDONG WANG: Relational databases, you know, they are everywhere across various domains. Anyone can easily get access to them from Google or from Kaggle or even create them targeting the domain or subject that one wants to test the model on. So given that ERBench is the first work to utilize relational databases for generating large language model hallucination benchmarks, this work will lead to a new research direction of integrating database design theories and techniques, a long-studied field—you know, databases are very traditional, old, and classic, but, you know, they’re still operating right now—into the large language model field, a recently emerging area.

HUIZINGA: Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field. What do you propose to do about those, and what’s next on your research agenda?

STEVEN EUIJONG WHANG: Sure, so the big picture is that we basically proposed the first work to properly evaluate the rationale of LLMs, right. This is very important because LLMs are being used in our everyday lives, and everyone has the question, is the LLM suitable for my task? Can I benefit from the LLM? So it’s very important to verify if the LLM knows what it’s saying. So I just mentioned that we use functional dependencies to pinpoint critical keywords in the rationale. And we believe that’s just the first step. It’s very effective, by the way. So you may have the question, is it enough to just look at, like, the George Lucas within the long rationale? And it turns out that in 95% of the cases, it is actually effective, so we did human studies and also used GPT-judge to verify that. But these are factual questions, and there could be various other questions that require long answers, right. Long rationales. And so the important question is, can we also verify all the rest of the rationales, the complicated rationales, as well? And so in order to properly do that, we need a lot of technology. So first we need to understand the rationales using NLP techniques, and we need to know if it’s properly answering the question, and so on and so forth. And so we believe that there’s a lot of opportunity to expand from that. So we basically, you know, proposed an initial work towards this direction, but we believe that there are many more interesting challenges that remain.

HUIZINGA: Well, Jindong Wang and Steven Whang, thanks for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts.

[MUSIC]

You can also find it on arXiv and on the NeurIPS website. And if you’re at the NeurIPS conference this week, go to the poster session and talk to the authors! See you next time on Abstracts!

[MUSIC FADES]

Abstracts: NeurIPS 2024 with Weizhu Chen

Illustrated image of Weizhu Chen.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Weizhu Chen, vice president of Microsoft GenAI, joins host Amber Tingle to discuss the paper “Not All Tokens Are What You Need for Pretraining,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). Based on an examination of model training at the token level, Chen and his coauthors present an alternate approach to model pretraining: instead of training language models to predict all tokens, they make a distinction between useful and “noisy” tokens. Doing so, the work shows, improves token efficiency and model performance.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

Our guest today is Weizhu Chen. He is vice president of Microsoft GenAI and coauthor of a paper called “Not All Tokens Are What You Need for Pretraining.” This paper is an oral presentation during the 38th annual Conference on Neural Information Processing Systems, also known as NeurIPS, which is happening this week in Vancouver. Weizhu, thank you for joining us today on Abstracts


WEIZHU CHEN: Thank you for having me, Amber. 

TINGLE: So let’s start with a brief overview of your paper. In a couple sentences, tell us about the problem your research addresses and, more importantly, why the research community and beyond should know about this work. 

CHEN: So my team in Microsoft GenAI, we are working on model training. One of the things we do in the pretraining is we realize the importance of the data. And we found that when we look at the data at the level of each token, some tokens are more important than others. That’s one thing. The other is that some tokens are very, very hard to predict during the pretraining. So, for example, if someone sees the text of “Weizhu,” what’s the next token? It could be “Chen”; it could be any last name. So it’s very hard to predict. And if we try to force a language model to focus on these kinds of hard-to-predict tokens, it’s actually going to confuse the language model. There are so many different examples like this. Just like, for example, the serial number in your UPS. So the focus of this paper is to try to identify which tokens are actually more important for the language model to learn, while the other tokens may just be noise, and how we can try to discriminate between them—which are good tokens and which are noise tokens. Basically, you try to understand these kinds of dynamics of the tokens. 

TINGLE: How did you conduct this research? 

CHEN: Actually we do a lot of work in model training, including the pretraining and the post-training. On the pretraining side, the most important thing to us is the data. We try to understand how we can leverage the existing data and how we can create much more data as well. Data is basically one of the most important things in building a better foundation model, so we try to understand how much more we can get from the data. And an important part of that is data filtering. In the previous literature, data filtering works, for example, by building a classifier to decide, OK, this page is more important than that one, and that page is noise, because there is so much noisy data on the web. So we just keep the best data to put into the pretraining corpus. Then going further, we thought, OK, maybe that’s not fine-grained enough. Can we try to understand, even within a page we want to keep, which tokens are more important than others? Maybe some tokens are just noise tokens, and if you put that data into the pretraining, it’s going to hurt the model quality. So that is the motivation we were thinking about.

TINGLE: And what were your major findings? 

CHEN: Our major finding is basically that this works very well. It’s so important that we are able to pick the best tokens from the corpus and ask the model during the pretraining to ignore the tokens we don’t want it to learn from. That is one. The second thing is that data is the other very important thing: if you’re able to figure out a better way to build better data, most likely you’re able to build a much better foundation model. The third thing is that this is connected to a lot of other existing work, like data synthesis, distillation, and data filtering, so a lot of things are really connected together. And you can associate this work with other work we are doing, like distillation. For example, in this work we also build a model we call the reference model, to try to identify which data, which tokens, are more important than others, by looking at the discrepancy between the reference model and the running model in their predictions on each token. So you can also think of it as, in some sense, distilling from the reference model into the model being trained, as well. 
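
A hedged sketch of the token-selection idea Chen describes: score every token with both a frozen reference model and the model being trained, and keep the pretraining loss only on the tokens where the gap suggests there is something useful left to learn. The keep ratio, the top-k selection rule, and the shapes below are illustrative stand-ins for the procedure detailed in the paper.

```python
import torch
import torch.nn.functional as F

def selective_token_loss(train_logits, ref_logits, token_ids, keep_ratio=0.6):
    """Cross-entropy computed over a selected subset of tokens.

    train_logits, ref_logits: (batch, seq_len, vocab) next-token logits from the
    model being trained and from a frozen reference model.
    token_ids: (batch, seq_len) the tokens actually observed in the corpus.
    """
    vocab = train_logits.size(-1)
    train_ce = F.cross_entropy(train_logits.reshape(-1, vocab),
                               token_ids.reshape(-1), reduction="none")
    with torch.no_grad():
        ref_ce = F.cross_entropy(ref_logits.reshape(-1, vocab),
                                 token_ids.reshape(-1), reduction="none")
        # Tokens where the training model is much worse than the reference are
        # treated as "learnable"; tokens that are trivial or hopelessly noisy
        # contribute less signal and are dropped from the loss.
        excess = train_ce - ref_ce
        k = max(1, int(keep_ratio * excess.numel()))
        keep = torch.zeros_like(excess, dtype=torch.bool)
        keep[excess.topk(k).indices] = True
    return (train_ce * keep).sum() / keep.sum()

# Toy usage with random logits standing in for two language models.
B, T, V = 2, 16, 100
loss = selective_token_loss(torch.randn(B, T, V, requires_grad=True),
                            torch.randn(B, T, V),
                            torch.randint(0, V, (B, T)))
print(loss.item())
```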

TINGLE: Let’s talk a little bit about real-world impact. Who benefits most from this work? And how significant is this within your discipline and even downstream for people using applications? 

CHEN: This actually is very fundamental work because, as I shared a little before, if we build the data much better, we are able to build a much better foundation model. And if we’re able to build a better model, it benefits so many different kinds of applications. This is also going to help us build much better small language models, and we can serve such models on the edge side, on the client side, in coding scenarios. So we are going to see huge impact from these kinds of foundation models if they are able to benefit from much better training data. 

TINGLE: Are there any unanswered questions or unsolved problems in this area? What’s next on your research agenda? 

CHEN: Yeah, I think that is a very good question. And definitely there are a lot of things about how to build better data that are still unsolved in the literature. Especially because when you do the pretraining, the most important part is the data, but the data is very limited. How we can make better use of the existing limited data is a big challenge, because we can increase the model by 10x, but it’s super hard to increase the data by 10x, especially when we want high-quality data. The other question is, even given the data, how can you identify, especially for this work, the importance of each token so as to build a much better model? I think all these things are very connected together. To me, actually, data is the oxygen. So there are still so many things we are able to do with the data, whether for small language models or large models. 

TINGLE: Data is oxygen—I love that! So other than that being a key takeaway, is there any other one thing that you’d like our listeners to walk away from this conversation knowing? 

CHEN: I would love to say: focus more on this kind of data and focus more on how you can get more from the data; that is the very important thing. And the other thing is that we are working on something very exciting. Feel free to come join us if you are interested in this area. 

[MUSIC] 

TINGLE: Well, Weizhu Chen, thank you for joining us today. We really appreciate it. 

CHEN: Thank you. Thank you for having me. 

TINGLE: And thanks to our listeners for tuning in. If you’d like to read the full paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts

[MUSIC FADES] 

Abstracts: NeurIPS 2024 with Dylan Foster

Illustrated image of Dylan Foster for the Abstracts series on the Microsoft Research Podcast.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Principal Researcher Dylan Foster joins host Amber Tingle to discuss the paper “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). In the paper, Foster and his coauthors explore whether well-studied RL algorithms for simple problems can be leveraged to solve RL problems with high-dimensional observations and latent dynamics, part of larger efforts to identify algorithm design principles that can enable agents to learn quickly via trial and error in unfamiliar environments.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Our guest today is Dylan Foster. He is a principal researcher at Microsoft Research and coauthor of a paper called “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity.” The work is among the oral presentations at this year’s Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver. Dylan, welcome and thank you for joining us on the podcast!


DYLAN FOSTER: Thanks for having me.

TINGLE: Let’s start with a brief overview of this paper. Tell us about the problem this work addresses and why the research community should know about it.

FOSTER: So this is a, kind of, a theoretical work on reinforcement learning, or RL. When I say reinforcement learning, broadly speaking, this is talking about the question of how can we design AI agents that are capable of, like, interacting with unknown environments and learning how to solve problems through trial and error. So this is part of some broader agenda we’ve been doing on, kind of, theoretical foundations of RL. And the key questions we’re looking at here are what are called, like, exploration and sample efficiency. So this just means we’re trying to understand, like, what are the algorithm design principles that can allow you to explore an unknown environment and learn as quickly as possible? What we’re doing in this paper is we’re, kind of, looking at, how can you most efficiently solve reinforcement learning problems where you’re faced with very high-dimensional observations, but the underlying dynamics of the system you’re interacting with are simple? So this is a setting that occurs in a lot of natural reinforcement learning and control problems, especially in the context of, like, say, embodied decision-making. So if you think about, say, games like Pong, you know, the state of the game, like, the state of, like, Pong, is extremely simple. It’s just, you know, what is the position and velocity of the ball, and, like, where are the paddles? But what we’d like to be able to do is learn to, you know, like, control or, like, solve games like this from raw pixels or, like, images kind of in the same way that a human would, like, just solve them from vision. So if you look at these types of problems, you know, we call these, like, RL with rich observations or RL with latent dynamics. You know, these are interesting because they, kind of, require you to explore the system, but they also require, you know, representation learning. Like, you want to be able to use neural nets to learn a mapping from, say, the images you see to the latent state of the system. This is a pretty interesting and nontrivial algorithmic problem. And, kind of, what we do in this work is we take a first step towards something like a unified understanding for how to solve these sorts of, like, rich-observation, or latent dynamics, RL problems.

TINGLE: So how did you go about developing this theoretical framework?

FOSTER: Yeah, so if you look at these sort of RL problems with latent dynamics, this is something that’s actually received a lot of investigation in theory. And a lot of this goes back to, kind of, early work from our lab from, like, 2016, 2017 or so. There’s some really interesting results here, but progress was largely on a, like, case-by-case basis, meaning, you know, there are many different ways that you can try to model the latent dynamics of your problem, and, you know, each of these somehow leads to a different algorithm, right. So, like, you know, you think very hard about this modeling assumption. You think about, what would an optimal algorithm look like? And you end up, you know, writing an entire paper about it. And there’s nothing wrong with that per se, but if you want to be able to iterate quickly and, kind of, try different modeling assumptions and see what works in practice, you know, this is not really tenable. It’s just too slow. And so the starting point for this work was to, kind of, try to take a different and more modular approach. So the idea is, you know, there are many, many different types of, sort of, systems or modeling assumptions for the dynamics that have been already studied extensively and have entire papers about them for the simpler setting in which you can directly see the state of the system. And so what we wanted to ask here is, is it possible to use these existing results in more of, like, a modular fashion? Like, if someone has already written a paper on how to optimally solve a particular type of MDP, or Markov decision process, can we just take their algorithm as is and perhaps plug it into some kind of meta-algorithm that can directly, kind of, combine this with representation learning and use it to solve the corresponding rich-observation, or latent dynamics, RL problem?

TINGLE: What were your major findings? What did you learn during this process?

FOSTER: We started by asking the question sort of exactly the way that I just posed it, right. Like, can we take existing algorithms and use them to solve rich-observation RL problems in a modular fashion? And this turned out to be really tricky. Like, there’s a lot of natural algorithms you might try that seem promising at first but don’t exactly work out. And what this, kind of, led us to and, sort of, the first main result in this paper is actually a negative result. So what we actually showed is most, sort of, well-studied types of systems or, like, MDPs that have been studied in, like, the prior literature on RL, even if they’re tractable when you’re able to directly see the state of the system, they can become statistically intractable once you add, sort of, high-dimensional observations to the picture. And statistically intractable here means the amount of interaction that you need, like the amount of, sort of, attempts to explore the system that you need, in order to learn a good decision-making policy becomes, like, very, very large, like much, much larger than the corresponding, sort of, complexity if you were able to directly see the states of the system. You know, you could look at this and say, I guess we’re out of luck. You know, maybe there’s just no hope of solving these sorts of problems. But that’s perhaps a little too pessimistic. You know, really the way you should interpret this result is just that you need more assumptions. And that’s precisely what the, sort of, second result we have in this paper is. So our second result shows that you can, sort of, bypass this impossibility result and, you know, achieve truly modular algorithms under a couple different types of additional assumptions.

TINGLE: Dylan, I’d like to know—and I’m sure our audience would, too—what this work means when it comes to real-world application. What impact will this have on the research community?

FOSTER: Yeah, so maybe I’ll answer that, um, with two different points. The first one is a broader point, which is, why is it important to understand this problem of exploration and sample efficiency in reinforcement learning? If you look at the, sort of, setting we study in this paper—you know, this, like, RL or decision-making with high-dimensional observations—on the empirical side, people have made a huge amount of progress on this problem through deep reinforcement learning. This was what kind of led to these amazing breakthroughs in solving games like Atari in the last decade. But if you look at these results, the gains are somehow more coming from the, like, inductive bias or the, like, generalization abilities of deep learning and not necessarily from the specific algorithms. So, like, current algorithms do not actually explore very deliberately, and so their sample complexity is very high. Like, it’s hard to draw a one-to-one comparison, but you can argue they need, like, far more experience than a human would to solve these sorts of problems. So it’s not clear that we’re really anywhere near the ceiling of what can be achieved in terms of, like, how efficiently can you have, you know, an agent learn to solve new problems from trial and error. And I think better algorithms here could potentially be, like, transformative in a lot of different domains. To get into this specific work, I think there’s a couple of important takeaways for researchers. One is that by giving this impossibility result that shows that RL with latent dynamics is impossible without further assumptions, we’re kind of narrowing down the search space where other researchers can look for efficient algorithms. The second takeaway is, you know, we are showing that this problem becomes tractable when you make additional assumptions. But I view these more as, like, a proof of concept. Like, we’re kind of, showing for the first time that it is possible to do something nontrivial, but I think a lot more work and research will be required in order to like, you know, build on this and take this to something that can lead to, like, practical algorithms.

TINGLE: Well, Dylan Foster, thank you for joining us today to discuss your paper on reinforcement learning under latent dynamics. We certainly appreciate it.

FOSTER: Thanks a lot. Thanks for having me.

[MUSIC]

TINGLE: And to our listeners, thank you all for tuning in. If you’d like to read Dylan’s paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Dylan Foster appeared first on Microsoft Research.

Read More

Abstracts: NeurIPS 2024 with Pranjal Chitale

Abstracts: NeurIPS 2024 with Pranjal Chitale

diagram

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Research Fellow Pranjal Chitale joins host Gretchen Huizinga to discuss the paper “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). CVQA, which comprises questions and images representative of 31 languages and the cultures of 30 countries, was created in collaboration with native speakers and cultural experts to evaluate how well models perform across diverse linguistic and cultural contexts, an important step toward improving model inclusivity.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract— of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Pranjal Chitale, a research fellow at Microsoft Research India. Pranjal is coauthor of a paper called “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” and this paper is an oral presentation at this week’s 38th annual Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC. Pranjal, thanks for joining us today on Abstracts!


PRANJAL CHITALE: Hi, Gretchen. Thanks for having me.

HUIZINGA: So, Pranjal, give us an overview of this paper. In a couple sentences, what problem are you trying to solve, and why should people care about it?

CHITALE: So we are witnessing some exciting times as LLMs are rapidly evolving as tools for countless use cases. While most of these LLMs were initially leveraged for natural language processing tasks, they have now expanded across languages and modalities. However, a major gap lies in the availability of multimodal data for non-English languages. Therefore, most multimodal models might not have coverage for non-English languages at all or might just heavily rely on translations of the associated text in English-centric datasets so as to support multiple languages. The drawback of this approach is that it often misses the cultural nuances of local languages. And another reason why this is not optimal is the images are mostly Western-centric [and] therefore would not be well reflective of the local culture of a lot of regions. So this kind of bias can skew these models towards a Western perspective, raising concerns about inclusivity and safety of the content which they generate when serving a global population, which involves multicultural and multilingual users. Therefore, for a truly inclusive AI ecosystem, models must demonstrate cultural understanding to ensure that the generated content is safe and respectful for diverse communities. Evaluating cultural awareness, though, is extremely challenging because how to define culture itself is an unsolved problem. However, in this work, we are trying to take a step towards having a proxy which could measure cultural understanding.

HUIZINGA: Well, talk about how you did this. What methodology did you use for this paper, and what were your major findings?

CHITALE: Now that we have defined our broader problem, it is important to decide the scope of our solution because, as we discussed, culture is an umbrella term. So we need to define a smaller scope for this problem. We chose visual question answering, which is a multimodal task, and it is one of the most critical multimodal tasks for the scope of this work. So recognizing the limitations of existing VQA benchmarks, which often rely on translations and lack cultural representation, we developed CVQA, the Culturally-diverse multilingual VQA benchmark. CVQA spans 30 countries, 31 languages, and has over 10,000 culturally nuanced questions, which were crafted by native speakers and cultural experts. So our focus was on creating questions which required what we term cultural common sense to answer. For instance, with just the image, it is not possible to answer the question. You need some awareness about the local culture to be able to answer the question. So these questions draw inspiration from knowledge of local culture. So one important aspect of this dataset is that we include both local language as well as English variants of the same question to allow robust testing of models across linguistic contexts. I would say the crux of this effort is that while most of the prior efforts may be small in terms of language—language-group specific or country specific for most—we wanted this to be a much larger global-scale collaborative effort. So this covers 31 languages across 30 countries. So to build CVQA, we worked with qualified volunteers from diverse age groups and genders, ensuring that the questions authentically represented their cultures. The images which were collected were ensured to be copyright free, grounded in culture, and safe for work, with strict guidelines to ensure that we avoid images which reflect stereotypes or privacy violations. And we also had 10 categories, which involved topics ranging from daily life, sports, and cuisine to the history of the region, so a holistic view of the culture of the region. So each question was crafted as a multiple-choice task with challenging answer options which required both the image as well as cultural knowledge to solve. We also employed a maker-checker approach to ensure quality and consistency.
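For readers who want a concrete picture of the setup Chitale describes, here is a minimal sketch of how a culturally grounded multiple-choice VQA item might be represented and scored. The field names, the model_answer callable, and the overall structure are hypothetical illustrations and do not reflect the actual CVQA schema or evaluation harness.

# Hypothetical shape of a CVQA-style multiple-choice item and a simple
# accuracy computation. Field names and model_answer() are placeholders,
# not the released CVQA format.

from dataclasses import dataclass

@dataclass
class VQAItem:
    image_path: str
    question_local: str    # question written in the local language
    question_english: str  # English variant of the same question
    options: list[str]     # multiple-choice answer options
    answer_index: int      # index of the correct option

def accuracy(items, model_answer, use_local=True):
    """model_answer(image_path, question, options) -> index of the chosen option."""
    correct = 0
    for item in items:
        question = item.question_local if use_local else item.question_english
        if model_answer(item.image_path, question, item.options) == item.answer_index:
            correct += 1
    return correct / len(items)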

HUIZINGA: So you’ve created the benchmark. You’ve tested it. What were your major findings?

CHITALE: Now that we have created a benchmark, the next step is to evaluate how these multimodal models are performing on this benchmark. So we benchmark several state-of-the-art multimodal models, which include both open-source offerings like CLIP, BLIP, LLaVA-1.5, and proprietary offerings like GPT-4o or Gemini 1.5 Flash. So what we observed is there is a huge gap in performance when we compare these proprietary offerings with the open-source models. So GPT-4o was the highest-performing model with 75.4% accuracy on English prompts and 74.3% accuracy on local prompts. However, the story is completely different when we go to open-source models. These open-source models significantly lag behind the proprietary models. And one key finding for these open-source models is that these models perform even worse when prompted in the native language compared to prompting in English. This potentially highlights that these models lack multilingual understanding capabilities, which may be because multilingual training data is pretty scarce.

HUIZINGA: Yeah.

CHITALE: So LLaVA-1.5 turned out to be the best open-source model. One thing to note is that LLaVA-1.5 performs well across a large set of English VQA benchmarks, but when it comes to cultural understanding, it is a pretty weak model. Further, we also did some ablations to understand whether adding location-specific information to the textual prompts has an impact, but we identified that it does not result in any significant performance improvements. Further, we also conducted a category-wise analysis. So, as we had mentioned, there are 10 categories to which these images belong. What we observed is that certain categories, like people and everyday life, consistently saw higher accuracy across a large set of models. This is likely due to the abundance of human activity data in training datasets. However, when it comes to niche categories like cooking and food or pop culture, which are much more challenging, especially in local languages, these models struggle. Therefore, these are the kinds of highly diverse cultural contexts which need improvement.

HUIZINGA: How’s this work going to make an impact outside the lab and in the real world?

CHITALE: CVQA is significant because it addresses a fundamental gap in how we evaluate vision-language and multimodal models today. While proprietary models are making impressive strides, open-source models, which are more accessible and easier to deploy, significantly lag behind in terms of cultural awareness and safety. CVQA provides a much-needed benchmark to help us identify these gaps in the first place. To fix the gaps, we first need to identify them, and whether we are progressing or not can be captured by this benchmark. So for the real world, this benchmark does have some far-reaching implications. Models which understand culture are not just technically better; they would create interactions which are far more engaging, natural, and safe for users from diverse backgrounds. So this benchmark offers an entirely new axis for improvement: cultural awareness and linguistic diversity. Therefore, by improving a model’s ability to handle culturally nuanced questions, CVQA ensures researchers and developers think beyond accuracy and also focus on cultural awareness and inclusivity before shipping these models into production.

HUIZINGA: Pranjal, what are the unanswered questions or unsolved problems in this field, and what do you plan to do about it?

CHITALE: So while CVQA makes some strides in addressing cultural and linguistic diversity, there is still much more to explore in this space. So this dataset only covers 31 languages and cultures, but this is just, like, a subset of the incredible diversity that exists globally. Many languages and cultures remain underrepresented, especially some of them are endangered or have limited digital resources. So expanding CVQA to include more of these languages would be a natural next step. Secondly, CVQA just focuses on single-turn question-answer pairs. But in reality, human interaction is often multi-turn and conversational in nature. So a multi-turn version of CVQA could better simulate real-world use cases and challenge models to maintain cultural and contextual awareness over extended dialogues. Another interesting area is personalization. So it would be very interesting if we could teach models to adapt to a user’s cultural background, preferences, or even regional nuances in real time. This remains a significant challenge, although this benchmark could help us move a step towards our broader goal.

[MUSIC]

HUIZINGA: Well, Pranjal Chitale, this is super important research, and thank you for joining us today. To our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find it at aka.ms/abstracts. You can also find it on arXiv and on the NeurIPS website. And if you’re at NeurIPS, you can also go hear about it. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Pranjal Chitale appeared first on Microsoft Research.

Read More

Ideas: Economics and computation with Nicole Immorlica

Ideas: Economics and computation with Nicole Immorlica

Line illustration of Nicole Immorlica

Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets.

In this episode, host Gretchen Huizinga talks with Senior Principal Research Manager Nicole Immorlica. As Immorlica describes it, when she and others decided to take a computational approach to pushing the boundaries of economic theory, there weren’t many computer scientists doing research in economics. Since then, contributions such as applying approximation algorithms to the classic economic challenge of pricing and work on the stable marriage problem have earned Immorlica numerous honors, including the 2023 Test of Time Award from the ACM Special Interest Group on Economics and Computation and selection as a 2023 Association for Computing Machinery (ACM) Fellow. Immorlica traces the journey back to a graduate market design course and a realization that captivated her: she could use her love of math to help improve the world through systems that empower individuals to make the best decisions possible for themselves.

Transcript

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE]

NICOLE IMMORLICA: So honestly, when generative AI came out, I had a bit of a moment, a like crisis of confidence, so to speak, in the value of theory in my own work. And I decided to dive into a data-driven project, which was not my background at all. As a complete newbie, I was quite shocked by what I found, which is probably common knowledge among experts: data is very messy and very noisy, and it’s very hard to get any signal out of it. Theory is an essential counterpart to any data-driven research. It provides a guiding light. But even more importantly, theory allows us to illuminate things that have not even happened. So with models, we can hypothesize about possible futures and use that to shape what direction we take.

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.

[MUSIC FADES]

My guest on this episode is Nicole Immorlica, a senior principal research manager at Microsoft Research New England, where she leads the Economics and Computation Group. Considered by many to be an expert on social networks, matching markets, and mechanism design, Nicole has a long list of accomplishments and honors to her name and some pretty cool new research besides. Nicole Immorlica, I’m excited to get into all the things with you today. Welcome to Ideas.


NICOLE IMMORLICA: Thank you. 

HUIZINGA: So before we get into specifics on the big ideas behind your work, let’s find out a little bit about how and why you started doing it. Tell us your research origin story and, if there was one, what big idea or animating “what if” inspired young Nicole and launched a career in theoretical economics and computation research? 

IMMORLICA: So I took a rather circuitous route to my current research path. In high school, I thought I actually wanted to study physics, specifically cosmology, because I was super curious about the origins and evolution of the universe. In college, I realized on a day-to-day basis, what I really enjoyed was the math underlying physics, in particular proving theorems. So I changed my major to computer science, which was the closest thing to math that seemed to have a promising career path. [LAUGHTER] But when graduation came, I just wasn’t ready to be a grownup and enter the workforce! So I defaulted to graduate school thinking I’d continue my studies in theoretical computer science. It was in graduate school where I found my passion for the intersection of CS theory and micro-economics. I was just really enthralled with this idea that I could use the math that I so love to understand society and to help shape it in ways that improve the world for everyone in it. 

HUIZINGA: I’ve yet to meet an accomplished researcher who didn’t have at least one inspirational “who” behind the “what.” So tell us about the influential people in your life. Who are your heroes, economic or otherwise, and how did their ideas inspire yours and even inform your career? 

IMMORLICA: Yeah, of course. So when I was a graduate student at MIT, you know, I was happily enjoying my math, and just on a whim, I decided to take a course, along with a bunch of my other MIT graduate students, at Harvard from Professor Al Roth. And this was a market design course. We didn’t even really know what market design was, but in the context of that course, Al himself and the course content just demonstrated to me the transformative power of algorithms and economics. So, I mean, you might have heard of Al. He eventually won a Nobel Prize in economics for his work using a famous matching algorithm to optimize markets for doctors and separately for kidney exchange programs. And I thought to myself, wow, this is such meaningful work. This is something that I want to do, something I can contribute to the world, you know, something that my skill set is well adapted to. And so I just decided to move on with that, and I’ve never really looked back. It’s so satisfying to do something that’s both … I like both the means and I care very deeply about the ends. 

HUIZINGA: So, Nicole, you mentioned you took a course from Al Roth. Did he become anything more to you than that one sort of inspirational teacher? Did you have any interaction with him? And were there any other professors, authors, or people that inspired you in the coursework and graduate studies side of things? 

IMMORLICA: Yeah, I mean, Al has been transformative for my whole career. Like, I first met him in the context of that course, but I, and many of the graduate students in my area, have continued to work with him, speak to him at conferences, be influenced by him, so he’s been there throughout my career for me. 

HUIZINGA: Right, right, right … 

IMMORLICA: In terms of other inspirations, I’ve really admired throughout my career … this is maybe more structurally how different individuals operate their careers. So, for example, Jennifer Chayes, who was the leader of the Microsoft Research lab that I joined … 

HUIZINGA: Yeah! 

IMMORLICA: … and nowadays Sue Dumais. Various other classic figures like Éva Tardos. Like, all of these are incredibly strong, driven women that have a vision of research, which has been transformative in their individual fields but also care very deeply about the community and the larger context than just themselves and creating spaces for people to really flourish. And I really admire that, as well. 

HUIZINGA: Yeah, I’ve had both Sue and Jennifer on the show before, and they are amazing. Absolutely. Well, listen, Nicole, as an English major, I was thrilled—and a little surprised—to hear that literature has influenced your work in economics. I did not have that on my bingo card. Tell us about your interactions with literature and how they broadened your vision of optimization and economic models.

IMMORLICA: Oh, I read a lot, especially fiction. And I care very deeply about being a broad human being, like, with a lot of different facets. And so I seek inspiration not just from my fellow economists and computer scientists but also from artists and writers. One specific example would be Walt Whitman. So I took up this poetry class as an MIT alumni, Walt Whitman, and we, in the context of that course, of course, read his famous poem “Song of Myself.” And I remember one specific verse just really struck me, where he writes, “Do I contradict myself? Very well then I contradict myself, (I am large, I contain multitudes.)” And this just was so powerful because, you know, in traditional economic models, we assume that individuals seek to optimize a single objective function, which we call their utility, but what Whitman is pointing out is that we actually have many different objective functions, which can even conflict with one another, and some at times are more salient than others, and they arise from my many identities as a member of my family, as an American, as you know, a computer scientist, as an economist, and maybe we should actually try to think a little bit more seriously about these multiple identities in the context of our modeling. 

HUIZINGA: That just warms my English major heart … [LAUGHS] 

IMMORLICA: I’m glad! [LAUGHS] 

HUIZINGA: Oh my gosh. And it’s so interesting because, yeah, we always think of, sort of, singular optimization. And so it’s like, how do we expand our horizon on that sort of optimization vision? So I love that. Well, you’ve received what I can only call a flurry of honors and awards last year. Most recently, you were named an ACM Fellow—ACM being Association for Computing Machinery, for those who don’t know—which acknowledges people who bring, and I quote, “transformative contributions to computing science and technology.” Now your citation is for, and I quote again, “contributions to economics and computation, including market design, auctions, and social networks.” That’s a mouthful, but if we’re talking about transformative contributions, how were things different before you brought your ideas to this field, and how were your contributions transformative or groundbreaking? 

IMMORLICA: Yeah, so it’s actually a relatively new thing for computer scientists to study economics, and I was among the first cohort to do so seriously. So before our time, economists mostly focused on finding optimal solutions to the problems they posed without regard for the computational or informational requirements therein. But computer scientists have an extensive toolkit to manage such complexities. So, for example, in a paper on pricing, which is a classic economic problem—how do we set up prices for goods in a store?—my coauthors and I used the computer science notion of approximation to show that a very simple menu of prices generates almost optimal revenue for the seller. And prior to this work, economists only knew how to characterize optimal but infinitely large and thereby impractical menus of prices. So this is an example of the kind of work that I and other computer scientists do that can really transform economics. 

HUIZINGA: Right. Well, in addition to the ACM fellowship, another honor you received from ACM in 2023 was the Test of Time Award, where the Special Interest Group on Economics and Computation, or SIGecom, recognizes influential papers published between 10 and 25 years ago that significantly impacted research or applications in economics and computation. Now you got this award for a paper you cowrote in 2005 called “Marriage, Honesty, and Stability.” Clearly, I’m not an economist because I thought this was about how to avoid getting a divorce, but actually, it’s about a well-known and very difficult problem called the stable marriage problem. Tell us about this problem and the paper and why, as the award states, it’s stood the test of time. 

IMMORLICA: Sure. You’re not the only one to have misinterpreted the title. [LAUGHTER] I remember I gave a talk once and someone came and when they left the talk, they said, I did not think that this was about math! But, you know, math, as I learned, is about life, and the stable marriage problem has, you know, interpretation about marriage and divorce. In particular, the problem asks, how can we match market participants to one another such that no pair prefer each other to their assigned match? So to relate this to the somewhat outdated application of marriage markets, the market participants could be men and women, and the stable marriage problem asks if there is a set of marriages such that no pair of couples seeks a divorce in order to marry each other. And so, you know, that’s not really a problem we solve in real life, but there’s a lot of modern applications of this problem. For example, assigning medical students to hospitals for their residencies, or if you have children, many cities in the United States and around the world use this stable marriage problem to think about the assignment of K-to-12 students to public schools. And so in these applications, the stability property has been shown to contribute to the longevity of the market. And in the 1960s, David Gale and Lloyd Shapley proved, via an algorithm, interestingly, that stable matches exist! Well, in fact, there can be exponentially many stable matches. And so this leads to a very important question for people that want to apply this theory to practice, which is, which stable match should they select among the many ones that exist, and what algorithm should they use to select it? So our work shows that under very natural conditions, namely that preference lists are short and sufficiently random, it doesn’t matter. Most participants have a unique stable match. And so, you know, you can just design your market without worrying too much about what algorithm you use or which match you select because for most people it doesn’t matter. And since our paper, many researchers have followed up on our work studying conditions under which matchings are essentially unique and thereby influencing policy recommendations. 
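For readers who want to see the mechanics behind the result Immorlica describes, here is a minimal sketch of the Gale-Shapley deferred-acceptance algorithm, which always produces a stable matching. The participants and preference lists below are made up for illustration; this is the textbook algorithm, not the analysis from the awarded paper.

# Gale-Shapley deferred acceptance: proposers propose in order of preference,
# receivers tentatively hold their best offer so far. Illustrative sketch only.

def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching as a dict {proposer: receiver}."""
    # rank[r][p] = position of proposer p on receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)} for r, prefs in receiver_prefs.items()}
    free = list(proposer_prefs)                    # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}   # next receiver each proposer will try
    match = {}                                     # receiver -> proposer (tentative)

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in match:
            match[r] = p                           # receiver was unmatched: accept tentatively
        elif rank[r][p] < rank[r][match[r]]:
            free.append(match[r])                  # receiver trades up; old partner is free again
            match[r] = p
        else:
            free.append(p)                         # receiver rejects p; p will propose again later

    return {p: r for r, p in match.items()}

# A tiny hypothetical instance with three proposers and three receivers.
proposer_prefs = {"a": ["x", "y", "z"], "b": ["y", "x", "z"], "c": ["x", "z", "y"]}
receiver_prefs = {"x": ["b", "a", "c"], "y": ["a", "b", "c"], "z": ["c", "b", "a"]}
print(gale_shapley(proposer_prefs, receiver_prefs))   # {'a': 'x', 'b': 'y', 'c': 'z'}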

HUIZINGA: Hmm. So this work was clearly focused on the economics side of things like markets. So this seems to have wide application outside of economics. Is that accurate? 

IMMORLICA: Well, it depends how you define economics, so I would … 

HUIZINGA: I suppose! [LAUGHTER] 

IMMORLICA: I define economics as the problem … I mean, Al Roth, for example, wrote a book whose title was Who Gets What—and Why. 

HUIZINGA: Ooh.

IMMORLICA: So economics is all about, how do we allocate stuff? How do we allocate scarce resources? And many economic problems are not about spending money. It’s about how do we create outcomes in the world. 

HUIZINGA: Yeah. 

IMMORLICA: And so I would say all of these problem domains are economics. 

HUIZINGA: Well, finally, as regards the “flurry” of honors, besides being named an ACM Fellow and also this Test of Time Award, you were also named an Economic Theory Fellow by the Society for [the] Advancement of Economic Theory, or SAET. And the primary qualification here was to have “substantially or creatively advanced theoretical economics.” So what were the big challenges you tackled, and what big ideas did you contribute to advance economic theory? 

IMMORLICA: So as we’ve discussed, I and others with my background have done a lot to advance economic theory through the lens of computational thinking. 

HUIZINGA: Mmm … 

IMMORLICA: We’ve introduced ideas such as approximation, which we discussed earlier, or machine learning to economic models and proposing them as solution concepts. We’ve also used computer science tools to solve problems within these models. So two examples from my own work include randomized algorithm analysis and stochastic gradient descent. And importantly, we’ve introduced very relevant new settings to the field of economics. So, you know, I’ve worked hard on large-scale auction design and associated auto-bidding algorithms, for instance, which are a primary source of revenue for tech companies these days. I’ve thought a lot about how data enters into markets and how we should think about data in the context of market design. And lately, I’ve spent a lot of time thinking about generative AI and its impact in the economy at both the micro and macro levels. 

HUIZINGA: Yeah. Let’s take a detour for a minute and get into the philosophical weeds on this idea of theory. And I want to cite an article that was written way back in 2008 by the editor of Wired magazine at the time, Chris Anderson. He wrote an article titled “The End of Theory,” which was provocative in itself. And he began by quoting the British statistician George Box, who famously said, “All models are wrong, but some are useful.” And then he argued that in an era of massively abundant data, companies didn’t have to settle for wrong models. And then he went even further and attacked the very idea of theory and, citing Google, he said, “Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity.” So, Nicole, from your perch, 15 years later, in the age of generative AI, what did Chris Anderson get right, and what did he get wrong? 

IMMORLICA: So, honestly, when generative AI came out, I had a bit of a moment, a like crisis of confidence, so to speak, in the value of theory in my own work. 

HUIZINGA: Really! 

IMMORLICA: And I decided to dive into a data-driven project, which was not my background at all. As a complete newbie, I was quite shocked by what I found, which is probably common knowledge among experts: data is very messy and very noisy, and it’s very hard to get any signal out of it. Theory is an essential counterpart to any data-driven research. It provides a guiding light. But even more importantly, theory allows us to illuminate things that have not even happened. So with models, we can hypothesize about possible futures and use that to shape what direction we take. Relatedly, what I think that article got most wrong was the statement that correlation supersedes causation, which is actually how the article closes, this idea that causation is dead or dying. I think causation will never become irrelevant. Causation is what allows us to reason about counterfactuals. It’s fundamentally irreplaceable. It’s like, you know, data, you can only see data about things that happened. You can’t see data about things that could happen but haven’t or, you know, about alternative futures. 

HUIZINGA: Interesting. 

IMMORLICA: And that’s what theory gives you. 

HUIZINGA: Yeah. Well, let’s continue on that a little bit because this show is yet another part of our short “series within a series” featuring some of the work going on in the AI, Cognition, and the Economy initiative at Microsoft Research. And I just did an episode with Brendan Lucier and Mert Demirer on the micro- and macro-economic impact of generative AI. And you were part of that project, but another fascinating project you’re involved in right now looks at the impact of generative AI on what you call the “content ecosystem.” So what’s the problem behind this research, and what unique incentive challenges are content creators facing in light of large language and multimodal AI models? 

IMMORLICA: Yeah, so this is a project with Brendan, as well, whom you interviewed previously, and also Nageeb Ali, an economist and AICE Fellow at Penn State, and Meena Jagadeesan, who was my intern from Microsoft Research from UC Berkeley. So when you think about content or really any consumption good, there’s often a whole supply chain that produces it. For music, for example, there’s the composition of the song, the recording, the mixing, and finally the delivery to the consumer. And all of these steps involve multiple humans producing things, generating things, getting paid along the way. One way to think about generative AI is that it allows the consumer to bypass this supply chain and just generate the content directly. 

HUIZINGA: Right … 

IMMORLICA: So, for example, like, I could ask a model, an AI model, to compose and play a song about my cat named Whiskey. [LAUGHTER] And it would do a decent job of it, and it would tailor the song to my specific situation. But there are drawbacks, as well. One thing many researchers fear is that AI needs human-generated content to train. And so if people start bypassing the supply chain and just using AI-generated content, there won’t be any content for AI to train on and AI will cease to improve.

HUIZINGA: Right. 

IMMORLICA: Another thing that could be troubling is that there are economies of scale. So there is a nontrivial cost to producing music, even for AI, and if we share that cost among many listeners, it becomes more affordable. But if we each access the content ourselves, it’s going to impose a large per-song cost. And then finally, and this is, I think, most salient to most people, there’s some kind of social benefit to having songs that everyone listens to. It provides a common ground for understanding. It’s a pillar of our culture, right. And so if we bypass that, aren’t we losing something? So for all of these reasons, it becomes very important to understand the market conditions under which people will choose to bypass supply chains and the associated costs and benefits of this. What we show in this work, which is very much work in progress, is that when AI is very costly, neither producers nor consumers will use it, but as it gets cheaper, at first, it actually helps content producers that can leverage it to augment their own ability, creating higher-quality content, more personalized content more cheaply. But then, as the AI gets super cheap, this bypassing behavior starts to emerge, and the content creators are driven out of the market. 

HUIZINGA: Right. So what do we do about that? 

IMMORLICA: Well, you know, you have to take a stance on whether that’s even a good thing or a bad thing, … 

HUIZINGA: Right! 

IMMORLICA: … so it could be that we do nothing about it. We could also impose a sort of minimum wage on AI, if you like, to artificially inflate its costs. We could try to amplify the parts of the system that lead towards more human-generated content, like this sociability, the fact that we all are listening to the same stuff. We could try to make that more salient for people. But, you know, generally speaking, I’m not really in a place to take a stance on whether this is a good thing or a bad thing. I think this is for policymakers. 

HUIZINGA: It feels like we’re at an inflection point. I’m really interested to see what your research in this arena, the content ecosystem, brings. You know, I’ll mention, too, recently I read a blog written by Yoshua Bengio and Vincent Conitzer, and they acknowledged that the image that they used at the top had been created by an AI bot. And then they said they made a donation to an art museum to say, we’re giving something back to the artistic community that we may have used. Where do you see this, you know, #NoLLM situation coming in this content ecosystem market? 

IMMORLICA: Yeah, that’s a very interesting move on their part. I know Vince quite well, actually. I’m not sure that artists of the sort of “art museum nature” suffer, so … 

HUIZINGA: Right? [LAUGHS] 

IMMORLICA: One of my favorite artists is Laurie Anderson. I don’t know if you’ve seen her work at all … 

HUIZINGA: Yeah, I have, yeah. 

IMMORLICA: … but she has a piece in the MASS MoCA right now, which is just brilliant, where she actually uses generative AI to create a sequence of images that creates an alternate story about her family history. And it’s just really, really cool. I’m more worried about people who are doing art vocationally, and I think, and maybe you heard some of this from Mert and Brendan, like what’s going to happen is that careers are going to shift and different vocations will become more salient, and we’ve seen this through every technological revolution. People shift their work towards the things that are uniquely human that we can provide and if generating an image at the top of a blog is not one of them, you know, so be it. People will do something else. 

HUIZINGA: Right, right, right. Yeah, I just … we’re on the cusp, and there’s a lot of things that are going to happen in the next couple of years, maybe a couple of months, who knows? [LAUGHTER] Well, we hear a lot of dystopian fears—some of them we’ve just referred to—around AI and its impact on humanity, but those fears are often dismissed by tech optimists as what I might call “unwishful thinking.” So your research interests involve the design and use of sociotechnical systems to quote, “explain, predict, and shape behavioral patterns in various online and offline systems, markets, and games.” Now I’m with you on the “explain and predict” but when we get to shaping behavioral patterns, I wonder how we tease out the bad from the good. So, in light of the power of these sociotechnical systems, what could possibly go wrong, Nicole, if in fact you got everything right? 

IMMORLICA: Yeah, first I should clarify something. When I say I’m interested in shaping behavioral patterns, I don’t mean that I want to impose particular behaviors on people but rather that I want to design systems that expose to people relevant information and possible actions so that they have the power to shape their own behavior to achieve their own goals. And if we’re able to do that, and do it really well, then things can only really go wrong if you believe people aren’t good at making themselves happy. I mean, there’s certainly evidence of this, like the field of behavioral economics, to which I have contributed some, tries to understand how and when people make mistakes in their behavioral choices. And it proposes ways to help people mitigate these mistakes. But I caution us from going too far in this direction because at the end of the day, I believe people know things about themselves that no external authority can know. And you don’t want to impose constraints that prevent people from acting on that information. 

HUIZINGA: Yeah. 

IMMORLICA: Another issue here is, of course, externalities. It could be that my behavior makes me happy but makes you unhappy. [LAUGHTER] So another thing that can go wrong is that we, as designers of technology, fail to capture these underlying externalities. I mean, ideally, like an economist would say, well, you should pay with your own happiness for any negative externality you impose on others. And the fields of market and mechanism design have identified very beautiful ways of making this happen automatically in simple settings, such as the famous Vickrey auction. But getting this right in the complex sociotechnical systems of our day is quite a challenge. 

HUIZINGA: OK, go back to that auction. What did you call it? The Vickrey auction? 

IMMORLICA: Yeah, so Vickrey was an economist, and he proposed an auction format that … so an auction is trying to find a way to allocate goods, let’s say, to bidders such that the bidders that value the goods the most are the ones that win them. 

HUIZINGA: Hm. 

IMMORLICA: But of course, these bidders are imposing a negative externality on the people who lose, right? [LAUGHTER] And so what Vickrey showed is that a well-designed system of prices can compensate the losers exactly for the externality that is imposed on them. A very simple example of a Vickrey auction is if you’re selling just one good, like a painting, then what you should do, according to Vickrey, is solicit bids, give it to the highest bidder, and charge them the second-highest price. 
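As a concrete illustration of the single-item rule just described, here is a minimal sketch of a second-price (Vickrey) auction; the bidders and bid values are hypothetical.

# Second-price (Vickrey) auction for a single item: the highest bidder wins
# but pays the second-highest bid. Illustrative sketch with made-up bids.

def vickrey_auction(bids):
    """bids: dict of bidder -> bid amount. Returns (winner, price_paid)."""
    if len(bids) < 2:
        raise ValueError("need at least two bidders to have a second price")
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]          # winner pays the second-highest bid
    return winner, price

print(vickrey_auction({"alice": 120, "bob": 95, "carol": 80}))  # ('alice', 95)

Charging the second-highest bid is what makes truthful bidding attractive: changing your bid never changes the price you pay when you win, only whether you win at all.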

HUIZINGA: Interesting … 

IMMORLICA: And so … that’s going to have good outcomes for society. 

HUIZINGA: Yeah, yeah. I want to expand on a couple of thoughts here. One is as you started out to answer this question, you said, well, I’m not interested in shaping behaviors in terms of making you do what I want you to do. But maybe someone else is. What happens if it falls into the wrong hands? 

IMMORLICA: Yeah, I mean, there’s definitely competing interests. Everybody has their own objectives, and … 

HUIZINGA: Sure, sure. 

IMMORLICA: … I might be very fundamentally opposed to some of them, but everybody’s trying to optimize something, and there are competing optimization objectives. And so what’s going to happen if people are leveraging this technology to optimize for themselves and thereby harming me a lot? 

HUIZINGA: Right? 

IMMORLICA: Ideally, we’ll have regulation to kind of cover that. I think what I’m more worried about is the idea that the technology itself might not be aligned with me, right. Like at the end of the day, there are companies that are producing this technology that I’m then using to achieve my objectives, but the company’s objectives, the creators of the technology, might not be completely aligned with the person’s objectives. And so I’ve looked a little bit in my research about how this potential misalignment might result in outcomes that are not all that great for either party. 

HUIZINGA: Wow. Is that stuff that’s in the works? 

IMMORLICA: We have a few published papers on the area. I don’t know if you want me to get into them. 

HUIZINGA: No, actually, what we’ll probably do is put some in the show notes. We’ll link people to those papers because I think that’s an interesting topic. Listen, most research is incremental in nature, where the ideas are basically iterative steps on existing work. But sometimes there are out-of-the-box ideas that feel like bigger swings or even outrageous, and Microsoft is well known for making room for these. Have you had an idea that felt outrageous, any idea that felt outrageous, or is there anything that you might even consider outrageous now that you’re currently working on or even thinking about? 

IMMORLICA: Yeah, well, I mean, this whole moment in history feels outrageous, honestly! [LAUGHTER] It’s like I’m kind of living in the sci-fi novels of my youth. 

HUIZINGA: Right? 

IMMORLICA: So together with my economics and social science colleagues at Microsoft Research, one thing that we’re really trying to think through is this outrageous idea of agentic AI.

HUIZINGA: Mmm … 

IMMORLICA: That is, every single individual and business can have their own AI that acts like their own personal butler that knows them intimately and can take actions on their behalf. In such a world, what will become of the internet, social media, platforms like Amazon, Spotify, Uber? On the one hand, you know, maybe this is good because these individual agentic AIs can just bypass all of these kinds of intermediaries. For example, if I have a busy day of back-to-back meetings at work, my personal AI can notice that I have no time for lunch, contact the AI of some restaurant to order a sandwich for me, make sure that sandwich is tailored to my dietary needs and preferences, and then contact the AI of a delivery service to make sure that sandwich is sitting on my desk when I walk into my noon meeting, right. 

HUIZINGA: Right … 

IMMORLICA: And this is a huge disruption to how things currently work. It’s shifting the power away from centralized platforms, back to individuals and giving them the agency over their data and the power to leverage it to fulfill their needs. So the, sort of, big questions that we’re thinking about right now is, how will such decentralized markets work? How will they be monetized? Will it be a better world than the one we live in now, or are we losing something? And if it is a better world, how can we get from here to there? And if it’s a worse world, how can we steer the ship in the other direction, you know? 

HUIZINGA: Right. 

IMMORLICA: These are all very important questions in this time. 

HUIZINGA: Does this feel like it’s imminent? 

IMMORLICA: I do think it’s imminent. And I think, you know, in life, you can, kind of, decide whether to embrace the good or embrace the bad, see the glass as half-full or half-empty, and … 

HUIZINGA: Yeah. 

IMMORLICA: … I am hoping that society will see the half-full side of these amazing technologies and leverage them to do really great things in the world. 

HUIZINGA: Man, I would love to talk to you for another hour, but we have to close things up. To close this show, I want to do something new with you, a sort of lightning round of short questions with short answers that give us a little window into your life. So are you ready? 

IMMORLICA: Yup! 

HUIZINGA: OK. First one, what are you reading right now for work? 

IMMORLICA: Lots of papers of my students that are on the job market to help prepare recommendation letters. It’s actually very inspiring to see the creativity of the younger generation. In terms of books, I’m reading The Idea Factory, which is about the creation of Bell Labs.

HUIZINGA: Ooh! Interesting! 

IMMORLICA: You might be interested in it actually. It actually talks about the value of theory and understanding the fundamentals of a problem space and the sort of business value of that, so it’s very intriguing. 

HUIZINGA: OK, second question. What are you reading for pleasure? 

IMMORLICA: The book on my nightstand right now is the Epic of Gilgamesh, the graphic novel version. I’m actually quite enthralled by graphic novels ever since I first encountered Maus by Art Spiegelman in the ’90s. But my favorite reading leans towards magic realism, so like Gabriel García Márquez, Italo Calvino, Isabel Allende, and the like. I try to read nonfiction for pleasure, too, but I generally find life is a bit too short for that genre! [LAUGHTER] 

HUIZINGA: Well, and I made an assumption that what you were reading for work wasn’t pleasurable, but um, moving on, question number three, what app doesn’t exist but should? 

IMMORLICA: Teleportation. 

HUIZINGA: Ooh, fascinating. What app exists but shouldn’t? 

IMMORLICA: That’s much harder for me. I think all apps within legal bounds should be allowed to exist and the free market should decide which ones survive. Should there be more regulation of apps? Perhaps. But more at the level of giving people tools to manage their consumption at their own discretion and not outlawing specific apps; that just feels too paternalistic to me. 

HUIZINGA: Interesting. OK, next question. What’s one thing that used to be very important to you but isn’t so much anymore? 

IMMORLICA: Freedom. So by that I mean the freedom to do whatever I want, whenever I want, with whomever I want. This feeling that I could go anywhere at any time without any preparation, that I could be the Paul Erdős of the 21st century, traveling from city to city, living out of a suitcase, doing beautiful math just for the art of it. This feeling that I have no responsibilities. Like, I really bought into that in my 20s. 

HUIZINGA: And not so much now? 

IMMORLICA: No. 

HUIZINGA: OK, so what’s one thing that wasn’t very important to you but is now? 

IMMORLICA: Now, as Janis Joplin sang, “Freedom is just another word for nothing left to lose.” [LAUGHTER] And so now it’s important to me to have things to lose—roots, family, friends, pets. I think this is really what gives my life meaning. 

HUIZINGA: Yeah, having Janis Joplin cited in this podcast wasn’t on my bingo card either, but that’s great. Well, finally, Nicole, I want to ask you this question based on something we talked about before. Our audience doesn’t know it, but I think it’s funny. What do Norah Jones and oatmeal have in common for you? 

IMMORLICA: Yeah, so I use these in conversation as examples of comfort and nostalgia in the categories of music and food because I think they’re well-known examples. But for me personally, comfort is the Brahms Cello Sonata in E Minor, which was in fact my high school cello performance piece, and nostalgia is spaghetti with homemade marinara sauce, either my boyfriend’s version or, in my childhood, my Italian grandma’s version. 

HUIZINGA: Man! Poetry, art, cooking, music … who would have expected all of these to come into an economist/computer scientist podcast on the Microsoft Research Podcast. Nicole Immorlica, how fun to have you on the show! Thanks for joining us today on Ideas.

IMMORLICA: Thank you for having me. 

[MUSIC] 

The post Ideas: Economics and computation with Nicole Immorlica appeared first on Microsoft Research.

Read More

Research Focus: Week of December 2, 2024

Research Focus: Week of December 2, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus: Week of December 2, 2024

Adaptive Security, Erasures, and Network Assumptions in Communication-Local MPC

n-party Multi-Party Computation (MPC) is a cryptographic protocol technique that allows separate parties to securely compute a function on their joint data while keeping their inputs private. To build such a protocol, most works require all pairs of participating parties to be able to securely and reliably communicate with each other. Recently, the problem of Communication-Local (CL) MPC has been explored, in which this assumption is modeled more realistically by only requiring that participating parties can securely and reliably communicate with a few other participating parties (as, for example, in networks like blockchains). However, few solutions exist that guarantee adaptive security—resilience to dynamic corruption of parties—and most rely on strong assumptions about party actions.

In a recent paper: Adaptive Security, Erasures, and Network Assumptions in Communication-Local MPC, researchers from Microsoft and external collaborators revisit assumptions made in earlier work. The authors conclude that for secure, adaptive CL-MPC, some previously assumed capabilities (like secure erasure and multisend) can be bypassed under certain conditions; however, fully reducing all-to-all to all-to-one communication remains unachievable in CL settings without some minimal assumptions. They propose a new SOS-RMT protocol, enabling more efficient CL-MPC under specific feasibility bounds and additional cryptographic assumptions.


Cuttlefish: A Fair, Predictable Execution Environment for Cloud-hosted Financial Exchanges

Low-latency algorithmic trading is driving efficiency in modern financial markets by promoting accurate/timely pricing of securities, higher liquidity, and lower trade costs for investors. The goal is to process incoming market data and issue trades as quickly as possible to take advantage of ephemeral market-making and arbitrage opportunities. Interest in cloud-hosted financial exchanges is growing, as they promise a cost-effective platform more accessible to market participants, among other benefits.

Unfortunately, one of the major roadblocks in cloud environments is ensuring that all participants receive equal network and compute conditions despite unpredictable network latencies and non-deterministic computation times.

In a recent preprint: Cuttlefish: A Fair, Predictable Execution Environment for Cloud-hosted Financial Exchanges, researchers from Microsoft and external collaborators present a fair-by-design algorithmic trading platform that can run in cloud environments. Cuttlefish applies an efficient and robust mapping of real operations to a novel formulation of ‘virtual time’. This allows Cuttlefish to push fairness to the extreme, regardless of the underlying network communication and computation hardware. The researchers’ implementation and evaluation validate the practicality of Cuttlefish and show its operational efficiency on public cloud platforms. This paper builds on previous work: Rethinking Cloud-hosted Financial Exchanges for Response Time Fairness and DBO: Fairness for Cloud-Hosted Financial Exchanges. 




LLM2CLIP: Powerful language model unlocks richer visual representation

CLIP is a prominent multimodal foundational model, aligning visual and textual signals into a shared feature space. It supports various tasks, including zero-shot classification, detection, segmentation, and cross-modal retrieval, significantly influencing the entire multimodal domain. As a feature extractor, it has become dominant in cross-modal representation tasks such as image understanding, video understanding, and text-to-image/video generation. However, rapid advancements in large language models (LLMs) are continually pushing the boundaries of language comprehension and generation. Can the capabilities of LLMs be harnessed to further improve multimodal representation learning?

In a recent article: LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation, researchers from Microsoft and external collaborators propose LLM2CLIP, a novel approach to unlock CLIP’s potential, focusing on fundamental optimizations of promising foundation models. By fine-tuning the LLM in the caption space with contrastive learning, they extract its textual capabilities into the output embeddings, significantly improving the output layer’s textual discriminability. The researchers then design a training process where the fine-tuned LLM acts as a powerful teacher for CLIP’s visual encoder. The LLM’s presence allows them to incorporate longer and more complex captions without being restricted by the context window and capability limitations of CLIP’s text encoder. Their experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
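As background on the contrastive objective mentioned above, here is a minimal sketch of a symmetric CLIP-style contrastive loss over a batch of paired image and text embeddings. It illustrates the general technique only and is not the LLM2CLIP training code; the batch size, embedding dimension, and temperature are arbitrary.

# Minimal CLIP-style symmetric contrastive loss: matching image/text pairs
# sit on the diagonal of the similarity matrix and are treated as the targets.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)     # unit-normalize so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.T, targets)
    return (loss_image_to_text + loss_text_to_image) / 2

# Hypothetical batch of 8 paired embeddings with dimension 512.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())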


LORASC: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Foundation models, which are large-scale models pre-trained on extensive datasets and subsequently adapted for specific downstream tasks, have become integral to contemporary machine learning frameworks. Fine-tuning these models is essential, yet full parameter fine-tuning often encounters significant memory and computational bottlenecks. Parameter-efficient finetuning (PEFT) techniques aim to minimize the number of trainable parameters to reduce training costs and improve training stability. Among these techniques, Low-Rank Adaptation (LoRA) is highly efficient, although limitations in its expressiveness and generalization have been noted.

In a recent paper: LORASC: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning, researchers from Microsoft and external collaborators present an innovative technique designed to enhance LoRA’s expressiveness and generalization capabilities while preserving its training efficiency. Their cascaded learning strategy enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns. They also introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. Their extensive experiments on various language and vision datasets, as well as robustness benchmarks, show that the proposed method significantly outperforms existing baselines, while also mitigating overfitting, enhancing model stability, and improving out-of-distribution (OOD) robustness.
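For readers unfamiliar with LoRA itself, the sketch below shows the basic idea of a low-rank adapter added to a frozen linear layer. It illustrates plain LoRA only; the cascaded and slow-fast mechanisms of LORASC are not reproduced here, and the layer sizes, rank, and alpha are arbitrary examples.

# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update,
# y = W x + (alpha / rank) * B (A x). Only A and B are trained.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Wrap a hypothetical 768-dimensional projection; only A and B receive gradients.
layer = LoRALinear(nn.Linear(768, 768), rank=8, alpha=16.0)
out = layer(torch.randn(4, 768))
print(out.shape)   # torch.Size([4, 768])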

Microsoft Research in the news


Can AI spot the next bomb cyclone far in advance? Microsoft hopes so 

Seattle Times | November 23, 2024

Microsoft claims that Aurora, a deep-learning model that’s constantly being trained, can produce weather forecasts much faster than — and with accuracy that meets or exceeds — traditional forecasting models.


How Microsoft’s next-gen BitNet architecture is turbocharging LLM efficiency  

VentureBeat | November 13, 2024

One-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. In a new paper, Microsoft researchers introduce BitNet a4.8, a new technique that further improves the efficiency of one-bit LLMs without sacrificing their performance.


2024 Ellison Cliffe Lecture: AI in science and medicine with Christopher Bishop 

Royal Society of Medicine | November 13, 2024

Christopher Bishop, Technical Fellow and Director of Microsoft Research AI for Science, discusses the extraordinary advances in the deep learning technology that underpins the AI revolution, including crucial progress in the fields of scientific discovery and medicine. This recent speech at the Royal Society of Medicine includes current examples of AI’s impact in materials design, drug discovery, and healthcare.

The post Research Focus: Week of December 2, 2024 appeared first on Microsoft Research.

Read More

MarS: A unified financial market simulation engine in the era of generative foundation models

MarS: A unified financial market simulation engine in the era of generative foundation models

MarS illustration with document workflow and chatbot icons on a purple gradient background

Introduction

Generative foundation models have transformed various domains, creating new paradigms for content generation. Integrating these models with domain-specific data enables industry-specific applications. Microsoft Research has used this approach to develop the large market model (LMM) and the Financial Market Simulation Engine (MarS) for the financial domain. These innovations have the potential to empower financial researchers to customize generative models for diverse scenarios, establishing a new paradigm for applying generative models to downstream tasks in financial markets. This integration may provide enhanced efficiency, more accurate insights, and significant advancements in the financial domain. 

Applying generative models to financial markets

In recent years, generative foundation models have achieved notable success in fields like natural language processing and media generation. Their rise has sparked a new wave of research and industrial adoption, reshaping production processes across industries. These models excel due to three essential elements: a large volume of high-quality training data; effective tokenization and serialization of core information (such as semantic information in text); and an auto-regressive training approach that models data comprehensively, enabling implicit reasoning. 

Building on years of AI applications across industries, Microsoft researchers recognized that combining generative models with domain-specific data could lead to impactful solutions, particularly in finance. The financial market is a prime example, notably for its vast amount of order data, which is characterized by three key features: 

  • Fine granularity: Orders, as the atomic data of the financial market, provide a comprehensive and detailed representation of the real market. Combined with matching rules, they can be used to reproduce the entire market operation process. 
  • Large scale: Electronic trading has resulted in the accumulation of massive trade-order data across global exchanges. 
  • Well-structured: The structured nature of order data makes it ideal for tokenization and sequential modeling. 

These characteristics position order flow data as a critical foundation for generative modeling in financial markets. To this end, Microsoft Research developed LMM and MarS, which financial researchers can use to customize generative models for various applications, fostering a new paradigm of generative solutions for downstream tasks in finance. This has the potential to advance efficiency and insight generation in the financial industry. 

Figure 1: Illustration of Stock Market and Orders. On the left, a document icon shows order details. An arrow points to the right where multiple icons (robots and human figures) interact with charts and graphs representing market data.
Figure 1: Illustration of stock market and orders

Tokenization of order flow information

Order flow data is vital for generative models in finance, reflecting real-time interactions among market participants. It offers two types of value: 

  • Fine-grained market feedback: Each order, especially large ones, may influence others’ decisions, providing a micro-level view of pricing behavior. 
  • Macroscopic market dynamics: Collective interactions shape trading dynamics over time, capturing the evolution and resolution of conflicts between market forces. 

Researchers at Microsoft developed LMM by modeling both individual orders and entire order sets over time. This two-tiered approach captures both fine-grained feedback and macro-level dynamics of competition. Figure 2 shows the tokenization techniques for these models, enabling high-fidelity simulations of complex market dynamics. 
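To make the idea of order tokenization concrete, here is a toy sketch of how an individual order (type, price, volume, inter-arrival interval) might be binned into a small discrete vocabulary. The field names, bin edges, and vocabulary layout are illustrative assumptions of ours, not the actual LMM tokenizer.

```python
import numpy as np

# Hypothetical vocabulary layout: each field gets its own contiguous block of token ids.
PRICE_BINS = np.linspace(-0.01, 0.01, 33)      # bin edges for price offset from mid
VOLUME_BINS = np.geomspace(1, 10_000, 33)      # bin edges for order size (log-spaced)
INTERVAL_BINS = np.geomspace(1e-3, 60, 17)     # bin edges for seconds since previous order
ORDER_TYPES = {"limit_buy": 0, "limit_sell": 1, "cancel": 2, "market_buy": 3, "market_sell": 4}

def tokenize_order(order_type, price_offset, volume, interval):
    """Map one order to a short sequence of discrete tokens (illustrative scheme)."""
    t = ORDER_TYPES[order_type]
    p = int(np.digitize(price_offset, PRICE_BINS))
    v = int(np.digitize(volume, VOLUME_BINS))
    i = int(np.digitize(interval, INTERVAL_BINS))
    # Offset each field into its own id range so the flat sequence stays unambiguous.
    return [t, 10 + p, 50 + v, 90 + i]

# Example: a small limit buy placed 0.2% below mid, 0.5 seconds after the previous order.
print(tokenize_order("limit_buy", -0.002, 300, 0.5))
```

Batch-level tokenization would then summarize a whole window of such orders into a coarser grid, which is the second tier described above.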

Figure 2: Illustration of Tokenization for Individual Orders (Top) and Batch Orders (Bottom) . At the top left, a green document labeled 'Type Price Volume Interval' is connected by dotted lines to another document icon. To the right, a bar chart with red and green bars shows volume on the y-axis and numbers on the x-axis. Below, an arrow points from an 'Order Batch' section with three documents to three grids.
Figure 2: Tokenization for individual orders (top) and batch orders (bottom) 

Scaling law of the large market model: Unlocking the potential of financial data 

The effectiveness of generative models improves significantly with larger training datasets and model parameters. Researchers at Microsoft used two tokenization strategies to design models based on the Transformer architecture, testing them across varying data scales. Figure 3 illustrates the scaling behavior of both the order and order batch models, highlighting insights from historical trading data. This integration enhances the model’s ability to generate order flows with a deep understanding of market intricacies, enabling more accurate time-series modeling. 
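Scaling behavior of this kind is commonly summarized by fitting a power law of validation loss against training tokens. The sketch below fits such a curve with SciPy; the data points are synthetic placeholders for illustration, not the LMM measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(tokens_b, a, alpha, c):
    # Common empirical form: loss = a * tokens^(-alpha) + irreducible term c
    return a * tokens_b ** (-alpha) + c

# Illustrative (synthetic) points: training tokens in billions vs. validation loss.
tokens_b = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
loss = np.array([2.10, 1.85, 1.62, 1.47, 1.36])

params, _ = curve_fit(power_law, tokens_b, loss, p0=[0.5, 0.3, 1.2], maxfev=10_000)
a, alpha, c = params
print(f"fitted exponent alpha = {alpha:.3f}, asymptotic loss ~ {c:.2f}")
```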

Figure 3: Two line graphs comparing validation loss against the number of training tokens for different model sizes. The left graph, titled 'Order Model,' shows curves for model sizes ranging from 2M to 1.02B, with validation loss decreasing as the number of training tokens increases. The right graph, titled 'Order-Batch Model,' displays curves for model sizes ranging from 150M to 3B, also showing a decrease in validation loss with an increase in training tokens.
Figure 3: Scaling curves of order and batch order models under different parameter sizes 

MarS based on LMM

A customizable generative model for financial scenarios

Generative models, once trained, can be easily adapted for a range of downstream tasks, often outperforming traditional models tailored for specific scenarios. Building on the development of LMM, researchers further analyzed the needs of various financial scenarios and designed MarS as a versatile financial market simulation engine. MarS not only serves as a general-purpose simulation tool but also introduces a novel framework for applying generative models across diverse financial tasks, from market prediction and risk assessment to trading strategy optimization. 

Figure 4: Diagram of the MarS framework showing data flow and interactions between components like the current market & environment data, order-level historical market data, large marke model, generated order sequences, simulated market trajectories, and applications.
Figure 4: Framework of MarS

Constructing a unified paradigm for prediction and detection tasks 

Traditional financial prediction solutions often require the development of specialized algorithms, which must be frequently adjusted, consuming time and resources. LMM’s capacity to model financial markets in depth allows for periodic updates based on the latest data. MarS creates a virtual exchange to match order flows generated by LMM, simulating trades and deriving simulated market trajectories (see the top right of Figure 4). This approach can effectively address common prediction and detection tasks in financial scenarios, introducing innovative solutions within the generative model framework. 
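At its core, such a virtual exchange is a matching engine run over generated order flow. The sketch below implements a toy price-time-priority matcher to show the principle; it is a deliberate simplification under our own assumptions and is not the MarS engine.

```python
import heapq

def match_orders(orders):
    """Toy continuous double auction: price-time priority, partial fills allowed.
    `orders` is an iterable of (side, price, qty) tuples in arrival order.
    Returns the list of trade prices, i.e. a simulated price trajectory."""
    bids, asks = [], []   # max-heap (negated price) and min-heap of (key, seq, qty)
    trades, seq = [], 0
    for side, price, qty in orders:
        seq += 1
        if side == "buy":
            # Cross against the best asks while they sit at or below our limit price.
            while qty > 0 and asks and asks[0][0] <= price:
                ask_price, ask_seq, ask_qty = heapq.heappop(asks)
                fill = min(qty, ask_qty)
                trades.append(ask_price)
                qty -= fill
                if ask_qty > fill:
                    heapq.heappush(asks, (ask_price, ask_seq, ask_qty - fill))
            if qty > 0:
                heapq.heappush(bids, (-price, seq, qty))
        else:
            while qty > 0 and bids and -bids[0][0] >= price:
                neg_bid, bid_seq, bid_qty = heapq.heappop(bids)
                fill = min(qty, bid_qty)
                trades.append(-neg_bid)
                qty -= fill
                if bid_qty > fill:
                    heapq.heappush(bids, (neg_bid, bid_seq, bid_qty - fill))
            if qty > 0:
                heapq.heappush(asks, (price, seq, qty))
    return trades

# Example: feed a few generated orders through the toy exchange.
print(match_orders([("buy", 100.0, 5), ("sell", 99.5, 3), ("sell", 100.5, 4), ("buy", 101.0, 2)]))
```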

Applications in prediction tasks

Prediction tasks, vital in finance, involve estimating future market metrics. Traditional models require modifications with any change in prediction targets. MarS addresses this by continuously generating future order flows from recent data, which are matched in a virtual exchange, allowing for the simulation of potential future market trajectories. This provides a significant advancement in forecasting capabilities.

Figure 5 demonstrates the use of MarS in forecasting stock-price movements, where it significantly outperforms traditional benchmark algorithms. Taking the Order Model (1.02B) as an example, its performance exceeds that of DeepLOB by approximately 13.5% (0.662/0.583 − 1) at a 1-minute horizon, rising to 22.4% (0.579/0.473 − 1) at a 5-minute horizon. This widening performance gap suggests that the Order Model maintains its predictive accuracy more effectively over longer horizons, highlighting its superior generalization compared to the baseline, especially as the prediction task becomes more challenging over extended timeframes. This provides an attractive solution for prediction tasks in financial markets, while also highlighting LMM’s capability in modeling stock market dynamics. 

Figure 5: Line graph comparing prediction accuracy over time for three models: DeepLOB, Order Model (0.22B), and Order Model (1.02B). Prediction accuracy decreases as time increases from 1 to 5 minutes, with DeepLOB showing the lowest accuracy and Order Model (1.02B) showing the highest.
Figure 5: Predicting stock price trends with MarS

Applications in detection tasks

For regulators, detecting systemic risks or market abuse is critical for market stability. LMM models typical market patterns, enabling the identification of anomalies by comparing real market trajectories with those generated by MarS. Figure 6 shows the differences in the spread distribution (i.e., the difference between the best buy and sell prices, which reflects asset liquidity) between simulated and real market trajectories during a confirmed malicious market manipulation incident. This comparison can uncover subtle deviations indicative of unusual activities, offering regulators effective tools for monitoring market integrity.

Figure 6: Three bar graphs comparing the distribution similarity of data across three different periods: pre-manipulation, manipulation period, and post-manipulation. Each graph shows the probability distribution for 2 types of data: Replay and Simulation. The distribution similarity scores are 0.870 for pre-manipulation, 0.835 for the manipulation period, and 0.873 for post-manipulation.
Figure 6: Spread correlation between simulated and real market during market manipulation 

Defining new FinTech scenarios 

Generative models can create tailored content from simple descriptions. In MarS, a mechanism generates specific order flows from natural language descriptions of market conditions. To address extreme conditions, researchers developed a control signal system using a hierarchical diffusion model to generate high-fidelity signals during rare events, such as stock market crashes and circuit breakers. This capability transforms broad descriptions into precise order flow controls. 

By integrating controlled order generation with real-time feedback, MarS creates a unified framework for prediction and detection tasks, redefining financial research, applications, and market understanding. Key applications include “What If” analyses and training environments for reinforcement learning algorithms in realistic market conditions. 

“What If” analysis for financial research

The question “What would happen if different sizes of trading orders were executed under different market conditions?” is crucial for understanding market behavior. Traditional methods, relying on real orders, experience, and assumptions, are costly and slow. Generative models provide a breakthrough solution.

Figure 7 illustrates how MarS can simulate market impact: the top left shows how buy orders affect asset price trajectories, while the top right presents market impact curves of different strategies, matching traditional patterns. Researchers also used MarS to generate large-scale simulated data, constructing a market impact model using ordinary differential equations (ODEs). The bottom left of Figure 7 shows the derived impact formula, and the bottom right demonstrates its interpretability. These advancements highlight MarS’s potential to enhance “What If” research through deep market modeling. 

Figure 7: Composite image of four graphs related to sample research results for market impact of orders Using MarS. The top left graph shows mid-price over time with two lines representing simulation and replay actions. The top right graph displays market impact for different agent types over time. The bottom left graph illustrates the auto-correlation of market impact decay for learned ODE, base ODE, and synthetic Seq. The bottom right heatmap shows interaction weights of the learned ODE with various features on the x-axis and log-transformed time on the y-axis.
Figure 7: Sample research results for market impact of orders using MarS 

Training environments for reinforcement learning in financial markets

Reinforcement learning (RL) algorithms require controlled environments for testing and optimization. Financial market behaviors often manifest through order flow changes, impacting the market. If the simulation cannot reflect these impacts accurately, an RL algorithm may fail in real-world scenarios.

MarS provides high-fidelity generation and real-time feedback, creating a comprehensive environment for RL in finance. Figure 8 shows the training process of trading agents, highlighting significant improvements in performance over time and demonstrating MarS’s effectiveness as an RL training ground. 
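To illustrate how a generative market simulator can be wrapped as an RL training environment, here is a Gymnasium-style sketch for a simple order-execution task. The `ToyMarketSimulator` class is a hypothetical stand-in of ours; the real MarS engine and its interface are not shown here.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ToyMarketSimulator:
    """Hypothetical stand-in for a generative market simulator (not MarS itself)."""
    def __init__(self, horizon=50):
        self.horizon = horizon
    def reset(self):
        self.t, self.price = 0, 100.0
        return self.price
    def step(self, sell_qty):
        # Random walk plus a crude, illustrative market-impact term from our selling.
        self.price += np.random.normal(0, 0.05) - 0.001 * sell_qty
        self.t += 1
        return self.price, self.t >= self.horizon

class ExecutionEnv(gym.Env):
    """Agent must sell `total_qty` shares over the episode, maximizing proceeds."""
    def __init__(self, total_qty=100):
        super().__init__()
        self.sim = ToyMarketSimulator()
        self.total_qty = total_qty
        self.action_space = spaces.Box(0.0, 10.0, shape=(1,), dtype=np.float32)      # qty per step
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.remaining = float(self.total_qty)
        price = self.sim.reset()
        return np.array([price, self.remaining], dtype=np.float32), {}

    def step(self, action):
        qty = float(np.clip(action[0], 0.0, self.remaining))
        price, done = self.sim.step(qty)
        self.remaining -= qty
        reward = qty * price                        # proceeds from this child order
        terminated = done or self.remaining <= 0
        obs = np.array([price, self.remaining], dtype=np.float32)
        return obs, reward, terminated, False, {}
```

The key point is the feedback loop: because the simulator reacts to the agent’s own orders, policies trained in it see realistic impact rather than a static replay.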

Figure 8: Line graph titled 'Price Advantage' on the y-axis and 'Step' on the x-axis. The graph shows an upward trend as the steps increase.
Figure 8: Performance of reinforcement learning trading agents trained in MarS. During training, the agent’s performance improved significantly, showcasing MarS’s ability to aid in developing robust reinforcement learning algorithms for real market conditions. 

Disclaimer: The research mentioned in this article, conducted by Microsoft Research, focuses on scientific exploration, aiming to advance knowledge and provide theoretical and technological support for research and applications in the financial field. All studies adhere to Microsoft’s responsible AI guidelines, ensuring principles such as fairness, inclusiveness, reliability and safety, transparency, privacy, and accountability are maintained. The technologies and methods discussed are still under research and development, not yet forming any commercial products or services, nor constituting any financial solutions. Readers are advised to consult certified financial professionals before making any financial decisions. 

The post MarS: A unified financial market simulation engine in the era of generative foundation models appeared first on Microsoft Research.


Advances in run-time strategies for next-generation foundation models


A visual illustration of Medprompt performance on the MedQA benchmark, showing how successive components add accuracy: zero-shot at 81.7%, random few-shot at 83.9%, random few-shot with chain-of-thought at 87.3%, kNN few-shot with chain-of-thought at 88.4%, and ensemble with choice shuffle at 90.2%.

Frontier language models are advancing rapidly, paving the way for boosts in the accuracy and reliability of generalist models and making them highly effective in specialized domains. As part of our ongoing exploration of foundation model capabilities, we developed Medprompt last year—a novel approach to maximize model performance on specialized domains and tasks without fine-tuning. By leveraging multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style), outperforming all other methods. 

A line chart that plots the MedQA test accuracy (y-axis) over time (x-axis).  

The OpenAI o1-preview model achieves the highest result at 96.0% accuracy, followed by Med-Gemini at 91.1%, GPT-4 (Medprompt) at 90.2%, Med PaLM 2 at 86.5%, GPT-4 base at 86.1%, Med PaLM at 67.2%, GPT-3.5 base at 60.2%, BioMedLM at 50.3%, DRAGON at 47.5%, BioLinkBERT at 45.1%, and PubMedBERT at 38.1%.
Figure 1. Comparative analyses of performance of multiple models on MedQA.

Less than a year later, our tests show the OpenAI o1-preview demonstrated superior performance over Medprompt, reaching 96% on the same benchmark (Figure 1)—without using sophisticated prompt guidance and control. This advancement is driven by the model’s integration of run-time strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and nursing exams (NCLEX) as shown in Figure 2. 

A spider web chart plotting the performance of OpenAI o1-preview (0 shot ensemble) compared to GPT-4 (Medprompt) and GPT-4 (5 shot) model performance on medical challenge problems. o1-preview achieves state-of-the-art results on MedQA US (4-option), JMLE-2024, MedMCQA Dev, MMLU Anatomy, MMLU Medical Genetics, MMLU Professional Medicine, MMLU College Biology, and MMLU College Medicine, and NCLEX. GPT-4 (Medprompt) performed better than OpenAI o1-preview (0 shot ensemble) on MMLU Clinical Knowledge
Figure 2. Comparisons on a wide range of medical challenge benchmarks.

These results are notable, prompting us to publish our recent study, findings, and analyses, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond (opens in new tab). But the numbers are only part of the story. In this blog, we discuss prompting strategies to make the most of o1-preview models and other factors to consider as well as directions forward for run-time strategies.

Is o1-preview “just” fancy prompting? 

The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike GPT, o1 models are trained using reinforcement learning (RL) techniques that enable them to “think” before generating outputs. While Medprompt relies on a cascade of operations with GPT-4 at run time guided by a multistage prompt, the o1 series incorporates this run-time reasoning directly into its RL-based design. The built-in functionality enables the o1 models to significantly outperform even the best results using GPT-4 and Medprompt. The performance gains come with a notable tradeoff: its per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While the results for GPT-4o with Medprompt fall short of o1-preview model performance, the combination offers a more cost-effective alternative. The cost-benefit tradeoffs are highlighted in the following figure, with the x-axis presented on a logarithmic scale.

A line chart plotting accuracy on the MedQA Test (y-axis) versus total cost on a logarithmic scale (x-axis). OpenAI o1-preview using 5x, 10x, and 15x Ensemble hover around 1000 total cost. OpenAI o1-preview using Tailored Prompt, Minimal Prompt, Few-shot, kNN Few-shot are around 100 total cost. GPT-4o with Medprompt is below 100; kNN Few-shot CoT, Few-shot CoT, and Few-Shot are at 10; Zero-shot is at 1. GPT-4-Turbo with Medprompt is at 200; kNN Few-shot CoT, Few-shot CoT, and Few-Shot hover near 50, Zero-shot is near 5.
Figure 3. Pareto frontier showing accuracy versus total API cost (log scale) on the MedQA benchmark (1273 questions total). o1-preview (Sep 2024) is compared with GPT-4o (Aug 2024) and GPT-4 Turbo (Nov 2023).

Can we prompt engineer o1-preview?

The o1-preview model exhibits distinct run-time behaviors compared to the GPT series. While some of our more dynamic prompting strategies performed better than expected with o1-preview models, our most tried-and-true strategy was anything but consistent throughout our evaluation. Figure 4 captures specific performance results for Tailored Prompt, Ensembling, and Few-Shot Prompting on o1-preview. Here’s a summary of our findings: 

  1. Tailored Prompt: While minimal prompting—like a brief, one-sentence description followed by a question—offered a strong baseline performance, detailed task descriptions were best for eliciting accurate responses.
  2. Ensembling: Generating multiple answers per question and using majority voting across different reasoning paths boosted reliability, while shuffling the answer choices between runs produced richer reasoning chains and improved outcomes. Ensembling continues to yield consistent performance improvements (a minimal code sketch of this strategy follows Figure 4).
  3. Few-Shot Prompting: Guiding the model with a few examples produced inconsistent results and, on average, decreased performance compared with GPT models.
Three charts show the accuracy of o1-preview when combined with Tailored Prompt, Ensemble, and 5-shot KNN based on an average baseline of medical benchmarks. Tailored Prompts improves accuracy from 94.2 to 94.7; Ensemble (15x) improves accuracy from 94.2 to 95.5; 5-shot KNN decreases accuracy from 94.2 to 93.7.
Figure 4. Tests of different prompting strategies across benchmark datasets.
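To make the ensembling-with-choice-shuffle idea concrete, the following sketch runs several shuffled copies of a multiple-choice question through a model and majority-votes the mapped-back answers. `ask_model` is a hypothetical stand-in for a call to o1-preview or GPT-4o, and the run count is illustrative rather than the configuration used in the paper.

```python
import random
from collections import Counter

def ask_model(question, options):
    """Hypothetical model call: returns the index of the chosen option.
    Replace with a real API call (e.g., to o1-preview) in practice."""
    raise NotImplementedError

def ensemble_answer(question, options, runs=5, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(runs):
        order = list(range(len(options)))
        rng.shuffle(order)                          # shuffle the answer choices per run
        shuffled = [options[i] for i in order]
        choice = ask_model(question, shuffled)      # model picks an index in shuffled order
        votes.append(order[choice])                 # map back to the original option index
    winner, _ = Counter(votes).most_common(1)[0]    # majority vote across runs
    return options[winner]
```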

Do the results hold in another language? 

A chart with two bar charts measuring accuracy (y-axis) for short and long questions (x-axis) on the Japanese Medical Licensing Examination. The short question bar is slightly higher than the long question bar for o1-preview (0-shot ensemble). The short question bar is about two points less accurate than the long question bar for o1-preview (0-shot). The short question bar is a point more accurate than the long question bar for GPT-4o (Medprompt). The short question bar is one point less accurate than the long question bar for GPT-4o (0-shot).
Figure 5. JMLE-2024: National medical licensing exam held in Japan (Feb 2024).

We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam. The JMLE (Japanese Medical Licensing Examination) is written in Japanese and was administered in February 2024, after the o1-preview model’s knowledge cutoff. Even without translation to English, the o1-preview model achieved a remarkable score of 98.2% accuracy (Figure 5), well above the exam’s minimum passing score of approximately 80%.  

Do reasoning tokens improve performance? 

For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. Our findings showed that by adjusting the prompt, we were able to consistently increase the number of reasoning tokens used by o1-preview, and the increase was directly correlated with improved performance as demonstrated in Figure 6.
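The two prompting styles can be as simple as appending different instructions to the same question. The sketch below shows one way to phrase them; `query_model` is a hypothetical stand-in for an o1-preview API call, and the wording is our own illustration rather than the prompts used in the study.

```python
QUICK_RESPONSE_SUFFIX = (
    "Answer immediately with only the letter of the correct option."
)
EXTENDED_REASONING_SUFFIX = (
    "Think through every option carefully, weigh the evidence for and against each, "
    "and double-check your reasoning before giving the letter of the correct option."
)

def query_model(prompt):
    """Hypothetical model call returning (answer_text, reasoning_token_count)."""
    raise NotImplementedError

def compare_reasoning_budgets(question):
    quick = query_model(f"{question}\n\n{QUICK_RESPONSE_SUFFIX}")
    extended = query_model(f"{question}\n\n{EXTENDED_REASONING_SUFFIX}")
    return {"quick": quick, "extended": extended}
```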

A chart plotting the impact of reasoning tokens on accuracy. JMLE achieved 95.3% accuracy for Quick Response Prompt and 96.7% accuracy for Extended Reasoning Prompt. MMLU achieved 94.9% accuracy for Quick Response Prompt and 94.7% accuracy for Extended Reasoning Prompt. MedQA achieved 94.3% accuracy for Quick Response Prompt and 95.1% accuracy for Extended Reasoning Prompt. USMLE Sample Exam achieved 92.6% accuracy for Quick Response Prompt and 93.1% accuracy for Extended Reasoning Prompt. USMLE Self Assessment achieved 91.8% accuracy for Quick Response Prompt and 92.2% accuracy for Extended Reasoning Prompt.
Figure 6. The effect of two prompting strategies that elicit variable length reasoning chains across benchmark datasets.

What’s the takeaway? 

Bottom line: There’s a little something for everyone when it comes to run-time strategies. We’re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt leveraging calls to GPT-4 is a viable option for medicine and beyond. We summarize the relative performance of prompting strategies in Figure 7 to help determine the best option, or check out the paper for a detailed breakdown of every dataset, experimental configuration, and prompt template (opens in new tab).

A matrix showing the relative performance of prompting strategies over the zero-shot baseline across medical benchmarks. Columns, left to right: JMLE, MMLU, MedMCQA, MedQA, USMLE Sample Exam, USMLE Self Assessment.
Baseline (zero-shot): 95.6%, 94.6%, 81.4%, 94.9%, 94.0%, 91.8%.
5-shot Random: +1.2%, -1.1%, 0.0%, -1.4%, -0.4%, -1.0%.
5-shot KNN: +0.6%, -0.1%, +1.2%, -2.2%, -0.3%, -0.6%.
Bootstrap Ensemble (5x): +1.5%, +0.1%, +1.3%, +0.7%, +1.3%, +1.0%.
Bootstrap Ensemble (10x): +1.4%, +0.6%, +1.5%, +0.7%, +1.3%, +1.1%.
Ensemble (15x): +1.5%, +0.6%, +2.0%, +1.1%, +2.0%, +1.3%.
Tailored Prompt: +1.6%, +0.4%, +0.9%, +0.2%, +0.0%, +0.4%.
Tailored Bootstrap Ensemble (5x): +2.2%, +0.7%, +1.8%, +0.8%, +0.9%, +1.1%.
Tailored Bootstrap Ensemble (10x): +2.3%, +0.7%, +2.1%, +0.9%, +0.9%, +1.2%.
Tailored Ensemble (15x): +2.5%, +0.4%, +2.6%, +1.1%, +0.9%, +1.4%.
Figure 7. Heatmap showing absolute accuracy and relative performance over baseline zero-shot prompt (in parenthesis) across all benchmark datasets.

Anything more to consider?

We highlighted several considerations in the paper that are worth checking out. Here are three opportunities that are top of mind:

  • Research on run-time strategies. The research community has largely relied on boosting model capabilities with data, compute, and model size, predictably achieving gains by way of scaling laws. A promising new direction is inference-time scaling—the value of investing in additional computation and machinery for guiding inference at run time. We highlight in the paper opportunities to guide run-time allocations to boost efficiency, accuracy, and intellectual capabilities, including meta reasoning and reflection in real time and learning during the “idle” time (opens in new tab) between problem solving. We see a great deal of opportunity for new research and development on real-time and “offline” reasoning, learning, and reflection.
  • Benchmark saturation. With the rapid advancement of state-of-the-art models, many existing medical benchmarks are reaching “saturation,” where models perform extremely well on long-standing medical competency challenges that were considered exceptionally difficult just a few years ago. Current benchmarks, such as USMLE and JMLE, were designed to assess the performance of medical students and clinicians and are increasingly inadequate for evaluating cutting-edge AI models. To deepen our understanding of models and guide research, we need to design more challenging medical benchmarks.
  • From benchmarks to clinical applications. We note that, while benchmarks offer valuable insights into performance and accuracy, they often fail to capture the complexities and nuances of real-world clinical decision making and, more broadly, healthcare delivery. Conducting clinical trials to rigorously evaluate the impact of AI applications on patient care poses far greater difficulties than benchmarking models against challenge problems drawn from medical competency exams. Yet studies of AI deployments in realistic clinical settings are essential for understanding model capabilities and for guiding the effective integration of AI into healthcare.


The post Advances in run-time strategies for next-generation foundation models appeared first on Microsoft Research.


Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation



The Global Health Drug Discovery Institute (opens in new tab) (GHDDI) and Microsoft Research have reached a milestone in tuberculosis (TB) drug research with TamGen (opens in new tab), an open-source (opens in new tab), transformer-based chemical language model for developing target-specific drug compounds. Working in collaboration, the joint team successfully identified several promising inhibitors for a TB protease, with the most effective compound showing significant bioactivity. Research shows that TamGen can also optimize existing molecules by designing target-aware molecule fragments, potentially enabling the discovery of novel compounds that build on a known molecular core structure.  

Generative AI helps overcome limitations in drug discovery

Generative AI is opening new avenues for scientific exploration by allowing computers to autonomously learn and produce original content. TamGen offers a new approach to drug discovery by applying the principles of generative AI to molecular design. Unlike traditional methods, which depend on systematically screening known compounds—a process that is long, complex, and costly due to its reliance on empirical knowledge and the time-consuming task of exploring a vast chemical library—generative AI provides opportunities for designing entirely new chemical structures.  

TamGen goes beyond analyzing existing data by generating chemically diverse compounds that conventional approaches might miss. Figure 1 shows that generative AI expands chemical exploration, allowing for a deeper and more comprehensive search for therapeutic solutions compared to traditional methods.

Two funnels illustrating that, compared to the traditional screening-based approach, the generative AI-based approach enables the exploration of a broader range of novel compounds. The table below the two funnels has rows labeled “Valuable cmpd%” and “Novel cmpd” to compare the “Screening-based” and “Generative-based” approaches: the “Screening-based” column shows “Low”, “No”, “High”, versus the “Generative-based” column, which shows “High”, “Yes”, “Low”.
Figure 1. Compared with the traditional screening-based approach to drug discovery, a generative AI-based approach enables the discovery of novel compounds. 

TamGen workflow 

TamGen’s workflow uses generative AI to design target-specific chemical compounds. Building on the success of large language models (LLMs), we adapted a similar approach for molecular generation, using a training method like that of GPT models, which involves next-token prediction. Molecules were first converted into simplified molecular-input line-entry system (SMILES) strings—a notation representing molecular structures as symbol sequences, similar to text. We then developed a protein encoder to process information about proteins, including their 3D structure.  

A contextual encoder combines insights from medical professionals with data on the protein target and existing compounds that have proven to be effective or promising. Using expert knowledge and computational analysis, this encoder guides the compound generator to produce new molecules that are more likely to bind to a given protein. This workflow is illustrated in Figure 2. 
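The SMILES side of this pipeline is straightforward to illustrate: RDKit can canonicalize a molecule into a SMILES string, which is then tokenized for GPT-style next-token prediction. The character-level tokenization below is a deliberate simplification of ours; TamGen’s actual tokenizer, protein encoder, and contextual encoder are not shown.

```python
from rdkit import Chem

def canonical_smiles(smiles):
    """Parse and re-emit a molecule so that equivalent inputs map to one string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol, canonical=True)

def next_token_pairs(smiles, bos="^", eos="$"):
    """Character-level (context, target) pairs for autoregressive training (simplified)."""
    seq = bos + smiles + eos
    return [(seq[: i + 1], seq[i + 1]) for i in range(len(seq) - 1)]

# Example: aspirin, canonicalized and turned into next-token training pairs.
smi = canonical_smiles("CC(=O)Oc1ccccc1C(=O)O")
for context, target in next_token_pairs(smi)[:3]:
    print(repr(context), "->", repr(target))
```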

The protein encoder processes protein information, including 3D structure, to build a protein representation. The contextual encoder captures protein–ligand information to guide the generation of new molecules that bind to the protein. Molecules are converted into SMILES strings, and a GPT-like model pre-trained on SMILES generates the molecular compounds.
Figure 2. TamGen’s workflow 

Evaluating TamGen computationally 

To evaluate TamGen’s performance, we compared it to five other common methods used to create 3D shapes of molecules intended to bind to certain proteins. We evaluated these methods using the CrossDocked benchmark, a dataset used in AI research to assess the quality of molecule generation conditioned on a target protein.

Evaluation metrics (a short code sketch for computing several of these follows the list): 

  • Docking score: Measures how well a molecule binds to a target protein. 
  • Quantitative estimate of drug-likeness (QED): Assesses how good a candidate a molecule is for a drug. 
  • Synthesis accessibility score (SAS): Measures how easy or difficult it is to synthesize a particular chemical compound in a lab. 
  • Ro5 (Lipinski’s rule of five): Determines how likely a compound can be developed into an oral drug.  
  • LogP: Tests a compound’s ability to move between water and fats. 
  • Diversity: Measures the range of different molecular structures and properties in a collection of compounds.  
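Several of these metrics can be computed directly from a SMILES string with RDKit, as in the sketch below. The synthesis accessibility score lives in RDKit’s contrib `SA_Score` module and the docking score requires a separate docking engine, so both are omitted here; the example molecule is ours.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def basic_drug_metrics(smiles):
    """Compute QED, LogP, and a Lipinski rule-of-five check for one compound."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    logp = Descriptors.MolLogP(mol)
    ro5_violations = sum([
        Descriptors.MolWt(mol) > 500,
        logp > 5,
        Descriptors.NumHDonors(mol) > 5,
        Descriptors.NumHAcceptors(mol) > 10,
    ])
    return {
        "QED": QED.qed(mol),               # drug-likeness, 0 (poor) to 1 (drug-like)
        "LogP": logp,                      # octanol-water partition coefficient
        "Ro5_violations": ro5_violations,  # 0 or 1 violations is generally acceptable
    }

print(basic_drug_metrics("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
```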

The findings, illustrated in Figure 3, show TamGen’s overall performance. While other methods may produce compounds that bind more strongly, they often include multiple interconnected ring structures. Research indicates that more of these structures can lower synthesis accessibility (SAS) and increase cellular toxicity, making these compounds harder to develop. We believe that molecular pretraining of the model contributed to the overall effectiveness of the compounds TamGen generated.

The figure uses color shading to illustrate performance on “Docking”, “QED”, “Lipinski”, “SAS”, “logP”, and “Diversity” for TamGen and five other methods: “Pocket2Mol”, “ResGen”, “TargetDiff”, “3D-AR”, and “LiGAN”. The results show that TamGen achieved the best overall performance.
Figure 3. Results from TamGen’s computational performance verification

Experimental lab verification 

To ensure real-world applicability, we also validated our findings in a hands-on lab environment. Here, we focused on the ClpP protease in Mycobacterium tuberculosis as the target because it plays a significant role in the bacterium’s survival under stress conditions. We proposed the Design-Refine-Test pipeline to effectively identify molecular compounds for TB drug discovery.

Design stage: We began by using TamGen to analyze the binding pocket of the protease, where molecules can attach and influence its function. TamGen generated about 2,600 potential compounds that could fit into this pocket. We assessed these compounds based on how well they could attach to the protease and their predicted biological effects, narrowing it down to four promising candidates. 

Refine stage: Next, we entered the four compounds into TamGen, along with three molecular fragments that had been validated in previous lab experiments. This generated a total of 8,600 new compounds, which we screened again using the same criteria, eventually narrowing the selection to 296 compounds.

Test stage: Because synthesizing all 296 compounds wasn’t feasible, we identified similar compounds available in commercial libraries and tested their initial activity against TB. Five compounds showed promising results. We then synthesized one of the originals and two variants of another. Additionally, we categorized the generated compounds into clusters, selected the top 10% from each cluster based on docking scores, and after manual review, synthesized eight more compounds. 
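A clustering-and-selection step of this kind might look like the following sketch, which groups compounds by fingerprint similarity and keeps the best-docking fraction of each cluster. The docking scores, similarity cutoff, and keep fraction here are placeholders of ours; the actual TamGen/GHDDI pipeline and cutoffs are not reproduced.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def select_top_by_cluster(smiles_list, docking_scores, sim_cutoff=0.6, keep_frac=0.10):
    """Cluster compounds by Morgan-fingerprint similarity, then keep the best-docking
    fraction of each cluster (lower docking score = stronger predicted binding)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

    # Butina clustering expects a flat lower-triangle distance matrix.
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    clusters = Butina.ClusterData(dists, len(fps), 1.0 - sim_cutoff, isDistData=True)

    selected = []
    for cluster in clusters:
        ranked = sorted(cluster, key=lambda idx: docking_scores[idx])
        keep = max(1, int(round(keep_frac * len(ranked))))
        selected.extend(ranked[:keep])
    return [smiles_list[i] for i in selected]

# Example with a handful of molecules and made-up docking scores.
smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]
scores = [-5.2, -4.8, -6.1, -6.7, -4.0]
print(select_top_by_cluster(smiles, scores))
```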

The Microsoft Research team generated the compounds with TamGen, and the GHDDI team conducted binding analysis, structure–activity relationship studies, and lab experiments to verify the compounds’ inhibitory effect on the ClpP protease, measuring their capacity to interfere with or reduce its activity. Lower IC50 values signify greater potency. Of the 16 compounds tested, 14 showed strong inhibitory activity, measuring under 40 µM, indicating high potential. The most effective compound had a measured IC50 value of 1.88 µM.

The figure shows, step by step, how TamGen analyzes the binding pocket of the protease across the Design, Refine, and Test stages, along with the process and results of each stage.
Figure 4. The hands-on lab verification process  

From molecule to fragment generation 

In addition to generating new molecules, TamGen can optimize existing ones by designing smaller molecular fragments. In this fragment generation process, TamGen builds on a given protein target and a molecular core structure to design new compounds around that core. By incorporating information about the target protein, it generates fragments that are highly specific to the target. This approach moves beyond traditional methods that rely on pre-existing databases, which often limit both novelty and effectiveness of molecular fragments.

For fragment generation, we adjusted the input to TamGen’s compound generator. We modified the SMILES string to ensure it ended at the desired growth site. This was done by specifying the fragment we wanted to retain and its connection point for further growth. The tailored SMILES string was then fed into the compound generator to extend the molecule. 
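In spirit, fragment generation is prefix-conditioned sampling: the retained core is written as a SMILES prefix that stops at the growth site, and the generator continues the string token by token. The sketch below shows this pattern with a hypothetical `sample_next_token` call standing in for a target-conditioned generator; the prefix handling and sampling loop are our own simplified illustration, not TamGen’s implementation.

```python
from rdkit import Chem

def sample_next_token(protein_context, smiles_prefix):
    """Hypothetical call into a target-conditioned generator (e.g., a TamGen-like model):
    returns the next SMILES character, or '$' to stop."""
    raise NotImplementedError

def grow_fragment(protein_context, core_prefix, max_len=80):
    """Autoregressively extend a SMILES prefix that ends at the desired growth site."""
    smiles = core_prefix
    for _ in range(max_len):
        token = sample_next_token(protein_context, smiles)
        if token == "$":                      # end-of-sequence marker
            break
        smiles += token
    mol = Chem.MolFromSmiles(smiles)          # keep only chemically valid completions
    return Chem.MolToSmiles(mol) if mol is not None else None
```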

We evaluated this method by targeting the ClpP protease for TB, achieving a more than tenfold improvement in the binding affinity of the generated compound compared to the original. Some compounds also demonstrated slow binding, indicating potential for prolonged action and improved selectivity for the target protein.

AI’s potential in drug discovery 

TamGen showcases the transformative potential of generative AI in drug design, combining advanced molecular modeling with researcher-AI collaboration. Tasks that once took years can now be accomplished in a fraction of the time. This research underscores AI’s expanding role in drug discovery and its promise for developing effective treatments against persistent infectious diseases like TB. 

Looking ahead, we plan to integrate advanced techniques into TamGen, including diffusion models for generating 3D structures, reinforcement learning to apply physical constraints, and molecular dynamics simulations to capture proteins’ shifting shapes. These enhancements aim to improve how well generated compounds bind to target proteins, increase their ability to be synthesized, and strengthen other critical drug properties.

The post Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation appeared first on Microsoft Research.
