AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come. 

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity. 

This episode features Technical Fellow Christopher Bishop, who leads a global team of researchers and engineers working to help accelerate scientific discovery by merging machine learning and the natural sciences. Llorens and Bishop explore the state of deep learning; Bishop’s new textbook, Deep Learning: Foundations and Concepts, his third and a writing collaboration with his son; and a potential future in which “super copilots” accessible via natural language and drawing on a variety of tools, like those that can simulate the fundamental equations of nature, are empowering scientists in their pursuit of breakthroughs.

Chris Bishop with son and coauthor Hugh Bishop

Transcript

[MUSIC PLAYS] 

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more excited to work in the field than right now. The latest foundation models and the systems we’re building around them are exhibiting surprising new abilities in reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Chris Bishop. Chris was educated as a physicist but has spent more than 25 years as a leader in the field of machine learning. Chris directs our AI4Science organization, which brings together experts in machine learning and across the natural sciences with the aim of revolutionizing scientific discovery.


[MUSIC FADES] 

So, Chris, you have recently published a new textbook on deep learning, maybe the new definitive textbook on deep learning. Time will tell. So, of course, I want to get into that. But first, I’d like to dive right into a few philosophical questions. In the preface of the book, you make reference to the massive scale of state-of-the-art language models, generative models comprising on the order of a trillion learnable parameters. How well do you think we understand what a system at that scale is actually learning? 

CHRIS BISHOP: That’s a super interesting question, Ashley. So in one sense, of course, we understand the systems extremely well because we designed them; we built them. But what’s very interesting about machine learning technology compared to most other technologies is that the, the functionality in large part is learned, is learned from data. And what we discover in particular with these very large language models is, kind of, emergent behavior. As we go up at each factor of 10 in scale, we see qualitatively new properties and capabilities emerging. And that’s super interesting. That, that was called the scaling hypothesis. And it’s proven to be remarkably successful. 

LLORENS: Your new book lays out foundations in statistics and probability theory for modern machine learning. Central to those foundations is the concept of probability distributions, in particular learning distributions in the service of helping a machine perform a useful task. For example, if the task is object recognition, we may seek to learn the distribution of pixels you’d expect to see in images corresponding to objects of interest, like a teddy bear or a racecar. On smaller scales, we can at least conceive of the distributions that machines are learning. What does it mean to learn a distribution at the scale of a trillion learnable parameters? 

BISHOP: Right. That’s really interesting. So, so first of all, the fundamentals are very solid. The fact that we have this, this, sort of, foundational rock of probability theory on which everything is built is extremely powerful. But then these emergent properties that we talked about are the result of extremely complex statistics. What’s really interesting about these neural networks, let’s say, in comparison with the human brain is that we can perform perfect diagnostics on them. We can understand exactly what each neuron is doing at each moment of time. And, and so we can almost treat the system in a, in a, sort of, somewhat experimental way. We can, we can probe the system. You can apply different inputs and see how different units respond. You can play games like looking at a unit that responds to a particular input and then perhaps amplifying the, amplifying that response, adjusting the input to make that response stronger, seeing what effect it has, and so on. So there’s an aspect of machine learning these days that’s somewhat like experimental neurobiology, except with the big advantage that we have sort of perfect diagnostics. 

LLORENS: Another concept that is key in machine learning is generalization. In more specialized systems, often smaller systems, we can actually conceive of what we might mean by generalizing. In the object recognition example I used earlier, we may want to train an AI model capable of recognizing any arbitrary image of a teddy bear. Because this is a specialized task, it is easy to grasp what we mean by generalization. But what does generalization mean in our current era of large-scale AI models and systems?

BISHOP: Right. Well, generalization is a fundamental property, of course. If we couldn’t generalize, there’d be no point in building these systems. And again, these, these foundational principles apply equally at a very large scale as they do at a, at a smaller scale. But the concept of generalization really has to do with modeling the distribution from which the data is generated. So if you think about a large language model, it’s trained by predicting the next word or predicting the next token. But really what we’re doing is, is creating a task for the model that forces it to learn the underlying distribution. Now, that distribution may be extremely complex, let’s say, in the case of natural language. It can convey a tremendous amount of meaning. So, really, the system is forced to … in order to get the best possible performance, in order to make the best prediction for the next word, if you like, it’s forced to effectively understand the meaning of the content of the data. In the case of language, the meaning of effectively what’s being said. And so from a mathematical point of view, there’s a very close relationship between learning this probability distribution and the problem of data compression, because it turns out if you want to compress data in a lossless way, the optimal way to do that is to learn the distribution that generates the data. So that’s,  that’s … we show that in the book, in fact. And so, and the best way to … let’s take the example of images, for instance. If you’ve got a very, very large number of natural images and you had to compress them, the most efficient way to compress them would be to understand the mechanisms by which the images come about. There are objects. You could, you could pick a car or a bicycle or a house. There’s lighting from different angles, shadows, reflections, and so on. 
And learning about those mechanisms—understanding those mechanisms—will give you the best possible compression, but it’ll also give you the best possible generalization. 
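Bishop’s point that lossless compression and distribution learning coincide can be made concrete with a toy sketch (the three-symbol “source” below is invented purely for illustration): an ideal lossless coder built on a model q spends −log₂ q(s) bits on symbol s, so its expected cost is the cross-entropy between the data distribution and the model, which is minimized exactly when the model matches the data.

```python
import math

# Toy "source": symbols drawn from a known distribution.
true_dist = {"a": 0.5, "b": 0.25, "c": 0.25}

def bits_per_symbol(model: dict, source: dict) -> float:
    """Expected code length (bits/symbol) of an ideal lossless coder that
    uses `model`'s probabilities to encode symbols drawn from `source`.
    This is the cross-entropy H(source, model)."""
    return sum(p * -math.log2(model[s]) for s, p in source.items())

# Coding with the true distribution achieves the entropy -- the optimum.
optimal = bits_per_symbol(true_dist, true_dist)
# A mismatched model pays extra bits on every symbol.
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
mismatched = bits_per_symbol(uniform, true_dist)

print(f"true model:    {optimal:.3f} bits/symbol")     # 1.500
print(f"uniform model: {mismatched:.3f} bits/symbol")  # 1.585
```

The mismatched model’s overhead (about 0.085 bits per symbol here) is exactly the KL divergence between the data distribution and the model, so driving the code length down to its minimum is the same task as learning the true distribution.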

LLORENS: Let’s talk briefly about one last fundamental concept—inductive bias. Of course, as you mentioned, AI models are learned from data and experience, and my question for you is, to what extent do the neural architectures underlying those models represent an inductive bias that shapes the learning?

BISHOP: This is a really interesting question, as well, and it sort of reflects the journey that neural nets have been on in the last, you know, 30–35 years since we first started using gradient-based methods to train them. So, so the idea of inductive bias is that, actually, you can only learn from data in the presence of assumptions. There’s, actually, a theorem called the “no free lunch” theorem, which proves this mathematically. And so, to be able to generalize, you have to have data and some sort of assumption, some set of assumptions. Now, if you go back, you know, 30 years, 35 years, when I first got excited about neural nets, we had very simple one- and two-layer neural nets. We had to put a lot of assumptions in. We’d have to code a lot of human expert knowledge into feature extraction, and then the neural net would do a little bit of, the last little bit of work of just mapping that into a, sort of, a linear representation and then, then learning a classifier or whatever it was. And then over the years as we’ve learned to train bigger and richer neural nets, we can allow the data to have more influence and then we can back off a little bit on some of that prior knowledge. And today, when we have models like large-scale transformers with a trillion parameters learned on vast datasets, we’re letting the data do a lot of the heavy lifting. But there always has to be some kind of assumption. So in the case of transformers, there are inductive biases related to the idea of attention. So that’s a, that’s a specific structure that we bake into the transformer, and that turns out to be very, very successful. But there’s always inductive bias somewhere.
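The attention structure Bishop names as the transformer’s inductive bias can be sketched in plain Python. This is a single scaled dot-product attention head on hand-made 2-D vectors, with no learned weights, purely illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain lists.
    Each output is a weighted mix of the values, with weights set by
    query-key similarity -- the structural prior baked into every
    transformer layer."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# A query nearly identical to one key attends almost entirely to that
# key's value.
q = [[10.0, 0.0]]
k = [[10.0, 0.0], [0.0, 10.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
print(attention(q, k, v))  # ~[[1.0, 0.0]]
```

The point of the example is that nothing here is learned: the “mix values by query-key similarity” structure is fixed in advance, and only the projections that produce queries, keys, and values are trained in a real transformer.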

LLORENS: Yeah, and I guess with these new, you know, generative pretrained models, there’s also some inductive bias you’re imposing in the inferencing stage, just with your, with the way you prompt the system. 

BISHOP: And, again, this is really interesting. The whole field of deep learning has become incredibly rich in terms of pretraining, transfer learning, the idea of prompting, zero-shot learning. The field has exploded really in the last 10 years—the last five years—not just in terms of the number of people and the scale of investment, number of startups, and so on, but the sort of the richness of ideas and, and, and techniques like automatic differentiation, for example, that mean we don’t have to code up all the gradient optimization steps. It allows us to explore a tremendous variety of different architectures very easily, very readily. So it’s become just an amazingly exciting field in the last decade. 

LLORENS: And I guess we’ve, sort of, intellectually pondered here in the first few minutes the current state of the field. But what was it like for you when you first used, you know, a state-of-the-art foundation model? What was that moment like for you?  

BISHOP: Oh, I can remember it clearly. I was very fortunate because I was given, as you were, I think, a very early access to GPT-4, when it was still very secret. And I, I’ve described it as being like the, kind of, the five stages of grief. It’s a, sort of, an emotional experience actually. Like first, for me, it was, like, a, sort of, first encounter with a primitive intelligence compared to human intelligence, but nevertheless, it was … it felt like this is the first time I’ve ever engaged with an intelligence that was sort of human-like and had those first sparks of, of human-level intelligence. And I found myself going through these various stages of, first of all, thinking, no, this is, sort of, a parlor trick. This isn’t real. And then, and then it would do something or say something that would be really quite shocking and profound in terms of its … clearly it was understanding aspects of what was being discussed. And I had several rounds of that. And then, then the next, I think, was that real? Did I, did I imagine that? And go back and try again and, no, there really is something here. So, so clearly, we have quite a way to go before we have systems that really match the incredible capabilities of the human brain. But nevertheless, I felt that, you know, after 35 years in the field, here I was encountering the first, the first sparks, the first hints, of real machine intelligence. 

LLORENS: Now let’s get into your book. I believe this is your third textbook. You contributed a text called Neural Networks for Pattern Recognition in ’95 and a second book called Pattern Recognition and Machine Learning in 2006, the latter still being on my own bookshelf. So I think I can hazard a guess here, but what inspired you to start writing this third text?

BISHOP: Well, really, it began with … actually, the story really begins with the COVID pandemic and lockdown. It was 2020. The 2006 Pattern Recognition and Machine Learning book had been very successful, widely adopted, still very widely used even though it predates the, the deep learning revolution, which of course one of the most exciting things to happen in the field of machine learning. And so it’s long been on my list of things to do, to update the book, to bring it up to date, to include deep learning. And when the, when the pandemic lockdown arose, 2020, I found myself sort of imprisoned, effectively, at home with my family, a very, very happy prison. But I needed a project. And I thought this would be a good time to start to update the book. And my son, Hugh, had just finished his degree in computer science at Durham and was embarking on a master’s degree at Cambridge in machine learning, and we decided to do this as a joint project during, during the lockdown. And we’re having a tremendous amount of fun together. We quickly realized, though, that the field of deep learning is so, so rich and obviously so important these days that what we really needed was a new book rather than merely, you know, a few extra chapters or an update to a previous book. And so we worked on that pretty hard for nearly a couple of years or so. And then, and then the story took another twist because Hugh got a job at Wayve Technologies in London building deep learning systems for autonomous vehicles. And I started a new team in Microsoft called AI4Science. We both found ourselves extremely busy, and the whole project, kind of, got put on the back burner. And then along came GPT and ChatGPT, and that, sort of, exploded into the world’s consciousness. And we realized that if ever there was a time to finish off a textbook on deep learning, this was the moment. 
And so the last year has really been absolutely flat out getting this ready, in fact, ready in time for launch at NeurIPS this year. 

LLORENS: Yeah, you know, it’s not every day you get to do something like write a textbook with your son. What was that experience like for you? 

BISHOP: It was absolutely fabulous. And, and I hope it was good fun for Hugh, as well. You know, one of the nice things was that it was a, kind of, a pure collaboration. There was no divergence of agendas or any sense of competition. It was just pure collaboration. The two of us working together to try to understand things, try to work out what’s the best way to explain this, and if we couldn’t figure something out, we’d go to the whiteboard together and sketch out some maths and try to understand it together. And it was just tremendous fun. Just a real, a real pleasure, a real honor, I would say. 

LLORENS: One of the motivations that you articulate in the preface of your book is to make the field of deep learning more accessible for newcomers to the field. Which makes me wonder what your sense is of how accessible machine learning actually is today compared to how it was, say, 10 years ago. On the one hand, I personally think that the underlying concepts around transformers and foundation models are actually easier to grasp than the concepts from previous eras of machine learning. Today, we also see a proliferation of helpful packages and toolkits that people can pick up and use. And on the other hand, we’ve seen an explosion in terms of the scale of compute necessary to do research at the frontiers. So net, what’s your concept of how accessible machine learning is today?

BISHOP: I think you’ve hit on some good points there. I would say the field of machine learning has really been through these three eras. The first was the focus on neural networks. The second was when, sort of, neural networks went onto the back burner. As you, you hinted there, there was a proliferation of different ideas—Gaussian processes, graphical models, kernel machines, support vector machines, and so on—and the field became very broad. There are many different concepts to, to learn. Now, in a sense, it’s narrowed. The focus really is on deep neural networks. But within that field, there has been an explosion of different architectures and different … and not only in terms of the number of architectures. Just the sheer number of papers published has, has literally exploded. And, and so it can be very daunting, very intimidating, I think, especially for somebody coming into the field afresh. And so really the value proposition of this book is to distill out the, you know, 20 or so foundational ideas and concepts that you really need to understand in order to understand the field. And the hope is that if you’ve really understood the content of the book, you’d be in pretty good shape to pretty much read any, any paper that’s published. In terms of actually using the technology in practice, yes, on the one hand, we have these wonderful packages, and especially automatic differentiation, which I mentioned before, is really quite revolutionary. And now you can, you can put things together very, very quickly, a lot of open-source code that you can quickly bolt together and assemble lots of different, lots of different things, try things out very easily. It’s true, though, that if you want to operate at the very cutting edge of large-scale machine learning, that does require resources on a very large scale. So that’s obviously less accessible. But if your goal is to understand the field of machine learning, then, then I hope the book will serve a good purpose there. 
And in one sense, the fact that the packages are so accessible and so easy to use really hides some of the inner workings, I would say, of these, of these systems. And so I think in a way, it’s almost too easy just to train up a neural network on some data without really understanding what’s going on. So, so the book is really about, if you like, the minimum set of things that you need to know about in order to understand the field, not just to, sort of, turn the crank on it on a package but really understand what’s going on inside. 

LLORENS: One of the things I think you did not set out to do, as you just mentioned, is to create an exhaustive survey of the most recent advancements, which might have been possible, you know, a decade or so ago. How do you personally keep up with the blistering pace of research these days? 

BISHOP: Ah, yes, it’s a, it’s a challenge, of course. So, so my focus these days is on AI4Science, AI for natural science. But that’s also becoming a very large field. But, you know, one of the, one of the wonderful things about being at Microsoft Research is just having fantastic colleagues with tremendous expertise. And so, a lot of what I learn is from, is from colleagues. And we’re often swapping notes on, you know, you should take a look at this paper, did you hear about this idea, and so on, and brainstorming things together. So a lot of it is, you know, just taking time each day to read papers. That’s important. But also, just conversations with, with colleagues. 

LLORENS: OK, you mentioned AI4Science. I do want to get into that. I know it’s an area that you’re passionate about and one that’s become a focus for your career in this moment. And, you know, I think of our work in AI4Science as creating foundation models that are fluent not in human language but in the language of nature. And earlier in this conversation, we talked about distribution. So I want to, kind of, bring you back there. Do you think we can really model all of nature as one wildly complex statistical distribution?

BISHOP: [LAUGHS] Well, that’s, that’s really interesting. I do think I could imagine a future, maybe not too many years down the road, where scientists will engage with the tools of scientific discovery through something like a natural language model. That model will also have understanding of concepts around the structures of molecules and the nature of data, will read scientific literature, and so on, and be able to assemble these ideas together. But it may need to draw upon other kinds of tools. So whether everything will be integrated into one, one overarching tool is less clear to me because there are some aspects of scientific discovery that are being, truly being revolutionized right now by deep learning. For example, our ability to simulate the fundamental equations of nature is being transformed through deep learning, and the nature of that transformation, on the one hand, it leverages, might leverage architectures like diffusion models and large language models, large transformers, and the ability to train on large GPU clusters. But the fundamental goals there are to solve differential equations at a very large scale. And so the kinds of techniques we use there are a little bit different from the ones we’d use in processing natural language, for example. So you could imagine, maybe not too many years in the future, where a scientist will have a, kind of, “super copilot” that they can interact with directly in natural language. And that copilot or system of copilots can itself draw upon various tools. They may be tools that solve the Schrödinger equation to predict the properties of molecules. It might call upon large-scale deep learning emulators that can do a similar thing to the simulators but very, very much more efficiently. 
It might even call upon automated labs, wet labs, that can run experiments and gather data and can help the scientist marshal these resources and make optimal decisions as they go through that iterative scientific discovery process, whether inventing a new battery, electrolyte, or whether discovering a new drug, for example. 

LLORENS: We talked earlier about the “no free lunch” theorem and the concept of inductive bias. What does that look like here in training science foundation models?

BISHOP: Well, it’s really interesting, and maybe I’m a little biased because my background is in physics. I did a PhD in quantum field theory many decades ago. For me, one of the reasons that this is such an exciting field is that, you know, my own career has come full circle. I now get to combine machine learning with physics and chemistry and biology. I think the inductive bias here is, is particularly interesting. If you think about large language models, we don’t have very many, sort of, fundamental rules of language. I mean, the rules of linguistics are really human observations about the structure of language. But neural nets are very good at extracting that, that kind of structure from data. Whereas when we look at physics, we have laws which we believe hold very accurately. For example, conservation of energy or rotational invariance. The energy of a molecule in a vacuum doesn’t depend on its rotation in space, for example. And that kind of inductive bias is very rigorous. We believe that it holds exactly. And so there is … and also, very often, we want to train on data that’s obtained from simulators. So the training data itself is obtained by solving some of those fundamental equations, and that process itself is computationally expensive. So the data can often be in relatively limited supply. So you’re in a regime that’s a little bit different from the large language models. It’s a little bit more like, in a way, machine learning was, you know, 10 to 20 years ago, as you were talking about, where data, data is limited. But now we have these powerful and strong inductive biases, and so there’s, it’s a very rich field of research for how to build in those inductive biases into the machine learning models but in a way that retains computational efficiency. So I personally, actually, find this one of the most exciting frontiers not only of the natural sciences but also of machine learning. 
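One simple way to bake in the rotational invariance Bishop describes is to let the model see only pairwise distances, which a rotation cannot change, so the symmetry holds exactly by construction rather than being learned from data. A toy 2-D sketch (the 1/r + r “energy” is an arbitrary placeholder, not a real interatomic potential):

```python
import math

def pairwise_energy(coords):
    """Toy 'energy model' that depends only on pairwise distances, so
    rotational (and translational) invariance holds by construction --
    the kind of hard, exact inductive bias available in the sciences.
    The 1/r + r terms are placeholders, not real physics."""
    e = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            e += 1.0 / r + r
    return e

def rotate2d(coords, theta):
    """Rotate a set of 2-D points about the origin by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in coords]

mol = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8)]
e0 = pairwise_energy(mol)
e1 = pairwise_energy(rotate2d(mol, 1.234))
print(abs(e0 - e1) < 1e-9)  # True: rotating the molecule leaves the energy unchanged
```

In a learned model the same idea applies: if the network’s inputs are invariant features (distances, angles), the symmetry is exact no matter what the trained weights are, which matters when, as Bishop notes, simulation data is expensive and limited.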

LLORENS: Yeah, you know, physics and our understanding of the natural world has come so far, you know, over the last, you know, centuries and decades. And yet our understanding of physics is evolving. It’s an evolving science. And so maybe I’ll ask you somewhat provocatively if baking our current understanding of physics into these models as inductive biases is limiting in some way, perhaps limiting their ability to learn new physics? 

BISHOP: It’s a great question. I think for the kinds of things that we’re particularly interested in, in Microsoft Research, in the AI4Science team, we’re very interested in things that have real-world applicability, things to do with drug discovery, materials design. And there, first of all, we do have a very good understanding of the fundamental equations, essentially Schrödinger equation and fundamental equations of physics, and those inductive biases such as energy conservation. We really do believe they hold very accurately in the domains that we’re interested in. However, there’s a lot of scientific knowledge that is, that represents approximations to that, because you can only really solve these equations exactly for very small systems. And as you start to get to larger, more complex systems, there are, as it were, laws of physics that aren’t, aren’t quite as rigorous, that are somewhat more empirically derived, where there perhaps is scope for learning new kinds of physics. And, certainly, as you get to larger systems, you get, you get emergent properties. So, so conservation of energy doesn’t get violated, but nevertheless, you can have a very interesting new emergent physics. And so it’s, from the point of view of scientific discovery, I think the field is absolutely wide open. If you look at solid-state physics, for example, and device physics, there’s a tremendous amount of exciting new research to be done over the coming decades.

LLORENS: Yeah, you alluded to this. I think maybe it’s worth just double clicking on for a moment because there is this idea of compositionality and emergent properties as you scale up, and I wonder if you could just elaborate on that a little bit. 

BISHOP: Yeah, that’s a good, that’s a good, sort of, picture to have this, sort of, hierarchy of different levels in the way they interact with each other. And at the very deepest level, the level of electrons, you might even more or less directly solve Schrödinger equation or do some very good approximation to that. That quickly becomes infeasible. And as you go up this hierarchy of, effectively, length scales, you have to make more and more approximations in order to be computationally efficient or computationally even practical. But in a sense, the previous levels of the hierarchy can provide you with training data and with validation and verification of what you’re doing at the next level. And so the interplay between these different hierarchies is also very, very, very interesting. So at the level of electrons, they govern forces between atoms, which governs the dynamics of atoms. But once you look at larger molecules, you perhaps can’t simulate the behavior of every electron. You have to make some approximations. And then for larger molecules still, you can’t even track the behavior of every atom. You need some sort of coarse graining and so on. And so you have this, this hierarchy of different length scales. But every single one of those length scales is being transformed by deep learning, by our ability to learn from simulations, learn from those fundamental equations, in some cases, learn also from experimental data and build emulators, effectively, systems that can simulate that particular length scale and the physical and biological properties but do so in a way that’s computationally very efficient. So every layer of this hierarchy is currently being transformed, which is just amazingly exciting. 

LLORENS: You alluded to some of the application domains that stand to get disrupted by advancements in AI4Science. What are a couple of the applications that you’re most excited about? 

BISHOP: There are so many, it would be impossible to list them. But let me give you a couple of domains. I mean, the first one is, is healthcare and the ability to design new molecules, whether it’s small-molecule drugs or more protein-based therapies. That, that whole field is rapidly shifting to a much more computational domain, and that should accelerate our ability to develop new therapies, new drugs. The other class of domains has more to do with materials, and there are a lot of … the applications that we’re interested in relate to sustainability, things to do with capturing CO2 from the atmosphere, creating, let’s say, electricity from hydrogen, creating hydrogen from electricity. We need to do things both ways round. Just storing heat as a form of energy storage. Many, many applications relating to sustainability to do with, to do with protecting our water supply, to do with providing green energy, to do with storing and transporting energy. Many, many applications.

LLORENS: And at the core of all those advancements is deep learning, as we discussed at the start. And so maybe as we, as we close, we can, kind of, come back to your book on deep learning. I don’t have the physical book yet, but there’s a spot on my shelf next to your last book that’s waiting for it. But as we close here, maybe you can tell folks where to look for it or how to get a copy of your new book. 

BISHOP: Oh, sure. It’s dead easy. You go to bishopbook.com, and from there, you’ll see how to order a hardback copy if that’s what you’d like, or there’s a PDF-based e-book version. There’ll be a Kindle version, I believe. But there’s also a free-to-use online version on bishopbook.com, and it’s available there. It’s, sort of, PDF style and fully hyperlinked, free to use, and I hope people will read it, and enjoy it, and learn from it. 

LLORENS: Thanks for a fascinating discussion, Chris. 

BISHOP: Thanks, Ashley.

Abstracts: December 12, 2023


Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The paper, accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), introduces a new method for predicting the binding structures of proteins and ligands during drug development. The method demonstrates improved speed and accuracy over current methods.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Tao Qin, a Senior Principal Research Manager, and Dr. Lijun Wu, a Senior Researcher, both from Microsoft Research. Drs. Qin and Wu are coauthors of a paper titled “FABind: Fast and Accurate Protein-Ligand Binding,” and this paper—which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS—is available now on arXiv. Tao Qin, Lijun Wu, thanks for joining us on Abstracts

LIJUN WU: Thanks. 

TAO QIN: Yeah, thank you. Yeah, it’s great to be here and to share our latest research. 

HUIZINGA: So, Tao, let’s start off with you. In a couple sentences, tell us what issue or problem your research addresses and, more importantly, why people should care about it.


QIN: Yeah, uh, we work on the problem of molecular docking, a computational modeling method used to predict the preferred orientation of one molecule when it binds to a second molecule to form a stable complex. So it aims to predict the binding pose of a ligand in the active site of a receptor and estimate the ligand-receptor binding affinity. This problem is very important for drug discovery and development. Accurately predicting binding poses can provide insights into how a drug candidate might bind to its biological target and whether it is likely to have the desired therapeutic effect. To make an analogy, just like a locker and a key, protein target is a locker, while the ligand is a key. We should carefully design the structure of the key so that it can perfectly fit into the locker. Similarly, the molecular structure should be accurately constructed so that the protein can be well bonded. Then the protein function would be activated or inhibited. Molecular docking is used intensively in the early stages of drug design and discovery to screen a large library of hundreds of thousands of compounds to identify promising lead compounds. It helps eliminate poor candidates and focus on experimental results of those most likely to bind to the target protein well. So clearly, improving the accuracy and also the speed of docking methods, like what we have done in this work, could accelerate the development of new life-saving drugs. 

HUIZINGA: So, Lijun, tell us how your approach builds on and/or differs from what’s been done previously in this field. 

WU: Sure, thanks, yeah. So conventional protein-ligand docking methods, they usually take the sampling and scoring ways. So … which … that means, they will use first some sampling methods to generate multiple protein-ligand docking poses as candidates. And then we will use some scoring functions to evaluate these candidates and select from them and to choose the best ones. So such as DiffDock, a very recent work developed by MIT, which is a very strong model to use the diffusion algorithm to do the sampling in this kind of way. And this kind of method, I say the sampling and scoring methods, they are accurate with good predictions, but of course, they are very slow. So this is a very big limitation because the sampling process usually takes a lot of time. So some other methods such as EquiBind or TANKBind, they treat the docking prediction as a regression task, which is to use deep networks to directly predict the coordinates of the atoms in the molecule. Obviously, this kind of method is much faster than the sampling methods, but the prediction accuracy is usually worse. So therefore, our FABind, which … aims to provide a both fast and accurate method for the docking problem. FABind keeps its fast prediction by modeling in a regression way, and also, we utilize some novel designs to improve its prediction accuracy. 

HUIZINGA: So, Lijun, let’s stay with you for a minute. Regarding your research strategy on this, uh, how would you describe your methodology, and how did you go about conducting this research? 

WU: OK, sure. So when we’re talking about the detailed method, we actually build an end-to-end deep learning framework, FABind, here. So for the protein-ligand docking, FABind divides the docking task as a pocket prediction process and also a pose prediction process. But importantly, we unify these two processes within a single deep learning model, which is a very novel equivalent graph neural network. Here, the pocket means a local part of the whole protein, which are some specific amino acids that can bind to the molecule in the structure space. So simply speaking, this novel graph neural network is stacked by some identity graph neural networks. And the graph neural layer is carefully designed by us, and we use the first graph layer for the pocket prediction and the later layers to do the pose prediction. And for each layer, there are some message passing operations we designed. The first one is an independent message passing, which is to update the information within the protein molecule itself. And the second one is the cross-attention messenger passing, which is to update the information between the whole protein and also the whole molecule so we can then let each other have a global view. And the last one is an interfacial messenger passing, which is to do the update, and we can message pass the information between the closed nodes between the protein and the molecule. So besides, there are also some small points that will help to get an accurate docking model. For example, we use a scheduled training technique to bridge the gap between the training and the inference stages. And also, we combine direct coordinate prediction and also the distance map refinement as our optimization method. 

HUIZINGA: Well, listen, I want to stay with you even more because you’re talking about the technical specifications of your research methodology. Let’s talk about results. What were your major findings on the performance of FABind?

WU: Yeah, the results are very promising. So first we need to care about the docking performance, which is the accuracy of the, uh, docking pose prediction. We compare our FABind to different baselines such as EquiBind, TANKBind, and also, I talked before about the recent strong model DiffDock, developed by MIT. So the results showed that our docking prediction accuracy are very good. They achieve a very competitive performance to the DiffDock like that. But specifically, we need to talk about that the speed is very important. When compared to DiffDock, we achieved about 170 times faster speed than DiffDock. So this is very promising. Besides, the interesting thing is that we found our FABind can achieve very, very strong performance on the unseen protein targets, which means that the protein structure that we have never seen before during the training, we can achieve very good performance. So our FABind achieves significantly better performance with about 10 percent to 40 percent accuracy improvement than DiffDock. This performance demonstrates that the practical effectiveness of our work is very promising since such kinds of new proteins are the most important ones that we need to care for a new disease. 

HUIZINGA: Tao, this is all fascinating, but talk about real-world significance for this work. Who does it help most and how? 

QIN: Yeah. As Lijun has introduced, FABind significantly outperforms earlier methods in terms of speed while maintaining competitive accuracy. This fast prediction capability is extremely important in real-world applications, where high-throughput virtual screening for compound selection is often required for drug discovery. So an efficient virtual screening process can significantly accelerate the drug discovery process. Furthermore, our method demonstrates great performance on unseen or new proteins, which indicates that our FABind possesses a strong generalization ability. This is very important. Consider the case of SARS-CoV-2, for example, where our knowledge of the protein target is very limited at the beginning of the pandemic. So if we have a robust docking model that can generalize to new proteins, we could conduct a large-scale virtual screening and, uh, confidently select potentially effective ligands. This would greatly speed up the development of new treatments. 

HUIZINGA: So downstream from the drug discovery science, benefits would accrue to people who have diseases and need treatment for those things. 

QIN: Yes, exactly. 

HUIZINGA: OK, well, Tao, let’s get an elevator pitch in here, sort of one takeaway, a golden nugget, uh, that you’d like our listeners to take away from this work. If, if there was one thing you wanted them to take away from the work, what would it be? 

QIN: Yeah, uh, thanks for a great question. So I think one sentence for takeaway is that if for some researchers, they are utilizing molecular docking and they are seeking an AI-based approach, our FABind method definitely should be in their consideration list, especially considering the exceptional predictive accuracy and the high computational efficiency of our method.

HUIZINGA: Finally, Tao, what are the big questions and problems that remain in this area, and what’s next on your research agenda? 

QIN: Actually, there are multiple unaddressed questions along this direction, so I think those are all opportunities for further exploration. So here I just give three examples. First, our method currently tackles rigid docking, where the target protein structure is assumed to be fixed, leaving only the ligand structure to be predicted. However, in a more realistic scenario, the protein is dynamic during molecular binding. So therefore, exploring flexible docking becomes an essential aspect. Second, our approach assumes that the target protein has only one binding pocket. In reality, a target protein may have multiple binding pockets. So this situation will be more challenging. So how to address such kind of significant challenge is worth exploration. Third, in the field of drug design, sometimes we need to find a target or we need to find a drug compound that can bind with multiple target proteins. In this work, we only consider a single target protein. So the accurate prediction of docking for multiple target proteins poses a great challenge. 

HUIZINGA: Well, Tao Qin and Lijun Wu, thank you for joining us today. And to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts

[MUSIC FADES]



Steering at the Frontier: Extending the Power of Prompting



We’re seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward, zero-shot prompts. Beyond basic, out-of-the-box prompting, we’ve been exploring new prompting strategies, showcased in our Medprompt work, to evoke the powers of specialists.  

Today, we’re sharing information on Medprompt and other approaches to steering frontier models in promptbase (opens in new tab), a collection of resources on GitHub. Our goal is to provide information and tools to engineers and customers to evoke the best performance from foundation models. We’ll start by including scripts that enable replication of our results using the prompting strategies that we present here. We’ll be adding more sophisticated general-purpose tools and information over the coming weeks.  

As an illustration of the capabilities of frontier models and of opportunities to harness and extend recent efforts in reaching state-of-the-art (SoTA) results via steering GPT-4, we’ll review SoTA results on benchmarks that Google chose for evaluating Gemini Ultra. Our end-to-end exploration, prompt design, and computation of performance took just a couple of days.


Let’s focus on the well-known MMLU (opens in new tab) (Measuring Massive Multitask Language Understanding) challenge that was established as a test of general knowledge and reasoning powers of large language models.  The complete MMLU benchmark contains tens of thousands of challenge problems of different forms across 57 areas from basic mathematics to United States history, law, computer science, engineering, medicine, and more.  

In our Medprompt study, we focused on medical challenge problems, but found that the prompt strategy could have more general-purpose application and examined its performance on several out-of-domain benchmarks—despite the roots of the work on medical challenges. Today, we report that steering GPT-4 with a modified version of Medprompt achieves the highest score ever achieved on the complete MMLU.

In our explorations, we initially found that applying the original Medprompt to GPT-4 on the comprehensive MMLU achieved a score of 89.1%. By increasing the number of ensembled calls in Medprompt from five to 20, performance by GPT-4 on the MMLU further increased to 89.56%. To achieve a new SoTA on MMLU, we extended Medprompt to Medprompt+ by adding a simpler prompting method and formulating a policy for deriving a final answer by integrating outputs from both the base Medprompt strategy and the simple prompts. The synthesis of a final answer is guided by a control strategy governed by GPT-4 and inferred confidences of candidate answers. More details on Medprompt+ are provided in the promptbase repo. A related method for coupling complex and simple queries was harnessed by the Google Gemini team. GPT-4 steered with the modified Medprompt+ reaches a record score of 90.10%. We note that Medprompt+ relies on accessing confidence scores (logprobs) from GPT-4. These are not publicly available via the current API but will be enabled for all in the near future.
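A minimal sketch of what this kind of answer synthesis could look like (a hypothetical helper, not the actual Medprompt+ control strategy, which is itself guided by GPT-4 and detailed in the promptbase repo): candidate answers from the base strategy and the simple prompts are combined by majority vote, with ties broken by model-reported confidence (e.g., derived from logprobs).

```python
from collections import Counter

def synthesize_answer(base_answers, simple_answers, confidences):
    """Combine candidate answers from a base (Medprompt-style) strategy and
    a simple prompting strategy via majority vote, breaking ties with
    per-candidate confidence scores (e.g., derived from logprobs).
    `confidences` is a list parallel to base_answers + simple_answers."""
    candidates = base_answers + simple_answers
    votes = Counter(candidates)
    top_count = max(votes.values())
    tied = [a for a, c in votes.items() if c == top_count]
    if len(tied) == 1:
        return tied[0]

    # Tie-break: prefer the tied answer with the highest mean confidence.
    def mean_conf(ans):
        scores = [conf for a, conf in zip(candidates, confidences) if a == ans]
        return sum(scores) / len(scores)

    return max(tied, key=mean_conf)
```

For example, with ensembled answers ["B", "B", "C"] from the base strategy and ["C"] from the simple prompt, the vote is tied and the answer with the higher average confidence wins.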

Figure 1. Reported performance of multiple models and methods on the MMLU benchmark. From left to right: PaLM 2-L (5-shot), 78.4%; Claude 2 (5-shot CoT), 78.5%; Inflection-2 (5-shot), 79.6%; Gemini Pro (CoT@8), 79.13%; Gemini Ultra (CoT@32), 90.04%; GPT-4-1106 (5-shot), 86.4%; GPT-4-1106 (Medprompt @ 5), 89.1%; GPT-4-1106 (Medprompt @ 20), 89.56%; GPT-4-1106 (Medprompt+ @ 31), 90.10%.

While systematic prompt engineering can yield maximal performance, we continue to explore the out-of-the-box performance of frontier models with simple prompts. It’s important to keep an eye on the native power of GPT-4 and how we can steer the model with zero- or few-shot prompting strategies. As demonstrated in Table 1, starting with simple prompting is useful to establish baseline performance before layering in more sophisticated and expensive methods.

Benchmark GPT-4 Prompt GPT-4 Results Gemini Ultra Results
MMLU Medprompt+ 90.10% 90.04%
GSM8K Zero-shot 95.27% 94.4%
MATH Zero-shot 68.42% 53.2%
HumanEval Zero-shot 87.8% 74.4%
BIG-Bench-Hard Few-shot + CoT* 89.0% 83.6% 
DROP Zero-shot + CoT 83.7% 82.4%
HellaSwag 10-shot** 95.3%** 87.8%
* followed the norm of evaluations and used standard few-shot examples from dataset creators 
** source: Google 

Table 1: Model, strategies, and results
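For the simpler baselines in Table 1, a hedged sketch of how zero-shot, few-shot, and chain-of-thought prompts can be assembled (an illustrative format only, not the exact templates used in promptbase):

```python
def build_prompt(question, examples=None, chain_of_thought=False):
    """Assemble a simple prompt. With no examples this is zero-shot; with
    worked (question, answer) examples it is few-shot; chain_of_thought
    appends a cue asking the model to reason step by step."""
    parts = []
    for ex_q, ex_a in (examples or []):
        parts.append(f"Q: {ex_q}\nA: {ex_a}\n")
    parts.append(f"Q: {question}\nA:")
    if chain_of_thought:
        parts[-1] += " Let's think step by step."
    return "\n".join(parts)
```

The same question can then be sent through any of the strategies in the table by toggling the examples and chain-of-thought cue.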

We encourage you to check out the promptbase repo (opens in new tab) on GitHub for more details about prompting techniques and tools. This area of work is evolving with much to learn and share. We’re excited about the directions and possibilities ahead.



Phi-2: The surprising power of small language models


Contributors

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1 (opens in new tab), achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5 (opens in new tab), with performance comparable to models 5x larger.

We are now releasing Phi-2 (opens in new tab), a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 (opens in new tab) available on the Azure model catalog to foster research and development on language models.


Key Insights Behind Phi-2

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following upon our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows a clear boost in Phi-2’s benchmark scores.

Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models on commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU, SQuADv2, BoolQ), math (GSM8k), coding (HumanEval, MBPP), and BIG-Bench Hard; Phi-2 outperforms Phi-1.5 in all categories. All tasks are evaluated in 0-shot except for BBH and MMLU, which use 3-shot CoT and 5-shot, respectively.

Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of synthetic and web datasets for NLP and coding. Training took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 thanks to our tailored data curation technique; see our previous tech report (opens in new tab) for more details. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio (opens in new tab).

Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6,541 sentences are selected and scored between 0 and 1 based on scaled perplexity and sentence toxicity; a higher score indicates the model is less likely to produce toxic sentences compared to benign ones. Phi-1.5 achieves the highest score in every category, Phi-2 the second highest, and Llama-7B the lowest.
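As a rough illustration of this kind of perplexity-based scoring, here is a minimal sketch (an assumption about the general idea, not the exact ToxiGen scoring protocol) that maps a model’s perplexities on paired toxic and benign sentences to a score in [0, 1]:

```python
import math

def safety_score(ppl_benign, ppl_toxic):
    """Illustrative sketch (NOT the exact ToxiGen protocol): a model that
    finds toxic text more surprising (higher perplexity) than comparable
    benign text scores closer to 1; the reverse scores closer to 0."""
    # Log-perplexity gap, squashed into (0, 1) with a logistic function.
    gap = math.log(ppl_toxic) - math.log(ppl_benign)
    return 1.0 / (1.0 + math.exp(-gap))
```

Equal perplexities on both sentences would yield a neutral 0.5 under this sketch.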

Phi-2 Evaluation

Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3 shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8 shot)), and coding (HumanEval, MBPP (3-shot)).

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to rule out this possibility, which can be found in our first report, “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. In that spirit, we also evaluated Phi-2 on several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

Model    Size   BBH    Commonsense Reasoning   Language Understanding   Math   Coding
Llama-2  7B     40.0   62.2                    56.7                     16.5   21.0
Llama-2  13B    47.8   65.0                    61.9                     34.2   25.4
Llama-2  70B    66.5   69.2                    67.6                     64.1   38.3
Mistral  7B     57.2   66.4                    63.7                     46.4   39.4
Phi-2    2.7B   59.2   68.8                    62.0                     61.1   53.7
Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.

Model          Size   BBH    BoolQ   MBPP   MMLU
Gemini Nano 2  3.2B   42.4   79.3    27.2   55.8
Phi-2          2.7B   59.3   83.3    59.1   56.7
Table 2. Comparison between Phi-2 and Gemini Nano 2 on Gemini’s reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. We observed a behavior in accordance with the expectation we had given the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:

Figure 4. Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation. Given the prompt “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”, Phi-2 explains the conversion of potential energy to kinetic energy, provides the formulas for each, and computes the correct speed.
Figure 5. Following Gemini’s test, we further queried Phi-2 with a student’s wrong answer to the skier problem to see whether it could identify the mistake (it did, despite Phi-2 not being fine-tuned for chat or instruction following): Phi-2 pointed out that the student used the wrong formula for potential energy and provided the correct one. We note, however, that this is not a fully apples-to-apples comparison with the Gemini Ultra output described in the Gemini report; in that case, the student’s answer was given as an image with handwritten text, whereas in ours it was raw text.



Abstracts: December 11, 2023


Microsoft Research Podcast: Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce Deep Language Networks, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Alessandro Sordoni, a Principal Researcher from Microsoft Research. Dr. Sordoni is coauthor of a paper titled “Joint Prompt Optimization of Stacked LLMs using Variational Inference,” and this paper, which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS, is available now on arXiv. Alessandro, thanks for joining us on Abstracts!


ALESSANDRO SORDONI: Hi, Gretchen, thank you for having me.

HUIZINGA: So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.

SORDONI: So in this paper, our starting points are large language models, and to make large language models solve tasks, one of the ways that is currently used is to prompt them. By prompting that means just giving instruction to them, and hopefully by joining instruction and the input of the task, the language model can solve the task following the rules specified in the instructions. And there has been some approaches already in the literature to actually infer what that instruction is without human intervention. And in this paper, we operate in that space, which is called kind of automatic prompt engineering. And our specific problem is to, one, how to actually infer those prompts for a language model. And, two, what happens if actually the output of that large language model gets into another language model and both language model needs prompt to operate? And so basically, we give sort of an algorithm to solve that joint prompt optimization. That’s why it’s called joint.

HUIZINGA: So what’s the underlying issue there that we should care about as potential users of this technology?

SORDONI: There are some problems that cannot be just solved by kind of one instruction or rule, I would say, but they necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem into a different language that is formulated in. And the second call is now recompose this visualization that you have produced to solve the problem itself. And so basically, in that context, you can think about this as kind of augmenting the computational power of the language model by splitting the one call in multiple calls.

HUIZINGA: Well, go in a little deeper on the work that this builds on. All research kind of gets a prompt—no pun intended—from previous work. So how does your work build on and/or differ from what’s been done previously in this field?

SORDONI: I would say that our work started more with this intuition that LLMs are just kind of black-box computation units. Now this sort of black box can accept input as input language. The computation is modulated by an instruction and it outputs language, so you can stack these layers, right. So if the weights of this language layer now are the instructions and you can stack them together, how can you optimize them, right? And then we start to think, OK, but this is very related to kind of automatic prompt optimization. The overall kind of prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we did some modifications with respect to how we propose new prompts to language models and how do we evaluate and accept then those that work given some task inputs and outputs. Our goal in the future—I would say in the near future—is going to be to basically integrate optimization that can really express arbitrary graphs …

HUIZINGA: Gotcha …

SORDONI: … of LLM calls right now. But in our paper, we started with the first step, which is, OK, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer and also devising an algorithm that works for two layers right now and that can be extended to multiple layers. But that’s also an engineering problem that needs to be tackled.

HUIZINGA: [LAUGHS] Yeah, always got to get the engineering in there! Well, listen, let’s keep going on this because it sounds like you’re talking about methodology and, and how you conducted this research. So expand a little bit on what you did actually to experiment in this arena.

SORDONI: Yeah, so I think that, uh, really the birth of this paper started from this view of language models as layers modulated by instructions that can be stacked upon each other. And from there, we said, OK, what can we do with this, basically? And so some of us worked on datasets that could be interesting for this new sort of methodology, I would say, or architecture. So basically, one question was, how do you actually test if this works in any way? And so we tried to select some datasets that were more natural language tasks—for example, sentiment classification—and some datasets that were more about reasoning tasks. And our hunch was that basically stacking multiple layers together would help more in those tasks that require some sort of decomposition of reasoning.

HUIZINGA: Right.

SORDONI: And for the reasoning tasks, we worked with this BIG-Bench Hard setting. And so parallel to that, some of us worked—for example, myself—on the optimization part, really the algorithm part. And at first, we tried to do some sort of backpropagation. But I quickly saw that there were some issues with that … probably empirical issues. And so we tried to actually have a more formal understanding of this optimization algorithm by resorting to variational inference—basically, to understand the first layer as producing some text and considering this text as a latent variable. When you open that box, it links in your head to a bunch of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques in this context.

HUIZINGA: Interesting. So what were the results of this research? What did you find?

SORDONI: So what we found was that, indeed, the tasks in which these approaches seem to help the most are the tasks that require this sort of decomposition and reasoning. The first thing that was really, really kind of cool was that you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some tasks, prompt optimization can be understood as kind of really tweaking the model towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to basically not be very clear about how to instruct the model. So the model has to do a lot of work to understand …

HUIZINGA: Right …

SORDONI: … what the human really wants to say to them. And so basically, this sort of prompt optimization acts as a sort of translator, where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that kind of level of abstraction that was sort of required and needed in the prompt to really solve this task very, very well. The other finding is that this problem is very hard. It’s very tricky to optimize prompts this way because this type of optimization doesn’t really follow a gradient direction like in deep neural networks.

HUIZINGA: Yeah.

SORDONI: It’s basically a sort of trial and error. And this trial and error is very finicky. There are a lot of problems there. But I feel like I’m hopeful in the sense that this paper allowed us, I think, to hone in on some very specific problems that, if we solve them, can make the overall problem much easier.

HUIZINGA: Let’s talk for a second about real-world impact of this research. Let’s extrapolate out from the lab and move into life. Who benefits from this most, and how do they benefit?

SORDONI: I think that, as I said before, these automatic prompt optimization methods could benefit a large audience, or a large number of users, I would say, because they could be understood as a sort of translator between the user’s needs and what the LLM can do. For example, one effort here in Montréal that was led by my colleagues was kind of building this sort of interactive agent that would, through interaction with the user, form a prompt interactively. So, for example, in DLN, like in our paper, we assume that we have a task and we do not have input or interaction with the user, right. But in more realistic scenarios, you might want to actually instruct your model to do something by some sort of active learning process where the model actually asks you whether what it did was favorable or desirable or not.

HUIZINGA: Right.

SORDONI: And the user can actually interact with that output, right. For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.

HUIZINGA: I want to take a second here to spell out some acronyms. You’ve referred to DLNs, and I don’t think our audience might know what that means. I’m assuming they know LLM means “large language model.” That’s sort of in the parlance. But talk a little bit about what that other acronym is.

SORDONI: Yeah, sorry I didn’t mention this. So DLN is basically how we refer to these architectures that are composed of language model layers. DLN spells out as “deep language network.”

HUIZINGA: Gotcha.

SORDONI: People are free to use this name or not.

HUIZINGA: No, I like it …

SORDONI: I’m not a big fan of imposing acronyms on the world [LAUGHS], but that’s a shorter version of it. So, yeah, so it’s really the idea that a language model is a layer in this hierarchy, and the layer accepts text as input, outputs text, and really is modulated by an instruction, or prompt, that we want to learn.

HUIZINGA: And so the DLN is a deep language network, and it sort of acts like a deep neural network but uses language models as its layers.

SORDONI: Exactly, exactly, yes.
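
The two-layer architecture Sordoni describes can be sketched in a few lines, with a toy stand-in for the LLM call (the `toy_lm` function and all names here are illustrative, not the DLN codebase):

```python
# Minimal sketch of a deep language network (DLN): each "layer" is a
# language-model call modulated by a learnable prompt. `toy_lm` is a
# stand-in for a real LLM API.

def toy_lm(prompt: str, text: str) -> str:
    """Pretend LLM: just tags the input with the prompt it was given."""
    return f"[{prompt}] {text}"

class LanguageLayer:
    def __init__(self, prompt: str):
        self.prompt = prompt  # the "weights" of this layer, learned by prompt optimization

    def __call__(self, text: str) -> str:
        return toy_lm(self.prompt, text)

# A two-layer DLN: the first layer's output text acts as a latent variable
# that conditions the second layer, mirroring hidden activations in a DNN.
layer1 = LanguageLayer("Decompose the problem into steps.")
layer2 = LanguageLayer("Answer using the steps above.")

def dln(x: str) -> str:
    return layer2(layer1(x))

print(dln("What is 17 * 24?"))
```

Prompt optimization then searches over the two prompt strings, treating the intermediate text as a latent variable, as discussed below.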

HUIZINGA: So this is a question I ask everyone, and it’s sort of like, how could you boil this down to one little takeaway if you’re standing on an elevator with somebody and they say, what do you do, Alessandro? So if there’s one thing you want people to take away from this work, what would it be?

SORDONI: The first thing that comes to my mind is really the fact that these models can be understood as a class, I would say, of probability distributions that are modulated by these prompts. And so basically, once you have that—once a language model just defines a probability distribution p over sentences given some prompt—you can apply a lot of algorithms with those models. You can apply algorithms that resemble EM, expectation maximization, or … I mean, we applied a form of that with variational inference, but maybe it could open the path for other types of usages where these are just very, very powerful probability distributions over sentences that are considered as latent variables. I hope that our paper can show a more or less practical implementation of that idea. And basically, if you have to optimize, for example, prompts with one or two layers, you can definitely try our approach.

HUIZINGA: Well, finally, and we’ve been talking about this kind of already, but there seem to be some unresolved problems in the area. What do researchers like you need to be looking at in order to solve those? Sort of what’s next on the research agenda, whether it’s you or other researchers in this field?

SORDONI: So let me try to answer with something that really excites me now. What we are doing is producing text, right, with the language model. But we are producing this text in such a way that it helps to solve a problem. And basically, this variational inference method and framework gives us a way of understanding what it means to be good text. Like, what does it mean to be a good, or useful, latent variable?

HUIZINGA: Right.

SORDONI: What does it mean to produce good data? So, for example, these big models are really data creators—this is generative AI, right. But can we actually teach them to produce data such that this data can be helpful to solve tasks, or to condition those same models to solve a task?

HUIZINGA: Right.

SORDONI: And what are the objective functions that promote the production of this useful data—what “useful” means from a mathematical perspective? I think that, apart from the prompt optimization angle, DLN kind of opened my mind a little bit to investigating ways of understanding what it means for some generated text to be useful for solving a task, I would say. Yeah.

HUIZINGA: Alessandro Sordoni, thanks for joining us today. And thanks to our listeners for tuning in. If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts!

The post Abstracts: December 11, 2023 appeared first on Microsoft Research.


NeurIPS 2023 highlights breadth of Microsoft’s machine learning innovation


Research Focus: NeurIPS
December 11, 2023

Microsoft is proud to sponsor the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). This interdisciplinary forum brings together experts in machine learning, neuroscience, statistics, optimization, computer vision, natural language processing, life sciences, natural sciences, social sciences, and other adjacent fields. We are pleased to share that Microsoft has over 100 accepted papers and is offering 18 workshops at NeurIPS 2023. 

This year’s conference includes three papers from Microsoft that were chosen for oral presentations, which feature groundbreaking concepts, methods, or applications addressing pressing issues in the field. Additionally, our spotlight posters, also highlighted below, have been carefully curated by conference organizers, exhibiting novelty, technical rigor, and the potential to significantly impact the landscape of machine learning. This blog post celebrates those achievements.

Oral Presentations

Bridging Discrete and Backpropagation: Straight-Through and Beyond

Gradient computations are pivotal to deep learning’s success, yet they predominantly depend on backpropagation, a technique limited to continuous variables. The paper Bridging Discrete and Backpropagation: Straight-Through and Beyond tackles this limitation. It introduces ReinMax, extending backpropagation’s capability to estimate gradients for models incorporating discrete variable sampling. In the study’s extensive experiments, ReinMax demonstrates consistent and significant performance gains over the state of the art. More than just a practical solution, the paper sheds light on existing deep learning practices. It elucidates that the ‘Straight-Through’ method, once considered merely a heuristic trick, is actually a viable first-order approximation for the general multinomial case. Correspondingly, ReinMax achieves second-order accuracy in this context without the complexities of second-order derivatives, incurring negligible computational overhead.
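
To make the Straight-Through idea concrete, here is a hedged numerical sketch (not the paper’s ReinMax implementation): the forward pass uses a hard one-hot sample while the backward pass routes gradients through the softmax Jacobian. For a linear downstream objective this estimator happens to be exact, which the code verifies.

```python
import numpy as np

# Straight-Through (ST) sketch: forward uses a hard one-hot sample; backward
# pretends the sample was the softmax, so gradients flow through the softmax
# Jacobian. For a linear objective f(x) = c @ x the estimator is exact.
# ReinMax improves on ST with second-order accuracy for general f (not shown).

rng = np.random.default_rng(0)
theta = rng.normal(size=4)              # logits
c = rng.normal(size=4)                  # linear objective f(x) = c @ x

p = np.exp(theta - theta.max()); p /= p.sum()
J = np.diag(p) - np.outer(p, p)         # softmax Jacobian d p / d theta

# Exact gradient of E_{i~p}[f(e_i)] = c @ p  is  J.T @ c.
true_grad = J.T @ c

# ST estimate: sample a hard one-hot, backprop as if it were the softmax.
i = rng.choice(4, p=p)
hard = np.eye(4)[i]                     # forward value (discrete)
grad_f_at_hard = c                      # df/dx for linear f is constant
st_grad = J.T @ grad_f_at_hard          # ST backward pass

print(np.allclose(st_grad, true_grad))  # True: ST is exact for linear f
```

For nonlinear objectives the ST estimate is biased, which is where second-order corrections like ReinMax come in.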


The MineRL BASALT Competition on Learning from Human Feedback

The growth of deep learning research, including its incorporation into commercial products, has created a new challenge: How can we build AI systems that solve tasks when a crisp, well-defined specification is lacking? To encourage research on this important class of techniques, researchers from Microsoft led The MineRL BASALT Competition on Learning from Human Feedback (opens in new tab), an update to a contest first launched in 2021 (opens in new tab) by researchers at the University of California-Berkeley and elsewhere. The challenge of this competition was to complete fuzzy tasks from English language descriptions alone, with emphasis on encouraging different ways of learning from human feedback as an alternative to a traditional reward signal. 

The researchers designed a suite of four tasks in Minecraft for which writing hardcoded reward functions would be difficult. These tasks are defined by natural language: for example, “create a waterfall and take a scenic picture of it”, with additional clarifying details. Participants must train a separate agent for each task. Agents are then evaluated by humans who have read the task description.

The competition aimed to encourage development of AI systems that do what their designers intended, even when the intent cannot be easily formalized. Besides allowing AI to solve more tasks, this can also enable more effective regulation of AI systems, as well as making progress on value alignment problems, in which the specified objectives of an AI agent differ from those of its users.

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

This comprehensive evaluation platform aims to answer the question: How trustworthy are generative pre-trained transformer (GPT) models? In DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models, researchers focus specifically on GPT-4, GPT-3.5, and a series of open LLMs. They consider diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

The researchers’ evaluations identified previously unpublished vulnerabilities relating to trustworthiness. The team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is partly because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. The researchers also shared their findings with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

This research aims to encourage others in the research community to utilize and build upon this work, potentially pre-empting adversaries who would exploit vulnerabilities to cause harm. To facilitate collaboration, the benchmark code is very extensible and easy to use: a single command is sufficient to run the complete evaluation on a new model.

Spotlight Posters

Differentially Private Approximate Near Neighbor Counting in High Dimensions

Differential privacy (DP) is a widely used tool for preserving the privacy of sensitive personal information. It allows a data structure to provide approximate answers to queries about the data it holds, while ensuring that the removal or addition of a single database entry does not significantly affect the outcome of any analysis.

Range counting (counting the number of data points falling into a given query ball) under differential privacy has been studied extensively. However, current algorithms for this problem come with challenges. One class of algorithms suffers from an additive error that is a fixed polynomial in the number of points. Another class of algorithms allows for polylogarithmic additive error, but the error grows exponentially in the dimension. To achieve the latter, the problem is relaxed to allow a “fuzzy” definition of the range boundary, e.g., a count of the points in a ball of radius r might also include points in a ball of radius cr for some c > 1.

In Differentially Private Approximate Near Neighbor Counting in High Dimensions, researchers present an efficient algorithm that offers a sweet spot between these two classes. The algorithm has an additive error that is an arbitrarily small power of the data set size, depending on how fuzzy the range boundary is, as well as a small (1 + o(1)) multiplicative error. Crucially, the amount of noise added has no dependence on the dimension. This new algorithm introduces a variant of Locality-Sensitive Hashing, utilizing it in a novel manner.
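
For intuition on the privacy mechanics, here is a minimal sketch of the classical Laplace-mechanism baseline for a single ball-count query (not the paper’s LSH-based algorithm; all names are illustrative):

```python
import math, random

# A single "how many points fall in this ball?" query has sensitivity 1
# (adding or removing one point changes the count by at most 1), so adding
# Laplace(1/eps) noise to the exact count gives eps-differential privacy
# for that query. The paper's contribution is controlling error across
# many queries in high dimensions, which this sketch does not attempt.

def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) via inverse-CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_ball_count(points, center, radius, eps, rng):
    exact = sum(1 for p in points if math.dist(p, center) <= radius)
    return exact + laplace(1.0 / eps, rng)

rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(1000)]
noisy = private_ball_count(points, (0.5, 0.5), 0.2, eps=1.0, rng=rng)
print(noisy)  # close to the exact count, off by O(1/eps) additive noise
```

The additive error here is independent of the data size for one query; the hard part the paper addresses is keeping error small over many range queries without dimension-dependent blowup.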

Exposing Attention Glitches with Flip-Flop Language Modeling

Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? The brittleness of these models, particularly when executing long chains of reasoning, seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought.

To help make sense of this fundamentally unsolved problem, Exposing Attention Glitches with Flip-Flop Language Modeling identifies and analyzes the phenomenon of attention glitches, in which the Transformer architecture’s inductive biases intermittently fail to capture robust reasoning. To isolate the issue, the researchers introduce flip-flop language modeling (FFLM), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. This research shows how Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which can be eliminated using various regularization techniques. The preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. The researchers hypothesize that attention glitches account for some of the closed-domain errors occurring in natural LLMs.
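
The flip-flop task described above can be sketched as a data generator; the exact vocabulary and formatting below are illustrative assumptions, not the paper’s benchmark:

```python
import random

# Sketch of flip-flop language modeling (FFLM): the stream is a sequence of
# (instruction, bit) pairs, where "write" stores a bit, "ignore" presents a
# distractor bit, and "read" must be followed by the most recently written
# bit -- a long-range dependency that Transformers intermittently drop.

def make_sequence(length: int, rng: random.Random):
    memory = rng.randint(0, 1)
    seq = ["write", str(memory)]             # always start with a write
    for _ in range(length):
        op = rng.choice(["write", "ignore", "read"])
        if op == "write":
            memory = rng.randint(0, 1)
            seq += ["write", str(memory)]
        elif op == "ignore":
            seq += ["ignore", str(rng.randint(0, 1))]
        else:
            seq += ["read", str(memory)]     # the only deterministic token
    return seq

def check(seq):
    """Oracle: every bit after 'read' equals the last written bit."""
    memory = None
    for op, bit in zip(seq[::2], seq[1::2]):
        if op == "write":
            memory = bit
        elif op == "read" and bit != memory:
            return False
    return True

rng = random.Random(0)
seq = make_sequence(20, rng)
print(check(seq))  # True
```

A model is evaluated only on the bit after each “read”; an attention glitch shows up as a sporadic failure on exactly those positions.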

In-Context Learning Unlocked for Diffusion Models

An emergent behavior of large language models (LLMs) is the ability to learn from context, or in-context learning. With a properly designed prompt structure and in-context learning, LLMs can combine the pre-training of multiple language tasks and generalize well to previously unseen tasks. While in-context learning has been extensively studied in natural language processing (NLP), its applications in the field of computer vision are still limited.

In-Context Learning Unlocked for Diffusion Models presents Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images and text guidance, this model understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, the researchers propose a vision-language prompt that can model a wide range of vision-language tasks, and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes to new, unseen vision tasks with their respective prompts. This model also shows compelling text-guided image editing results.

Optimizing Prompts for Text-to-Image Generation

Generative foundation models, including language models and text-to-image models, can be prompted to follow user instructions. Well-designed prompts can guide text-to-image models to generate amazing images. However, performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, Optimizing Prompts for Text-to-Image Generation proposes prompt adaptation, a general framework that automatically adapts original user input into model-preferred prompts.

The researchers use reinforcement learning to explore better prompts with a language model. They define a reward function that encourages the policy network (i.e., language model) to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that this method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Reinforcement learning further boosts performance, especially on out-of-domain prompts.
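
The reinforcement-learning loop described here can be sketched as a toy bandit: a softmax “policy” over three candidate rewrites and a stubbed reward standing in for the paper’s language-model policy and aesthetic/relevance reward (all candidates and reward values below are made up):

```python
import math, random

# REINFORCE with a moving-average baseline over three fixed prompt
# rewrites. In the paper the policy is a language model generating the
# rewrite and the reward scores image aesthetics plus relevance; here both
# are stubs to show the update rule.

candidates = ["a cat", "a cat, highly detailed, dramatic lighting", "cat???"]
reward = {0: 0.2, 1: 1.0, 2: 0.0}     # stub for the aesthetic+relevance reward

theta = [0.0, 0.0, 0.0]               # policy logits
rng = random.Random(0)
lr, baseline = 0.5, 0.0

def softmax(z):
    m = max(z); e = [math.exp(v - m) for v in z]; s = sum(e)
    return [v / s for v in e]

for step in range(500):
    p = softmax(theta)
    a = rng.choices(range(3), weights=p)[0]
    r = reward[a]
    baseline = 0.9 * baseline + 0.1 * r          # moving-average baseline
    adv = r - baseline
    for i in range(3):                            # REINFORCE: adv * grad log pi
        grad = (1.0 if i == a else 0.0) - p[i]
        theta[i] += lr * adv * grad

p = softmax(theta)
print(candidates[p.index(max(p))])  # the high-reward rewrite wins
```

The policy concentrates on whichever rewrite the reward function prefers, which is the behavior the paper scales up with a real reward model.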

Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

Algorithm design in deep learning can appear to be more like “hacking” than an engineering practice. There are numerous architectural choices and training heuristics, which can often modulate model performance and resource costs in unpredictable and entangled ways. As a result, when training large-scale neural networks (such as state-of-the-art language models), algorithmic decisions and resource allocations are foremost empirically driven, involving the measurement and extrapolation of scaling laws. A precise mathematical understanding of this process is elusive; it cannot be explained by statistics or optimization in isolation.

In Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck, researchers from Microsoft, Harvard, and the University of Pennsylvania explore these algorithmic intricacies and tradeoffs through the lens of a single synthetic task: the finite-sample sparse parity learning problem. In this setting, the above complications are not only evident, but also provable: intuitively, due to the task’s computational hardness, a neural network needs a sufficient combination of resources (“data × model size × training time × luck”) to succeed. This research shows that standard algorithmic choices in deep learning give rise to a Pareto frontier, in which successful learning is “bought” with interchangeable combinations of these resources. They show that algorithmic improvements on this toy problem can transfer to the real world, improving the data-efficiency of neural networks on small tabular datasets.
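
The sparse parity task is easy to state in code; the following sketch generates the dataset and shows the brute-force search whose exponential cost motivates the paper’s resource tradeoffs (sizes and names are illustrative):

```python
import itertools, random

# Finite-sample sparse parity: each input is an n-bit string and the label
# is the XOR (parity) of a fixed hidden subset of k coordinates. The
# learner must discover which k of the n bits matter from a limited sample.

def make_sparse_parity(n_samples, n_bits, k, rng):
    secret = rng.sample(range(n_bits), k)        # hidden relevant coordinates
    data = []
    for _ in range(n_samples):
        x = [rng.randint(0, 1) for _ in range(n_bits)]
        y = sum(x[i] for i in secret) % 2        # parity of the secret bits
        data.append((x, y))
    return secret, data

rng = random.Random(0)
secret, data = make_sparse_parity(200, n_bits=20, k=3, rng=rng)

# Brute-force "learner": try every size-k subset and keep the first one
# consistent with all samples. The search is exponential in k -- this
# computational hardness is exactly why a neural network needs a sufficient
# combination of data, width, training time, and luck to succeed.
for cand in itertools.combinations(range(20), 3):
    if all(sum(x[i] for i in cand) % 2 == y for x, y in data):
        print(sorted(cand) == sorted(secret))    # True (w.h.p. at 200 samples)
        break
```

With 200 samples, a wrong size-3 subset is consistent with all labels with probability about 2^-200, so the first consistent subset is almost surely the hidden one.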

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers

Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. The high computational cost of traditional solution techniques has spurred increasing interest in deep neural network based PDE surrogates. The practical utility of such neural PDE solvers depends on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem.

PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers presents a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Motivated by recent advances in diffusion models, the researchers developed PDE-Refiner, a novel model class that enables more accurate modeling of all frequency components via a multistep refinement process. They validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. They also demonstrate that PDE-Refiner greatly enhances data efficiency, since the denoising objective implicitly induces a novel form of spectral data augmentation. Finally, PDE-Refiner’s connection to diffusion models enables an accurate and efficient assessment of the model’s predictive uncertainty, allowing researchers to estimate when the surrogate becomes inaccurate.

Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations

Randomized experiments are the gold-standard method of determining causal effects, whether in clinical trials to evaluate medical treatments or in A/B tests to evaluate online product offerings. But randomized experiments often need to be stopped prematurely when the treatment or test causes an unintended harmful effect. Existing methods that determine when to stop an experiment early are typically applied to the data in aggregate and do not account for treatment effect heterogeneity.

Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations examines the early stopping of experiments for harm on heterogeneous populations. The paper shows that current methods often fail to stop experiments when the treatment harms a minority group of participants. The researchers use causal machine learning to develop Causal Latent Analysis for Stopping Heterogeneously (CLASH), the first broadly-applicable method for heterogeneous early stopping. They demonstrate CLASH’s performance on simulated and real data and show that it yields effective early stopping for both clinical trials and A/B tests.
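
A toy illustration (not the CLASH algorithm itself) of the failure mode the paper addresses: harm concentrated in a minority subgroup nearly vanishes in the aggregate, so a stopping rule applied only to the overall mean can miss it. All numbers below are made up for the sketch.

```python
import random, statistics

# Simulated experiment: the treatment is neutral for 90% of participants
# but harmful (mean effect -3) for a 10% minority. The aggregate effect is
# diluted to about -0.3, while the subgroup effect is close to -3.

rng = random.Random(1)
n = 5000
effects_all, effects_minority = [], []
for _ in range(n):
    minority = rng.random() < 0.1
    effect = rng.gauss(-3.0 if minority else 0.0, 1.0)  # harm only in minority
    effects_all.append(effect)
    if minority:
        effects_minority.append(effect)

mean_all = statistics.fmean(effects_all)
mean_minority = statistics.fmean(effects_minority)
print(round(mean_all, 2), round(mean_minority, 2))
# Aggregate effect looks mild (around -0.3), but the minority effect is
# close to -3: early-stopping rules applied only in aggregate can miss it.
```

CLASH’s contribution is to discover such latent subgroups from the data with causal machine learning rather than requiring them to be prespecified.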

Survival Instinct in Offline Reinforcement Learning

In offline reinforcement learning (RL), an agent optimizes its performance given an offline dataset. Survival Instinct in Offline Reinforcement Learning presents a novel observation: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.

This research demonstrates that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain bias implicit in common data collection practices. This work shows that this pessimism endows the agent with a “survival instinct”, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. The researchers argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones.

Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics

Molecular dynamics (MD) is a well-established technique for simulating physical systems at the atomic level. When performed accurately, it provides unrivalled insight into the detailed mechanics of molecular motion, without the need for wet lab experiments. MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution (opens in new tab). However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied.

Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics presents an enhanced sampling method that uses a normalizing flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating molecular dynamics of 10^5–10^6 fs. Crucially, Timewarp is transferable between molecular systems: the researchers show that, once trained, Timewarp generalizes to unseen small peptides (2–4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. This new method constitutes an important step towards developing general, transferable algorithms for accelerating MD.
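
The sampling scheme can be sketched as an independence Metropolis-Hastings chain targeting a Boltzmann distribution, with a broad Gaussian standing in for Timewarp’s learned flow proposal (a 1D toy potential, not molecular coordinates):

```python
import math, random

# Independence MH targeting p(x) ∝ exp(-U(x)) on a double-well potential,
# whose two wells play the role of metastable states. Large proposed jumps
# are corrected by the MH acceptance test, so the chain remains exact even
# when the proposal (here a Gaussian; in Timewarp, a normalizing flow) is
# only an approximation of the target.

def U(x):                      # double-well potential: two metastable states
    return (x * x - 1.0) ** 2

def log_p(x):                  # unnormalized Boltzmann log-density
    return -U(x)

def log_q(x):                  # proposal density q(x) = N(0, 2^2), up to a constant
    return -0.5 * (x / 2.0) ** 2

rng = random.Random(0)
x, samples = 1.0, []
for _ in range(20000):
    xp = rng.gauss(0.0, 2.0)   # independent proposal (flow stand-in)
    # Independence-sampler acceptance ratio: p(x') q(x) / (p(x) q(x'))
    log_alpha = (log_p(xp) + log_q(x)) - (log_p(x) + log_q(xp))
    if math.log(rng.random() + 1e-300) < log_alpha:
        x = xp
    samples.append(x)

frac_left = sum(s < 0 for s in samples) / len(samples)
print(round(frac_left, 2))  # both wells visited: roughly half the mass in each
```

A local-step sampler (like short MD runs) would stay trapped in one well for long stretches; the independent, large-jump proposal hops between metastable states freely, which is the acceleration Timewarp targets.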

The post NeurIPS 2023 highlights breadth of Microsoft’s machine learning innovation appeared first on Microsoft Research.


MatterGen: Property-guided materials design



Generative AI has revolutionized how we create text and images. How about designing novel materials? We at Microsoft Research AI4Science are thrilled to announce MatterGen, our generative model that enables broad property-guided materials design.

The central challenge in materials science is to discover materials with desired properties, e.g., high Li-ion conductivity for battery materials. Traditionally, this has been done by first finding novel materials and then filtering down based on the application. This is like trying to create the image of a cat by first generating a million different images and then searching for the one with a cat. In MatterGen, we directly generate novel materials with desired properties, similar to how DALL·E 3 tackles image generation.  

MatterGen is a diffusion model specifically designed for generating novel, stable materials. It also has adapter modules that can be fine-tuned to generate materials under a broad range of constraints, including chemistry, symmetry, and properties. MatterGen generates 2.9 times more stable (≤ 0.1 eV/atom above the convex hull of our training + test data), novel, unique structures than a state-of-the-art model (CDVAE), and its structures are 17.5 times closer to their local energy minimum. MatterGen can directly generate materials satisfying desired magnetic, electronic, and mechanical properties via classifier-free guidance. We verify generated materials with DFT-based workflows.
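
Classifier-free guidance, the mechanism mentioned above for property conditioning, combines conditional and unconditional score estimates at each denoising step. A minimal sketch with toy stand-ins for the model calls (the linear “scores” here are illustrative only):

```python
# Classifier-free guidance sketch: at each denoising step the diffusion
# model is evaluated with and without the property condition, and the two
# score estimates are extrapolated. The toy functions below stand in for
# the two model evaluations.

def score_uncond(x):
    return -x                       # toy unconditional score

def score_cond(x, target):
    return -(x - target)            # toy property-conditioned score

def cfg_score(x, target, w):
    """Guided score: s_uncond + w * (s_cond - s_uncond); w > 1 sharpens."""
    su = score_uncond(x)
    sc = score_cond(x, target)
    return su + w * (sc - su)

# With guidance weight w = 2, the update is pushed further toward the
# conditional direction than either model alone would go.
print(cfg_score(0.0, 1.0, 2.0))  # 2.0
```

Because the same network provides both estimates (by dropping the condition during training), no separate property classifier is needed, hence “classifier-free.”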

Figure 1: Stable and new materials generated by MatterGen under property constraints: high space group symmetry, high bulk modulus, target chemical system, target band gap, high magnetic density, and combined high magnetic density and low HHI index.

Additionally, MatterGen can keep generating novel materials that satisfy a target property, like high bulk modulus, while screening methods instead saturate as they exhaust the materials in the database.

Figure 2: MatterGen discovers more novel stable high-bulk-modulus materials than the screening baseline (x axis: number of DFT property-calculation calls; y axis: number of structures found) and does not plateau with increasing computational resources. MatterGen finds more than 250 materials with a bulk modulus > 400 GPa, while only 2 such materials are found in the reference dataset.

MatterGen can also generate materials given target chemical systems. It outperforms substitution and random structure search baselines equipped with MLFF filtering, especially in challenging 5-element systems. MatterGen also generates structures given target space groups. Finally, we tackle the multi-property materials design problem of finding low-supply-chain risk magnets. MatterGen proposes structures that have both high magnetic density and a low supply-chain risk chemical composition. 

We believe MatterGen is an important step forward in AI for materials design. Our results are currently verified via DFT, which has many known limitations. Experimental verification remains the ultimate test for real-world impact, and we hope to follow up with more results.

None of this would be possible without the highly collaborative work between Andrew Fowler, Claudio Zeni, Daniel Zügner, Matthew Horton, Robert Pinsler, Ryota Tomioka, Tian Xie and our amazing interns Xiang Fu, Sasha Shysheya, and Jonathan Crabbé, as well as Jake Smith, Lixin Sun and the entire AI4Science Materials Design team.  

We are also grateful for all the help from Microsoft Research, AI4Science, and Azure Quantum.

The post MatterGen: Property-guided materials design appeared first on Microsoft Research.


LLMLingua: Innovating LLM efficiency with prompt compression


This research paper was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (opens in new tab) (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.

EMNLP 2023 logo to the left of accepted paper

As large language models (LLMs) advance and their potential becomes increasingly apparent, an understanding is emerging that the quality of their output is directly related to the nature of the prompt that is given to them. This has resulted in the rise of prompting technologies, such as chain-of-thought (CoT) and in-context learning (ICL), which facilitate an increase in prompt length. In some instances, prompts now extend to tens of thousands of tokens, or units of text, and beyond. While longer prompts hold considerable potential, they also introduce a host of issues, such as exceeding the chat window’s maximum limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.

To address these challenges, we introduce a prompt-compression method in our paper, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (opens in new tab),” presented at EMNLP 2023 (opens in new tab). Using a well-trained small language model, such as GPT2-small or LLaMA-7B, LLMLingua identifies and removes unimportant tokens from prompts. This compression technique enables closed LLMs to make inferences from the compressed prompt. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs. This is illustrated in Figure 1.

This is an illustration of the LLMLingua framework, which estimates the important tokens of a prompt based on a small language model. It consists of three modules: a budget controller, iterative token-level prompt compression, and distribution alignment. The framework can compress a complex prompt of 2,366 tokens down to 117 tokens, achieving a 20x compression while maintaining almost unchanged performance.
Figure 1. LLMLingua’s framework

LLMLingua’s method and evaluation

To develop LLMLingua’s framework, we employed a budget controller to balance the sensitivities of different modules in the prompt, preserving the language’s integrity. Our two-stage process began with coarse-grained prompt compression: we first streamlined the prompt by eliminating certain sentences and then individually compressed the remaining tokens. To preserve coherence, we employed an iterative token-level compression approach, refining the individual relationships between tokens. Additionally, we fine-tuned the smaller model to capture the distribution information from different closed LLMs by aligning it with the patterns in the LLMs’ generated data. We did this through instruction tuning.
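The coarse-to-fine idea can be illustrated with a toy sketch. Note this is only a schematic, not the paper's method: real LLMLingua scores tokens with a well-trained small causal language model, whereas the "importance" function below is a made-up surrogate (rarer, longer words score higher) so the example stays self-contained.

```python
# Toy sketch of LLMLingua-style coarse-to-fine prompt compression.
# Stage 1 drops low-information sentences under a sentence budget;
# stage 2 drops low-information tokens within each kept sentence.
from collections import Counter

def importance(token, counts):
    # Hypothetical stand-in for a small LM's per-token surprisal:
    # infrequent, longer tokens are treated as more informative.
    return len(token) / counts[token]

def compress(prompt, sentence_budget=0.75, token_budget=0.5):
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    counts = Counter(t for s in sentences for t in s.split())

    # Stage 1 (coarse): keep the most informative sentences, in order.
    ranked_sents = sorted(
        sentences,
        key=lambda s: -sum(importance(t, counts) for t in s.split()))
    keep_n = max(1, int(len(sentences) * sentence_budget))
    kept = [s for s in sentences if s in ranked_sents[:keep_n]]

    # Stage 2 (fine): within each kept sentence, keep only the
    # highest-importance tokens until the token budget is met.
    out = []
    for s in kept:
        toks = s.split()
        ranked = sorted(toks, key=lambda t: -importance(t, counts))
        keep_t = set(ranked[:max(1, int(len(toks) * token_budget))])
        out.append(" ".join(t for t in toks if t in keep_t))
    return ". ".join(out)

prompt = ("The quick brown fox jumps over the lazy dog. "
          "The dog sleeps. "
          "Compression removes redundant filler tokens from prompts")
print(compress(prompt))
```

The two budgets play the role of the budget controller: they decide how much of the overall compression ratio is spent at the sentence level versus the token level.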

To assess LLMLingua’s performance, we tested compressed prompts on four datasets (GSM8K, BBH, ShareGPT, and Arxiv-March23), encompassing ICL, reasoning, summarization, and conversation. Our approach achieved up to 20x compression while preserving the original prompt’s capabilities, particularly in ICL and reasoning. LLMLingua also significantly reduced system latency.

During our test, we used LLaMA-7B as the small language model and GPT-3.5-Turbo-0301, one of OpenAI’s LLMs, as the closed LLM. The results show that LLMLingua maintains the original reasoning, summarization, and dialogue capabilities of the prompt, even at a maximum compression ratio of 20x, as reflected in the exact match (EM) columns in Tables 1 and 2. At the same time, other compression methods failed to retain key semantic information in prompts, especially in logical reasoning details. For a more in-depth discussion of these results, refer to section 5.2 of the paper.

These are the experimental results on GSM8K and BBH using GPT-3.5-turbo, demonstrating the in-context learning and reasoning capabilities based on different methods and compression constraints. The results show that LLMLingua can achieve up to a 20x compression rate while only experiencing a 1.5-point performance loss.
Table 1. Performance of different methods at different target compression ratios on the GSM8K and BBH datasets.
These are the experimental results for ShareGPT (Conversation) and Arxiv-March23 (Summarization) using GPT-3.5-turbo, based on different methods and compression constraints. The results indicate that LLMLingua can effectively retain the semantic information from the original prompts while achieving a compression rate of 3x-9x.
Table 2. Performance of different methods at different target compression ratios for conversation and summarization tasks.

LLMLingua is robust, cost-effective, efficient, and recoverable

LLMLingua also showed impressive results across various small language models and different closed LLMs. When using GPT-2-small, LLMLingua achieved a strong performance score of 76.27 under the ¼-shot constraint, close to LLaMA-7B’s result of 77.33 and surpassing the standard prompt result of 74.9. Similarly, even without aligning Claude-v1.3, one of the most powerful LLMs, LLMLingua’s score was 82.61 under the ½-shot constraint, outperforming the standard prompt result of 81.8.

LLMLingua also proved effective in shortening responses, reducing latency in the LLM’s generation process by 20 to 30 percent, as shown in Figure 2.

The figure demonstrates the relationship between the compression ratio and the number of response tokens. In different tasks, as the compression ratio increases, the response length decreases to varying extents, with a maximum reduction of 20%-30%.
Figure 2. The distribution of token lengths generated at varying compression ratios.

What makes LLMLingua even more impressive is its recoverability feature. When we used GPT-4 to restore the compressed prompts, it successfully recovered all key reasoning information from the full nine-step chain-of-thought (CoT) prompting, which enables LLMs to address problems through sequential intermediate steps. The recovered prompt was almost identical to the original, and its meaning was retained. This is shown in Tables 3 and 4.

This figure illustrates the original prompt, the compressed prompt, and the result of using GPT-4 to recover the compressed prompt. The original prompt consists of a 9-step Chain-of-Thought, and the compressed prompt is difficult for humans to understand. However, the recovered text includes all 9 steps of the Chain-of-Thought.
Table 3. Recovering the compressed prompt from GSM8K using GPT-4.
This figure shows the end-to-end latency when using LLMLingua, without using LLMLingua, and the latency when compressing prompts. As the compression ratio increases, both the LLMLingua and end-to-end latency decrease, achieving up to a 5.7x acceleration with a 10x token compression rate.
Table 4. Latency comparison on GSM8K. LLMLingua can accelerate LLMs’ end-to-end inference by a factor of 1.7–5.7x.

Enhancing the user experience and looking ahead

LLMLingua is already proving its value through practical application. It has been integrated into LlamaIndex (opens in new tab), a widely adopted retrieval-augmented generation (RAG) framework. Currently, we are collaborating with product teams to reduce the number of tokens required in LLM calls, particularly for tasks like multi-document question-answering. Here, our goal is to significantly improve the user experience with LLMs. 

For the long term, we have proposed LongLLMLingua, a prompt-compression technique designed for long-context scenarios, such as retrieval-augmented question-answering tasks in applications like chatbots, useful when information evolves dynamically over time. It’s also geared for tasks like summarizing online meetings. LongLLMLingua’s primary objective is to enhance LLMs’ ability to perceive key information, making it suitable for numerous real-world applications, notably information-based chatbots. We’re hopeful that this innovation paves the way for more sophisticated and user-friendly interactions with LLMs.

Learn more about our work on the LLMLingua (opens in new tab) page.

The post LLMLingua: Innovating LLM efficiency with prompt compression appeared first on Microsoft Research.

Abstracts: December 6, 2023

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Xing Xie, a Senior Principal Research Manager at Microsoft Research, joins host Gretchen Huizinga to discuss “Evaluating General-Purpose AI with Psychometrics.” As AI capabilities move from task specific to more general purpose, the paper explores psychometrics, a subfield of psychology, as an alternative to traditional methods for evaluating model performance and for supporting consistent and reliable systems.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Xing Xie, a Senior Principal Research Manager at Microsoft Research. Dr. Xie is coauthor of a vision paper on large language models called “Evaluating General-Purpose AI with Psychometrics,” and you can find a preprint of this paper now on arXiv. Xing Xie, thank you for joining us on Abstracts!

XING XIE: Yes, thank you. It’s my pleasure to be here. 

HUIZINGA: So in a couple sentences, tell us what issue or problem your research addresses and why people should care about it. 


XIE: Yeah, in a sense, actually, we are exploring the potential of psychometrics to revolutionize how we evaluate general-purpose AI. Because AI is advancing at a very rapid pace, traditional evaluation methods face significant challenges, especially when it comes to predicting a model’s performance in unfamiliar scenarios. And this method also lacks a robust mechanism to assess their own quality. Additionally, we, in this paper, we delve into the complexity of directly applying psychometrics to this domain and underscore several promising directions for future research. We believe that this research is of great importance. As AI continues to be integrated into novel application scenarios, it could have significant implications for both individuals and society at large. It’s crucial that we ensure their performance is both consistent and reliable.

HUIZINGA: OK, so I’m going to drill in a little bit in case there’s people in our audience that don’t understand what psychometrics is. Could you explain that a little bit for the audience? 

XIE: Yeah, psychometrics could be considered as a subdomain of psychology. Basically, psychology just studies everything about humans, but psychometrics is specifically developed to study how we can better evaluate, we could also call this general-purpose intelligence, but it’s human intelligence. So there are, actually, a lot of methodologies and approaches in how we develop this kind of test and what tasks we need to carry out. The previous AI is designed for specific tasks like machine translation, like summarization. But now I think people are already aware of many progress in big models, in large language models. AI, actually, currently can be considered as some kind of solving general-purpose tasks. Sometimes we call it few-shot learning, or sometimes we call it like zero-shot learning. We don’t need to train a model before we bring new tasks to them. So this brings a question in how we evaluate this kind of general-purpose AI, because traditionally, we evaluate AI usually using some specific benchmark, specific dataset, and specific tasks. This seems to be unsuitable to this new general-purpose AI. 

HUIZINGA: So how does your approach build on and/or differ from what’s been done previously in this field? 

XIE: Yeah, we actually see a lot of efforts have been investigated into evaluating the performance of these new large language models. But we see a significant portion of these evaluations are task specific. They’re still task specific. And also, frankly speaking, they are easily affected by changes. That means even slight alterations to a test could lead to substantial drops in performance. So our methodology differs from these approaches in that rather than solely testing how AI performs on those predetermined tasks, we actually are evaluating those latent constructs because we believe that pinpointing these latent constructs is very important.

HUIZINGA: Yeah. 

XIE: It’s important in forecasting AI’s performance in evolving and unfamiliar contexts. We can use an example like game design. With humans, even if an individual has never worked on game design—it’s just a whole new task for her—we might still confidently infer their potential if we know they possess the essential latent constructs, or abilities, which are important for game design. For example, creativity, critical thinking, and communication. 

HUIZINGA: So this is a vision paper and you’re making a case for using psychometrics as opposed to regular traditional benchmarks for assessing AI. So would you say there was a methodology involved in this as a research paper, and if so, how did you conduct the research for this? What was the overview of it? 

XIE: As you said, this is a vision paper. So instead of describing a specific methodology, we are collaborating with several experienced psychometrics researchers. Collectively, we explore the feasibility of integrating psychometrics into AI evaluation and discerning which concepts are viable and which are not. In February this year, we hosted a workshop on this topic. Over the past months, we have engaged in, in numerous discussions, and the outcome of these discussions is articulated in this paper. And additionally, actually, we are also in the middle of drafting another paper; that paper will apply insights from this paper to devise a rigorous methodology for assessing the latent capability of the most cutting-edge language models. 

HUIZINGA: When you do a regular research paper, you have findings. And when you did this paper and you workshopped it, what did you come away with in terms of the possibilities for what you might do on assessing AI with psychometrics? What were your major findings? 

XIE: Yeah, our major findings can be divided into two areas. First, we underscore the significant potential of psychometrics. This includes exploring how these metrics can be utilized to enhance predictive accuracy and guarantee test quality. Second, we also draw attention to the new challenges that arise when directly applying these principles to AI. For instance, test results could be misinterpreted, as assumptions verified for human tests might not necessarily apply to AI. Furthermore, capabilities that are essential for humans may not hold the same importance for AI.

HUIZINGA: Hmm …  

XIE: Another notable challenge is the lack of a consistent and defined population of AI, especially considering their rapid evolution. But this population is essential for traditional psychometrics, and we need to have a population of humans for verifying either the reliability or the validity of a test. But for AI, this becomes a challenge. 

HUIZINGA: Based on those findings, how do you think your work is significant in terms of real-world impact at this point? 

XIE: We believe that our approach will signal the start of a new era in the evaluation of general-purpose AI, shifting from earlier, task-specific methodologies to a more rigorous scientific method. Fundamentally, there’s an urgent demand to establish a dedicated research domain focusing solely on AI evaluation. We believe psychometrics will be at the heart of this domain. Given AI’s expanding role in society and its growing significance as an indispensable assistant, this evolution will be crucial. I think one missing part of current AI evaluation is how we can make sure the test, the benchmark, or these evaluation methods of AI themselves, is scientific. Actually, previously, I used the example of game design. Suppose in the future, I think there are a lot of people discussing language model agents, AI agents … they could be used to not only write in code but also develop software by collaborating among different agents. Then what kind of capabilities, or we call them latent constructs, of these AI models they should have before they make success in game design or any other software development. For example, like creativity, critical thinking, communication. Because this could be important when there are multiple AI models—they communicate with each other, they check the result of the output of other models. 

HUIZINGA: Are there other areas that you could say, hey, this would be a relevant application of having AI evaluated with psychometrics instead of the regular benchmarks because of the generality of intelligence?

XIE: We are mostly interested in maybe doing research, because a lot of researchers have started to leverage AI for their own research. For example, not only for writing papers, not only for generating some ideas, but maybe they could use AI models for more tasks in the whole pipeline of research. So this may require AI to have some underlying capabilities, like, as we have said, like critical thinking—how AI should define the new ideas and how they check whether these ideas are feasible and how they propose creative solutions and how they work together on research. This could be another domain. 

HUIZINGA: So if there was one thing that you want our listeners to take away from this work, what would it be? 

XIE: Yeah, I think the one takeaway I want to say is we should be aware of the vital importance of AI evaluation. We are still far from achieving a truly scientific standard, so we need to still work hard to get that done. 

HUIZINGA: Finally, what unanswered questions or unsolved problems remain in this area? What’s next on your research agenda that you’re working on? 

XIE: Yeah, actually, there are a lot of unanswered questions as highlighted at the later part of this paper. Ultimately, our goal is to adapt psychometric theories and the techniques to fit AI contexts. So we have discussed with our collaborators in both AI and psychometrics … some examples would be, how can we develop guidelines, extended theories, and techniques to ensure a rigorous evaluation that prevents misinterpretation? And how can we best evaluate assistant AI and the dynamics of AI-human teaming? This actually is particularly proposed by one of our collaborators in the psychometrics domain. And how do we evaluate the value of general-purpose AI and ensure their alignment with human objectives? And then how can we employ semiautomatic methods to develop psychometric tests, theories, and techniques with the help of general-purpose AI? That means we use AI to solve these problems by themselves. This is also important because, you know, psychometrics or psychology have developed for hundreds, or maybe thousands, of years to come to all the techniques today. But can we shorten that period? Can we leverage AI to speed up this development? 

HUIZINGA: Would you say there’s wide agreement in the AI community that this is a necessary direction to head?

XIE: This is only starting. I think there are several papers discussing how we can apply some part of psychology or some part of psychometrics to AI. But there is no systematic discussion or thinking along this line. So I, I don’t think there is agreement, but there’s already initial thoughts and initial perspectives showing in the academic community. 

[MUSIC PLAYS]

HUIZINGA: Well, Xing Xie, thanks for joining us today, and to our listeners, thank you for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts (opens in new tab), or you can find a preprint of the paper on arXiv. See you next time on Abstracts!

The post Abstracts: December 6, 2023 appeared first on Microsoft Research.

Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow

These research papers were presented at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (opens in new tab) (ESEC/FSE 2023), a premier conference in the field of software engineering.

ESEC/FSE 2023
Two papers on a blue/green gradient: InferFix and AdaptivePaste

The practice of software development inevitably involves the challenge of handling bugs and various coding irregularities. These issues can become pronounced when developers engage in the common practice of copying and pasting code snippets from the web or other peer projects. While this approach might offer a quick solution, it can introduce a host of potential complications, including compilation issues, bugs, and even security vulnerabilities into the developer’s codebase.

To address this, researchers at Microsoft have been working to advance different aspects of the software development lifecycle, from code adaptation to automated bug detection and repair. At ESEC/FSE 2023 (opens in new tab), we introduced two techniques aimed at enhancing coding efficiency. AdaptivePaste utilizes a learning-based approach to adapt and refine pasted code snippets in an integrated development environment (IDE). InferFix is an end-to-end program repair framework designed to automate bug detection and resolution. This blog outlines these technologies.


AdaptivePaste: Intelligent copy-paste in IDE

A widespread practice among developers involves adapting pasted code snippets to specific use cases. However, current code analysis and completion techniques, such as masked language modeling and CodeT5, do not achieve an acceptable level of accuracy in identifying and adapting variable identifiers within these snippets to align them with the surrounding code. In the paper, “AdaptivePaste: Intelligent Copy-Paste in IDE,” we propose a learning-based approach to source code adaptation, aiming to capture meaningful representations of variable usage patterns. First, we introduce a specialized dataflow-aware de-obfuscation pretraining objective for pasted code snippet adaptation. Next, we introduce a transformer-based model in two variants: a traditional unidecoder and a parallel-decoder model with tied weights.

Diagram depicting AdaptivePaste architecture. Starting with a program with a pasted code snippet, AdaptivePaste extracts and prioritizes syntax hierarchies most relevant for the learning task, analyzes the data flow, and then anonymizes the pasted code. The resulting program serves as input for the neural model. The output is serialized as a sequence of tokens.
Figure 1. AdaptivePaste architecture. For a program with a pasted code snippet, AdaptivePaste extracts and prioritizes syntax hierarchies most relevant for the learning task, analyzes the data flow, and anonymizes variable identifiers in the pasted code snippet. The resulting program serves as input for the neural model. The output is serialized as a sequence of tokens.

The unidecoder follows a standard autoregressive decoder formulation, mapping each variable in the pasted snippet to a unique symbol in the context or declaring a new variable. The parallel decoder duplicates the decoder for each anonymized symbol in the anonymized pasted snippet, predicting names independently and factorizing the output distribution per symbol. This enables selective code snippet adaptation by surfacing model predictions above a specified threshold and outputting “holes” where uncertainty exists.
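The thresholding behavior of the parallel decoder can be sketched in a few lines. This is a toy, not the model itself: the per-symbol name distributions below are made-up stand-ins for the decoder's factorized output, and the `<HOLE>` marker is an illustrative placeholder for the "holes" the paper describes.

```python
# Toy sketch of selective code adaptation: each anonymized symbol
# (VAR_0, VAR_1, ...) gets an independent name distribution; the top
# prediction is surfaced only if its confidence clears a threshold,
# otherwise a hole is emitted for the developer to fill.
def adapt(snippet, predictions, threshold=0.8):
    out = snippet
    for symbol, dist in predictions.items():
        name, prob = max(dist.items(), key=lambda kv: kv[1])
        out = out.replace(symbol, name if prob >= threshold else "<HOLE>")
    return out

snippet = "VAR_0 = open(VAR_1)\nprint(VAR_0.read())"
predictions = {
    "VAR_0": {"fh": 0.93, "f": 0.05},           # confident prediction
    "VAR_1": {"path": 0.55, "filename": 0.40},  # uncertain -> hole
}
print(adapt(snippet, predictions))
```

Raising the threshold trades recall for precision, which is how the paper's selective setting reaches higher exact-match precision than the unconditional one.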

To establish a dataflow-aware de-obfuscation pretraining objective for pasted code snippet adaptation, we assigned mask symbols to variable identifiers at the granularity of whole code tokens. The pre-existing code context was unanonymized, allowing the model to attend to existing identifier names defined in scope.
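A simplified version of this masking step can be shown with Python's own tokenizer. Note the caveats: the real objective is dataflow-aware and masks only variable identifiers in the pasted snippet, while this sketch crudely masks every plain identifier that is not a keyword or builtin.

```python
# Toy identifier anonymization at whole-token granularity: each
# distinct identifier is replaced by VAR_0, VAR_1, ... in order of
# first appearance; keywords and builtins are left untouched.
import builtins
import io
import keyword
import tokenize

def anonymize(snippet):
    mapping, out = {}, []
    known = set(keyword.kwlist) | set(dir(builtins))
    for tok in tokenize.generate_tokens(io.StringIO(snippet).readline):
        if tok.type == tokenize.NAME and tok.string not in known:
            sym = mapping.setdefault(tok.string, f"VAR_{len(mapping)}")
            out.append((tok.type, sym))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out), mapping

code = "total = 0\nfor price in prices:\n    total = total + price\n"
masked, mapping = anonymize(code)
print(masked)
print(mapping)
```

The pretraining task is then to invert this mapping: given the masked snippet plus the unanonymized surrounding context, recover plausible names for each `VAR_i`.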

Our evaluation of AdaptivePaste showed promising results. It successfully adapted Python source code snippets with 67.8 percent exact match accuracy. When we analyzed the impact of confidence thresholds on model predictions, we observed that the parallel decoder transformer model improves precision to 85.9 percent in a selective code adaptation setting.

InferFix: End-to-end program repair with LLMs

Addressing software defects accounts for a significant portion of development costs. To tackle this, the paper, “InferFix: End-to-End Program Repair with LLMs over Retrieval-Augmented Prompts,” introduces a program repair framework that combines the capabilities of a state-of-the-art static analyzer called Infer, a semantic retriever model called Retriever, and a transformer-based model called Generator to address crucial security and performance bugs in Java and C#.

The Infer static analyzer is used to reliably detect, classify, and locate critical bugs within complex systems through formal verification. The Retriever uses a transformer encoder model to search for semantically equivalent bugs and corresponding fixes in large datasets of known bugs. It’s trained using a contrastive learning objective to excel at finding relevant examples of the same bug type.
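The retrieval step itself reduces to nearest-neighbor search in the encoder's embedding space. The sketch below illustrates that mechanic only; the three-dimensional embeddings and fix descriptions are invented stand-ins for the contrastively trained encoder's actual output.

```python
# Toy sketch of Retriever-style lookup: rank stored (embedding, fix)
# pairs by cosine similarity to the query bug's embedding and return
# the top-k fixes to include in the Generator's prompt.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, memory, k=2):
    # memory: list of (embedding, fix_text) pairs
    ranked = sorted(memory, key=lambda m: -cosine(query_vec, m[0]))
    return [fix for _, fix in ranked[:k]]

memory = [
    ([1.0, 0.0, 0.2], "add null check before dereference"),
    ([0.9, 0.1, 0.3], "guard against None before attribute access"),
    ([0.0, 1.0, 0.0], "close file handle in finally block"),
]
query = [0.95, 0.05, 0.25]  # hypothetical embedding of the new bug
print(retrieve(query, memory))
```

Contrastive training is what makes this search meaningful: it pulls embeddings of same-type bugs together and pushes different types apart, so cosine neighbors tend to share a fix pattern.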

The Generator employs a 12-billion-parameter Codex model, fine-tuned on supervised bug-fix data. To enhance its performance, the prompts provided to the Generator are augmented with bug type annotations, bug contextual information, and semantically similar fixes retrieved from an external nonparametric memory by the Retriever. The Generator then produces a candidate fix for the bug.
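Putting the pieces together, the augmented prompt might be assembled along these lines. The template and field names here are purely illustrative, not the paper's actual prompt format.

```python
# Hypothetical sketch of InferFix-style prompt assembly: combine the
# analyzer's bug type and location, retrieved similar fixes, and the
# buggy code into one prompt for the fine-tuned Generator.
def build_prompt(bug_type, location, buggy_code, similar_fixes):
    examples = "\n".join(f"- {fix}" for fix in similar_fixes)
    return (f"Bug type: {bug_type}\n"
            f"Location: {location}\n"
            f"Similar fixes:\n{examples}\n"
            f"Buggy code:\n{buggy_code}\n"
            f"Fixed code:")

prompt = build_prompt(
    bug_type="NULL_DEREFERENCE",
    location="Checkout.java:42",
    buggy_code="order.getItems().clear();",
    similar_fixes=["add a null check before calling clear()"],
)
print(prompt)
```

The Generator completes the prompt after "Fixed code:", and the candidate patch is then validated before being surfaced to the developer.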

Diagram depicting the InferFix approach workflow. Starting with a Pull Request, the Infer Static Analyzer conducts bug detection, classification, and localization. Subsequently, Context Extraction gathers pertinent details of the bugs and the surrounding context, and then Retriever identifies semantically similar bugs. The process concludes with the LLM Generator proposing a fix based on the generated prompt.
Figure 2: The InferFix workflow. An error-prone code modification is detected by the Infer static analyzer, which is used to craft a prompt with bug type annotation, location information, relevant syntax hierarchies, and similar fixes identified by the Retriever. The large language model (LLM) Generator provides a candidate fix to the developer.

To test InferFix, we curated a dataset called InferredBugs (opens in new tab), which is rich in metadata and comprises bugs identified through executing the Infer static analyzer on thousands of Java and C# repositories. The results are noteworthy. InferFix outperforms strong LLM baselines, achieving a top-1 accuracy of 65.6 percent in C# and an impressive 76.8 percent in Java on the InferredBugs dataset.

Looking ahead

With AdaptivePaste and InferFix, we hope to significantly streamline the coding process, minimizing errors and enhancing efficiency. This includes reducing the introduction of bugs when code snippets are added and providing automated bug detection, classification, and patch validation. We believe that these tools hold promise for an enhanced software development workflow, leading to reduced costs and an overall boost in project efficiency.

Looking ahead, the rapid advancement of LLMs like GPT-3.5 and GPT-4 has sparked our interest in exploring ways to harness their potential in bug management through prompt engineering and other methods. Our goal is to empower developers by streamlining the bug detection and repair process, facilitating a more robust and efficient development environment.

The post Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow appeared first on Microsoft Research.
