AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad

AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad

photo of Ida Momennejad for the AI Frontiers Microsoft Research Podcast series

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come. 

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity. 

This episode features Principal Researcher Ida Momennejad. Momennejad is applying her expertise in cognitive neuroscience and computer science to better understand—and extend—AI capabilities, particularly when it comes to multistep reasoning and short- and long-term planning. Llorens and Momennejad discuss the notion of general intelligence in both humans and machines; how Momennejad and colleagues leveraged prior research into the cognition of people and rats to create prompts for evaluating large language models; and the case for the development of a “prefrontal cortex” for AI.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. In this podcast series, I share conversations with fellow researchers about the latest developments in AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Ida Momennejad. Ida works at Microsoft Research in New York City at the intersection of machine learning and human cognition and behavior. Her current work focuses on building and evaluating multi-agent AI architectures, drawing from her background in both computer science and cognitive neuroscience. Over the past decade, she has focused on studying how humans and AI agents build and use models of their environment.


[MUSIC FADES]

Let’s dive right in. We are undergoing a paradigm shift where AI models and systems are starting to exhibit characteristics that I and, of course, many others have described as more general intelligence. When I say general in this context, I think I mean systems with abilities like reasoning and problem-solving that can be applied to many different tasks, even tasks they were not explicitly trained to perform. Despite all of this, I think it’s also important to admit that we—and by we here, I mean humanity—are not very good at measuring general intelligence, especially in machines. So I’m excited to dig further into this topic with you today, especially given your background and insights into both human and machine intelligence. And so I just want to start here: for you, Ida, what is general intelligence?

IDA MOMENNEJAD: Thank you for asking that. We could look at general intelligence from the perspective of history of cognitive science and neuroscience. And in doing so, I’d like to mention its discontents, as well. There was a time where general intelligence was introduced as the idea of a kind of intelligence that was separate from what you knew or the knowledge that you had on a particular topic. It was this general capacity to acquire different types of knowledge and reason over different things. And this was at some point known as g, and it’s still known as g. There have been many different kinds of critiques of this concept because some people said that it’s very much focused on the idea of logic and a particular kind of reasoning. Some people made cultural critiques of it. They said it’s very Western oriented. Others said it’s very individualistic. It doesn’t consider collective or interpersonal intelligence or physical intelligence. There are many critiques of it. But at the core of it, there might be something useful and helpful. And I think the useful part is that there could be some general ability in humans, at least the way that g was intended initially, where they can learn many different things and reason over many different domains, and they can transfer ability to reason over a particular domain to another. And then in the AGI, or artificial general intelligence, notion of it, people took this idea of many different abilities or skills for cognitive and reasoning and logic problem-solving at once. There have been different iterations of what this means in different times. In principle, the concept in itself does not provide the criteria on its own. Different people at different times provide different criteria for what would be the artificial general intelligence notion. Some people say that they have achieved it. Some people say we are on the brink of achieving it. Some people say we will never achieve it. However, there is this idea, if you look at it from an evolutionary and neuroscience and cognitive neuroscience lens, that in evolution, intelligence has evolved multiple times in a way that is adaptive to the environment. So there were organisms that needed to be adaptive to the environment where they were, that intelligence has evolved in multiple different species, so there’s not one solution to it, and it depends on the ecological niche that that particular species needed to adapt to and survive in. And it’s very much related to the idea of being adaptive of certain kinds of, different kinds of problem-solving that are specific to that particular ecology. There is also this other idea that there is no free lunch and the no-free-lunch theorem, that you cannot have one particular machine learning solution that can solve everything. So the idea of general artificial intelligence in terms of an approach that can solve everything and there is one end-to-end training that can be useful to solve every possible problem that it has never seen before seems a little bit untenable to me, at least at this point. What does seem tenable to me in terms of general intelligence is if we understand and study, the same way that we can do it in nature, the foundational components of reasoning, of intelligence, of different particular types of intelligence, of different particular skills—whether it has to do with cultural accumulation of written reasoning and intelligence skills, whether it has to do with logic, whether it has to do with planning—and then working on the particular types of artificial agents that are capable of putting these particular foundational building blocks together in order to solve problems they’ve never seen before. A little bit like putting Lego pieces together. So to wrap it up, to sum up what I just said, the idea of general intelligence had a more limited meaning in cognitive science, referring to human ability to have multiple different types of skills for problem-solving and reasoning. Later on, it was also, of course, criticized in terms of the specificity of it and ignoring different kinds of intelligence. In AI, this notion has been having many different kinds of meanings. If we just mean it’s a kind of a toolbox of general kinds of intelligence for something that can be akin to an assistant to a human, that could make sense. But if we go too far and use it in the kind of absolute notion of general intelligence, as it has to encompass all kinds of intelligence possible, that might be untenable. And also perhaps we shouldn’t think about it in terms of a lump of one end-to-end system that can get all of it down. Perhaps we can think about it in terms of understanding the different components that we have also seen emerge in evolution in different species. Some of them are robust across many different species. Some of them are more specific to some species with a specific ecological niche or specific problems to solve. But I think perhaps it could be more helpful to find those cognitive and other interpersonal, cultural, different notions of intelligence; break them down into their foundational building blocks; and then see how a particular artificial intelligence agent can bring together different skills from this kind of a library of intelligence skills in order to solve problems it’s never seen before.

LLORENS: There are two concepts that jump out at me based on what you said. One is artificial general intelligence and the other is humanlike intelligence or human-level intelligence. And you’ve referenced the fact that, you know, oftentimes, we equate the two or at least it’s not clear sometimes how the two relate to each other. Certainly, human intelligence has been an important inspiration for what we’ve done—a lot of what we’ve done—in AI and, in many cases, a kind of evaluation target in terms of how we measure progress or performance. But I wonder if we could just back up a minute. Artificial general intelligence and humanlike, human-level intelligence—how do these two concepts relate to you?

MOMENNEJAD: Great question. I like that you asked to me because I think it would be different for different people. I’ve written about this, in fact. I think humanlike intelligence or human-level intelligence would require performance that is similar to humans, at least behaviorally, not just in terms of what the agent gets right, but also in terms of the kinds of mistakes and biases that the agent might have. It should look like human intelligence. For instance, humans show primacy bias, recency bias, variety of biases. And this seems like it’s unhelpful in a lot of situations. But in some situations, it helps to come with fast and frugal solutions on the go. It helps to summarize certain things or make inferences really fast that can help in human intelligence, for instance. There is analogical reasoning. That is, there are different types of intelligence that humans do. Now, if you look at what are tasks that are difficult and what are tasks that are easier for humans and compare that to a, for instance, let’s say just a large language model like GPT-4, you will see whether they find similar things simple and similar things difficult or not. When they don’t find similar things easy or difficult, I think that we should not say that this is humanlike per se, unless we mean for a specific task. Perhaps on specific sets of tasks, an agent can be, can have human-level or humanlike intelligent behavior; however, if we look overall, as long as there are particular skills that are more or less difficult for one or the other, it might be not reasonable to compare them. That being said, there are many things that some AI agent and even a [programming] language would be better [than] humans at. Does that mean that they are generally more intelligent? No, it doesn’t because there are also many things that humans are far better than AI at. The second component of this is the mechanisms by which humans do the intelligent things that we do. We are very energy efficient. With very little amount of energy consumption, we can solve very complicated problems. If you put some of us next to each other or at least give a pen and paper to one of us, this can be even a lot more effective; however, the amount of energy consumption that it takes in order for any machine to solve similar problems is a lot higher. So another difference between humanlike intelligence or biologically inspired intelligence and the kind of intelligence that is in silico is efficiency, energy efficiency in general. And finally, the amount of data that goes into current state-of-[the-art] AI versus perhaps the amount of data that a human might need to learn new tasks or acquire new skills seem to be also different. So it seems like there are a number of different approaches to comparing human and machine intelligence and deriving what are the criteria for a machine intelligence to be more humanlike. But other than the conceptual aspect of it, it’s not clear that we necessarily want something that’s entirely humanlike. Perhaps we want in some tasks and in some particular use cases for the agent to be humanlike but not in everything.

LLORENS: You mentioned some of the ways in which human intelligence is inferior or has weaknesses. You mentioned some of the weaknesses of human intelligence, like recency bias. What are some of the weaknesses of artificial intelligence, especially frontier systems today? You’ve recently published some works that have gotten into new paradigms for evaluation, and you’ve explored some of these weaknesses. And so can you tell us more about that work and about your view on this?

MOMENNEJAD: Certainly. So inspired by a very long-standing tradition of evaluating cognitive capacities—those Lego pieces that bring together intelligence that I was mentioning in humans and animals—I have conducted a number of experiments, first in humans, and built reinforcement learning models over the past more than a decade on the idea of multistep reasoning and planning. It is in the general domain of reasoning, planning, and decision making. And I particularly focused on what kind of memory representations allow brains and reinforcement learning models inspired by human brain and behavior to be able to predict the future and plan the future and reason over the past and the future seamlessly using the same representations. Inspired by the same research that goes back in tradition to Edward Tolman’s idea of cognitive maps and latent learning in the early 20th century, culminating in his very influential 1948 paper, “Cognitive maps in rats and men,” I sat down with a couple of colleagues last year—exactly this time, probably—and we worked on figuring out if we can devise similar experiments to that in order to test cognitive maps and planning and multistep reasoning abilities in large language models. So I first turned some of the experiments that I had conducted in humans and some of the experiments that were done by Edward Tolman on the topic in rodents and turned them into prompts for ChatGPT. That’s where I started, with GPT-4. The reason I did that was that I wanted to make sure that I will create some prompts that have not been in the training set. My experiments, although the papers have been published, the stimuli of the experiments were not linguistic. They were visual sequences that the human would see, and they would have to have some reinforcement learning and learn from the sequences to make inferences about relationships between different states and find what is the path that would give them optimal rewards. Very simple human reinforcement learning paradigms, however, with different kind of structures. The inspirations that I had drawn from the cognitive maps works by Edward Tolman and others was in this idea that in order for a creature, whether it’s a rodent, a human, or a machine, to be able to reason in [multiple] steps, plan, and have cognitive maps, which is simply a representation of the relational structure of the environment, in order for a creature to have these abilities or these capacities, it means that the creature needs to be sensitive and adaptive to local changes in the environment. So I designed the, sort of, the initial prompts and recruited a number of very smart and generous-with-their-time colleagues, who … we sat together and created these prompts in different domains. For instance, we also created social prompts. We also created the same kind of graph structures but for reasoning over social structures. For instance, I say, Ashley’s friends with Matt. Matt is friends with Michael. If I want to pass a message to Michael, what is the path that I can choose? Which would be, I have to tell Ashley. Ashley will tell Matt. Matt will tell Michael. This is very similar to another paradigm which was more like a maze, which would be similar to saying, there is a castle; it has 16 rooms. You enter Room 1. You open the door. It opens to Room 2. In Room 2, you open the door, and so on and so forth. So you describe, using language, the structure of a social environment or the structure of a spatial environment, and then you ask certain questions that have to do with getting from A to B in this social or spatial environment from the LLM, or you say, oh, you know, Matt and Michael don’t talk to each other anymore. So now in order to pass a message, what should I do? So I need to find a detour. Or, for instance, I say, you know, Ashley has become close to Michael now. So now I have a shortcut, so I can directly give the message to Ashley, and Ashley can directly give the message to Michael. My path to Michael is shorter now. So finding things like detours, shortcuts, or if the reward location changes, these are the kinds of changes that, inspired by my own past work and inspired by the work of Tolman and others, we implemented in all of our experiments. This led to 15 different tasks for every single graph, and we have six graphs total of different complexity levels with different graph theoretic features, and [for] each of them, we had three domains. We had a spatial domain that was with rooms that had orders like Room 1, Room 2, Room 3; a spatial domain that there was no number, there was no ordinal order to the rooms; and a social environment where it was the names of different people and so the reasoning was over social, sort of, spaces. So you can see this is a very large number of tasks. It’s 6 times 15 times 3, and each of the prompts we ran 30 times for different temperatures. Three temperatures: 0, 0.5, and 1. And for those who are not familiar with this, a temperature of a large language model determines how random it will be or how much it will stick to the first or the best option that comes to it at the last layer. And so when there are some problems that may be the first obvious answer that it finds are not good, perhaps increasing the temperature could help, or perhaps a problem that needs precision, increasing the temperature would make it worse. So based on these ideas, we also tried it for different temperatures. And we tested eight different language models like this in order to systematically evaluate their ability for this multistep reasoning and planning, and the framework that we use—we call it CogEval—and CogEval is a framework that’s not just for reasoning and multistep planning. Other tasks can be used in this framework in order to be tested, as well. And the first step of it is always to operationalize the cognitive capacity in terms of many different tasks like I just mentioned. And then the second task is designing the specific experiments with different domains like spatial and social; with different structures, like the graphs that I told you; and with different kind of repetitions and with different tasks, like the detour, shortcut, and the reward revaluation, transition revaluation, and just traversal, all the different tasks that I mentioned. And then the third step is to generate many prompts and then test them with many repetitions using different temperatures. Why is that? I think something that Sam Altman had said is relevant here, which is sometimes with some problems, you ask GPT-4 a hundred times, and one out of those hundred, it would give the correct answer. Sometimes 30 out of a hundred, it will give the correct answer. You obviously want it to give hundred out of hundred the correct answer. But we didn’t want to rely on just one try and miss the opportunity to see whether it could give the answer if you probed it again[1]. And in all of the eight large language models, we saw that none of the large language models was robust to the graph structure. Meaning, its performance got really worse as soon as the graph structure, [which] didn’t even have many nodes but just had a tree structure that was six or seven nodes, or a six- or seven-node tree was much more difficult for it to solve than a graph that had 15 nodes but had a simpler structure that was just two lines. We noted that sometimes, counterintuitively, some graph structures that you think should be easy to solve were more difficult for them. On the other hand, they were not robust to the task set. So the specific task that we tried, whether it was detour, shortcut, or it was reward revaluation or traversal, it mattered. For instance, shortcut and detour were very difficult for all of them. Another thing that we noticed was that all of them, including GPT-4, hallucinated paths that didn’t exist. For instance, there was no door between Room 12 and Room 16. They would hallucinate that there is a door, and they would give a response that includes that door. Another kind of failure mode that we observed was that they would fail to even find a one-step path. Let’s say between Room 7 and 8, there is a direct door. We would say, what is the path from 7 and 8? And they would take a longer path to go from it. And a final mode that we observed was that they would sometimes fall in loops. Even though we would directly ask them to find the shortest path, they would sometimes fall into a loop on the way to getting to their destination, which obviously you shouldn’t do if you are trying to find the shortest path. That said, there is two differing notions of accuracy here. You can have satisficing, which means you get there; you just take a longer path. And there is this notion that you cannot get there because you used some imaginary path or you did something that didn’t make sense and you, sort of, gave a nonsensical response. We had both of those kinds of issues, so we had a lot of issues with giving nonsensical answers, repeating the question that we were asking, producing gibberish. So there were numerous kinds of challenges. What we did observe was that GPT-4 was far better than the other LLMs in this regard, at least at the time that we tested it; however, this is obviously on the basis of the particular kinds of tasks that we tried. In another study, we tried Tower of Hanoi, which is also a classic cognitive science approach to [testing] planning abilities and hierarchical planning abilities. And we found that GPT-4 does between zero and 10 percent in the three-disk problem and zero percent for the four-disk problem. And that is when we started to think about having more brain-inspired solutions to improve that approach. But I’m going to leave that for next.

LLORENS: So it sounds like a very extensive set of experiments across many different tasks and with many different leading AI models, and you’ve uncovered a lack of robustness across some of these different tasks. One curiosity that I have here is how would you assess the relative difficulty of these particular tasks for human beings? Would all of these be relatively easy for a person to do or not so much?

MOMENNEJAD: Great question. So I have conducted some of these experiments already and have published them before. Humans do not perform symmetrically on all these tasks, for sure; however, for instance, Tower of Hanoi is a problem that we know humans can solve. People might have seen this. It’s three little rods that are … usually, it’s a wooden structure, so you have a physical version of it, or you can have a virtual version of it, and there are different disks with different colors and sizes. There are some rules. You cannot put certain disks on top of others. So there is a particular order in which you can stack the disks. Usually what happens is that all the disks are on one side—and when I say a three-disk problem, it means you have three total disks. And there is usually a target solution that you are shown, and you’re told to get there in a particular number of moves or in a minimum number of moves without violating the rules. So in this case, the rules would be that you wouldn’t put certain disks on top of others. And based on that, you’re expected to solve the problem. And the performance of GPT-4 on Tower of Hanoi three disk is between 0 to 10 percent and on Tower of Hanoi four disks is zero percent—zero shot. With the help, it can get better. With some support, it gets better. So in this regard, it seems like Tower of Hanoi is extremely difficult for GPT-4. It doesn’t seem as difficult as it is for GPT-4 for humans. It seems for some reason, that it couldn’t even improve itself when we explained the problem even further to it and explain to it what it did wrong. Sometimes—if people want to try it out, they should—sometimes, it would argue back and say, “No, you’re wrong. I did this right.” Which was a very interesting moment for us with ChatGPT. That was the experience that we had for trying it out first without giving it, sort of, more support than that, but I can tell you what we did next, but I want to make sure that we cover your other questions. But just to wrap this part up, inspired by tasks that have been used for evaluation of cognitive capacities such as multistep reasoning and planning in humans, it is possible to evaluate cognitive capacities and skills such as multistep reasoning and planning also in large language models. And I think that’s the takeaway from this particular study and from this general cognitive science–inspired approach. And I would like to say also it is not just human tasks that are useful. Tolman’s tasks were done in rodents. A lot of people have done experiments in fruit flies, in C. elegans, in worms, in various kinds of other species that are very relevant to testing, as well. So I think there is a general possibility of testing particular intelligence skills, evaluating it, inspired by experiments and evaluation methods for humans and other biological species.

LLORENS: Let’s explore the way forward for AI from your perspective. You know, as you’ve described your recent works, it’s clear that you have, that your work is deeply informed by insights from cognitive science, insights from neuroscience, and recent works—your recent works—have called for the development, for example, of a prefrontal cortex for AI, and I understand this to be the part of the brain that facilitates executive function. How does, how does this relate to the, you know, extending the capabilities of AI, a prefrontal cortex for AI?

MOMENNEJAD: Thank you for that question. So let me start by reiterating something I said earlier, which is the brain didn’t evolve in a lump. There were different components of brains and nervous systems and neurons that evolved at different evolutionary scales. There are some parts of the brain that appear in many different species, so they’re robust across many species. And there are some parts of the brain that appear in some species that had some particular needs, some particular problems they were facing, or some ecological niche. What is, however, in common in many of them is that there seems to be some kind of a modular or multicomponent aspect to what we call higher cognitive function or what we call executive function. And so the kinds of animals that we ascribe some form of executive function of sorts to seem to have brains that have parts or modules that do different things. It doesn’t mean that they only do that. It’s not a very extreme Fodorian view of modularity. But it is the view that, broadly speaking, when, for instance, we observe patients that have damage to a particular part of their prefrontal cortex, it could be that they perform the same on an IQ test, but they have problems holding their relationship or their jobs. So there are different parts of the brain that selective damage to those areas, because of accidents or coma or such, it seems to impair specific cognitive capacities. So this is what very much inspired me. I have been investigating the prefrontal cortex for, I guess, 17 years now, [LAUGHS] which is a scary number to say. But been … basically since I started my PhD and even during my master’s thesis, I have been focused on the role of the prefrontal cortex in our ability for long-term reasoning and planning in not just this moment—long-term, open-ended reasoning and planning. Inspired by this work, I thought, OK, if I want to improve GPT-4’s performance on, let’s say, Tower of Hanoi, can we get inspired by this kind of multiple roles that different parts of the brain play in executive function, specifically different parts of the neocortex and specifically different parts of the prefrontal cortex, part of the neocortex, in humans? Can we get inspired by some of these main roles that I have studied before and ask GPT-4 to play the role of those different parts and solve different parts of the planning and reasoning problem—the multistep planning and reasoning problem—using these roles and particular rules of how to iterate over them. For instance, there is a part of the brain called anterior cingulate cortex. Among other things, it seems to be involved in monitoring for errors and signaling when there is a need to exercise more control or move from what people like to call a faster way of thinking to a slower way of thinking to solve a particular problem. And there is … so let’s call this the cognitive function of this part. Let’s call it the monitor. This is a part of the brain that monitors for when there is a need for exercising more control or changing something because there is an error maybe. There is another part of the brain and the frontal lobe that is the, for instance, dorsolateral prefrontal cortex; that one is involved in working memory and coming up with, like, simpler plans to execute. Then there is the ventromedial prefrontal cortex that is involved in the value of states and predicting what is the next state and integrating it with information from other parts of the brain to figure out what is the value. So you put all of these things together, you can basically write different algorithms that have these different components talking to each other. And we have in that paper also, written in a pseudocode style, the different algorithms that are basically akin to a tree search, in fact. So there is a part of the role … they’re part of the multicomponent or multi-agent realization of a prefrontal cortex-like GPT-4 solution. One part of it would propose a plan. The monitor would say, thanks for that; let me pass it on to the part that is evaluating what is the outcome of this and what’s the value of that, and get back to you. It evaluates there and comes back and says, you know, this is not a good plan; give me another one. And in this iteration, sometimes it takes 10 iterations; sometimes it takes 20 iterations. This kind of council of different types of roles, they come up with a solution that is solving the Tower of Hanoi problem. And we managed to bring the performance from 0 to 10 [percent] in GPT-4 to, I think, about 70—70 percent—in Tower of Hanoi three disks, and OOD, or out-of-distribution generalization, without giving any examples of a four disk, it could generalize to above 20 percent in four-disk problems. Another impressive thing that happened here—and we tested it on the CogEval and the planning tasks from the other experiment, too—was that it brought all of the, sort of, hallucinations from about 20 to 30 percent—in some cases, much higher percentages—to zero percent. So we had slow thinking; we had 30 iterations, so it took a lot longer. And if this is, you know, fast and slow thinking, this is very slow thinking. However, we had no hallucinations anymore. And hallucination in Tower of Hanoi would be making a move that is impossible. For instance, putting in a, kind of, a disk on top of another that you cannot do because you violate a rule or taking out a middle disk that you cannot pull out actually. So those would be the kinds of hallucinations in Tower of Hanoi. All of those also went to zero. And so that is one thing that we have done already, which I have been very excited about.

LLORENS: So you painted a pretty interesting—fascinating, really—picture of a multi-agent framework where different instances of an advanced model like GPT-4 would be prompted to play the roles of different parts of the brain and, kind of, work together. And so my question is a pragmatic one. How do you prompt GPT-4 to play the role of a specific part of the human brain? What does that prompt look like?

MOMENNEJAD: Great question. I can actually, well, we have all of that at the end of our paper, so I can even read some of them if that was of interest. But just a quick response to that is you can basically describe the function that you want the LLM—in this case GPT-4—to play. You can write that in simple language. You don’t have to tell it that this is inspired by the brain. It is completely sufficient to just basically provide certain sets of rules in order for it, in order to be able to do that.[2] For instance, after you provide the problem, sort of, description … let me see if I can actually read some part of this for you. For instance, you give it a problem, and you say, consider this problem. Rule 1: you can only move a number if it’s at this and that. You clarify the rules. Here are examples. Here are proposed moves. And then you say, for instance, your role is to find whether this particular number generated as a solution is accurate. In order to do that, you can call on this other function, which is the predictor and evaluator that says, OK, if I do this, what state do I end up in, and what is the value of that state? And you get that information, and then based on that information, you decide whether the proposed move for this problem is a good move or not. If it is, then you pass a message that says, all right, give me the next step of the plan. If it’s not, then you say, OK, this is not a good plan; propose another plan. And then the part of, the part that plays the role of, hey, here is the problem. Here are the rules. Propose the first towards the subgoal or find the subgoal towards this and propose the next step. And that one receives this feedback from the monitor. And monitor has asked the predictor and evaluator, hey, what happens if I do these things and what would be the value of that in order to say, hey, this is not a great idea. So in a way this becomes a very simple prefrontal cortex–inspired multi-agent system. All of them are within the same … sort of, different calls to GPT-4 but the same instance. Just, like, because we were calling it in a code, it’s just, you just call, it’s called multiple times and each time with this kind of a very simple in-context learning text that, in text, it describes, hey, here’s the kind of problem you’re going to see. Here’s the role I want you to play. And here is what other kind of rules you need to call in order to play your role here. And then it’s up to the LLM to decide how many times it’s going to call which components in order to solve the problem. We don’t decide. We can only decide, hey, cap it at 10 times, for instance, or cap it at 30 iterations and then see how it performs.

LLORENS: So, Ida, what’s next for you and your research?

MOMENNEJAD: Thank you for that. I have always been interested in understanding minds and making minds, and this has been something that I’ve wanted to do since I was a teenager. And I think that my approaches in cognitive neuroscience have really helped me to understand minds to the extent that is possible. And my understanding of how to make minds comes from basically the work that I’ve done in AI and computer science since my undergrad. What I would be interested in is—and I have learned over the years that you cannot think about the mind in general when you are trying to isolate some components and building them—is that my interest is very much in reasoning and multistep planning, especially in complex problems and very long-term problems and how they relate to memory, how the past and the future relate to one another. And so something that I would be very interested in is making more efficient types of multi-agent brain-inspired AI but also to train smaller large language models, perhaps using the process of reasoning in order to improve their reasoning abilities. Because it’s one thing to train on outcome and outcome can be inputs and outputs, and that’s the most of the training data that LLMs receive. But it’s an entirely different approach to teach the process and probe them on different parts of the process as opposed to just the input and output. So I wonder whether with that kind of an approach, which would require generating a lot of synthetic data that relates to different types of reasoning skills, whether it’s possible to teach LLMs reasoning skills, and by reasoning skills, I mean very clearly operationalized—similar to the CogEval approach—operationalized, very well-researched, specific cognitive constructs that have construct validity and then operationalizing them in terms of many tasks. And something that’s important to me is a very important idea and a part of intelligence that maybe I didn’t highlight enough in the first part is being able to transfer to tasks that they have never seen before, and they can piece together different intelligence skills or reasoning skills in order to solve them. Another thing that I have done and I will continue to do is collective intelligence. So we talked about multi-agent systems, that they are playing the roles of different parts inside one brain. But I’ve also done experiments with multiple humans and how different structures of human communication leads to better memory or problem-solving. Humans, also, we invent things; we innovate things in cultural accumulation, which requires [building] on a lot of … some people do something, I take that outcome, take another outcome, put them together, make something. Someone takes my approach and adds something to it; makes something else. So this kind of cultural accumulation, we have done some work on that with deep reinforcement learning models that share their replay buffer as a way of sharing skill with each other; however, as humans become a lot more accustomed to using LLMs and other generative AI, basically generative AI would start participating in this kind of cultural accumulation. So the notion of collective cognition, collective intelligence, and collective memory will now have to incorporate the idea of generative AI being a part of it. And so I’m also interested in different approaches to modeling that, understanding that, optimizing that, identifying in what ways it’s better.[3] We have found both in humans and in deep reinforcement learning agents, for instance, that particular structures of communication that are actually not the most energy-consuming one; it’s not all-to-all communication, but particular partially connected structures are better for innovation than others. And some other structures might be better for memory or collective memory converging with each other.[4] So I think it would be very interesting—the same way that we are looking at what kind of components talk to each other in one brain to solve certain problems—to think about what kind of structures or roles can interact with each other, in what shape and in what frequency of communication, in order to solve larger, sort of, cultural accumulation problems.

[MUSIC PLAYS]

LLORENS: Well, that’s a compelling vision. I really look forward to seeing how far you and the team can take it. And thanks for a fascinating discussion.

MOMENNEJAD: Thank you so much.

[MUSIC FADES]


[1] Momennejad notes that repetitive probing allowed she and her colleagues to report the mean and standard deviation of the accuracy over all the responses with corresponding statistics rather than merely reporting the first or the best response.

[2] Momennejad notes that a “convenient and interesting fact about these modules or components or roles is that they’re very similar to some components in reinforcement learning, like actor and critique and tree search. And people have made prefrontal cortex–inspired models in deep learning in the past. This affinity to RL makes it easier to extend this framework to realize various RL algorithms and the sorts of problems one could solve with them using LLMs. Another feature is that they don’t all solve the big problem. There’s an orchestrator that assigns subgoals and passes it on, then the actor’s input and output or the monitor or evaluator’s input and output are parts of the problem, not all of it. This makes the many calls to GPT-4 efficient and is comparable to the local view or access of heterogenous agents, echoing the classic features of a multi-agent framework.“

[3] Momennejad notes that one task she and her colleagues have used is similar to the game Little Alchemy: the players need to find elements, combine them, and create new components. There are multiple levels of hierarchy of innovation that are possible in the game; some of them combine components from different trajectories.

[4] Momennejad notes that this relates to some work she and her colleagues have done building and evaluating AI agents in multi-agent Xbox games like Bleeding Edge, as well.

The post AI Frontiers: Rethinking intelligence with Ashley Llorens and Ida Momennejad appeared first on Microsoft Research.

Read More

Abstracts: March 21, 2024

Abstracts: March 21, 2024

Microsoft Research Podcast - Abstracts hero with a microphone icon

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Researcher Chang Liu joins host Gretchen Huizinga to discuss Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” In the paper, Liu and his coauthors present M-OFDFT, a variation of orbital-free density functional theory (OFDFT). M-OFDFT leverages deep learning to help identify molecular properties in a way that minimizes the tradeoff between accuracy and efficiency, work with the potential to benefit areas such as drug discovery and materials discovery.

Transcript

[MUSIC PLAYS] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Today, I’m talking to Dr. Chang Liu, a senior researcher from Microsoft Research AI4Science. Dr. Liu is coauthor of a paper called “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” Chang Liu, thanks for joining us on Abstracts


CHANG LIU: Thank you. Thank you for this opportunity to share our work. 

HUIZINGA: So in a few sentences, tell us about the issue or problem your paper addresses and why people should care about this research. 

LIU: Sure. Since this is an AI4Science work, let’s start from this perspective. About science, people always want to understand the properties of matters, such as why some substances can cure disease and why some materials are heavy or conductive. For a very long period of time, these properties can only be studied by observation and experiments, and the outcome will just look like magic to us. If we can understand the underlying mechanism and calculate these properties on our computer, then we can do the magic ourselves, and it can, hence, accelerate industries like medicine development and material discovery. Our work aims to develop a method that handles the most fundamental part of such property calculation and with better accuracy and efficiency. If you zoom into the problem, properties of matters are determined by the properties of molecules that constitute the matter. For example, the energy of a molecule is an important property. It determines which structure it mostly takes, and the structure indicates whether it can bind to a disease-related biomolecule. You may know that molecules consist of atoms, and atoms consist of nuclei and electrons, so properties of a molecule are the result of the interaction among the nuclei and the electrons in the molecule. The nuclei can be treated as classical particles, but electrons exhibit significant quantum effect. You can imagine this like electrons move so fast that they appear like cloud or mist spreading over the space. To calculate the properties of the molecule, you need to first solve the electronic structure—that is, how the electrons spread over this space. This is governed by an equation that is hard to solve. The target of our research is hence to develop a method that solves the electronic structure more accurately and more efficiently so that properties of molecules can be calculated in a higher level of accuracy and efficiency that leads to better ways to solve the industrial problems. 

HUIZINGA: Well, most research owes a debt to work that went before but also moves the science forward. So how does your approach build on and/or differ from related research in this field? 

LIU: Yes, there are indeed quite a few methods that can solve the electronic structure, but they show a harsh tradeoff between accuracy and efficiency. Currently, density functional theory, often called DFT, achieves a preferred balance for most cases and is perhaps the most popular choice. But DFT still requires a considerable cost for large molecular systems. It has a cubic cost scaling. We hope to develop a method that scales with a milder cost increase. We noted an alternative type of method called orbital-free DFT, or called OFDFT, which has a lower order of cost scaling. But existing OFDFT methods cannot achieve satisfying accuracy on molecules. So our work leverages deep learning to achieve an accurate OFDFT method. The method can achieve the same level of accuracy as conventional DFT; meanwhile, it inherits the cost scaling of OFDFT, hence is more efficient than the conventional DFT. 

HUIZINGA: OK, so we’re moving acronyms from DFT to OFDFT, and you’ve got an acronym that goes M-OFDFT. What does that stand for? 

LIU: The M represents molecules, since it is especially hard for classical or existing OFDFT to achieve a good accuracy on molecules. So our development tackles that challenge. 

HUIZINGA: Great. And I’m eager to hear about your methodology and your findings. So let’s go there. Tell us a bit about how you conducted this research and what your methodology was. 

LIU: Yeah. Regarding methodology, let me delve into a bit into some details. We follow the formulation of OFDFT, which solves the electronic structure by optimizing the electron density, where the optimization objective is to minimize the electronic energy. The challenge in OFDFT is, part of the electronic energy, specifically the kinetic energy, is hard to calculate accurately, especially for molecular systems. Existing computation formulas are based on approximate physical models, but the approximation accuracy is not satisfying. Our method uses a deep learning model to calculate the kinetic energy. We train the model on labeled data, and by the powerful learning ability, the model can give a more accurate result. This is the general idea, but there are many technical challenges. For example, since the model is used as an optimization objective, it needs to capture the overall landscape of the function. The model cannot recover the landscape if only one labeled data point is provided. For this, we made a theoretical analysis on the data generation method and found a way to generate multiple labeled data points for each molecular structure. Moreover, we can also calculate a gradient label for each data point, which provides the slope information on the landscape. Another challenge is that the kinetic energy has a strong non-local effect, meaning that the model needs to account for the interaction between any pair of spots in space. This incurs a significant cost if using the conventional way to represent density—that is, to using a grid. For this challenge, we choose to expand the density function on a set of basis functions and use the expansion coefficients to represent the density. The benefit is that it greatly reduces the representation dimension, which in turn reduces the cost for non-local calculation. These two examples are also the differences from other deep learning OFDFT works. There are more technical designs, and you may check them in the paper. 

HUIZINGA: So talk about your findings. After you completed and analyzed what you did, what were your major takeaways or findings? 

LIU: Yeah, let’s dive into the details, into the empirical findings. We find that our deep learning OFDFT, abbreviated as M-OFDFT, is much more accurate than existing OFDFT methods with tens to hundreds times lower error and achieves the same level of accuracy as the conventional DFT. 

HUIZINGA: Wow … 

LIU: On the other hand, the speed is indeed improved over conventional DFT. For example, on a protein molecule with more than 700 atoms, our method achieves nearly 30 times speedup. The empirical cost scaling is lower than quadratic and is one order less than that of conventional DFT. So the speed advantage would be more significant on larger molecules. I’d also like to mention an interesting observation. Since our method is based on deep learning, a natural question is, how accurate would the method be if applied to much larger molecules than those used for training the deep learning model? This is the generalization challenge and is one of the major challenges of deep learning method for molecular science applications. We investigated this question in our method and found that the error increases slower than linearly with molecular size. Although this is not perfect since the error is still increasing, but it is better than using the same model to predict the property directly, which shows an error that increases faster than linearly. This somehow shows the benefits of leveraging the OFDFT framework for using a deep learning method to solve molecular tasks. 

HUIZINGA: Well, let’s talk about real-world impact for a second. You’ve got this research going on in the lab, so to speak. How does it impact real-life situations? Who does this work help the most and how? 

LIU: Since our method achieves the same level of accuracy as conventional DFT but runs faster, it could accelerate molecular property calculation and molecular dynamic simulation especially for large molecules; hence, it has the potential to accelerate solving problems such as medicine development and material discovery. Our method also shows that AI techniques can create new opportunities for other electronic structure formulations, which could inspire more methods to break the long-standing tradeoff between accuracy and efficiency in this field. 

HUIZINGA: So if there was one thing you wanted our listeners to take away, just one little nugget from your research, what would that be? 

LIU: If only for one thing, that would be we develop the method that solves molecular properties more accurately and efficiently than the current portfolio of available methods. 

HUIZINGA: So finally, Chang, what are the big unanswered questions and unsolved problems that remain in this field, and what’s next on your research agenda? 

LIU: Yeah, sure. There indeed remains problems and challenges. One remaining challenge mentioned above is the generalization to molecules much larger than those in training. Although the OFDFT method is better than directly predicting properties, there is still room to improve. One possibility is to consider the success of large language models by including more abundant data and more diverse data in training and using a large model to digest all the data. This can be costly, but it may give us a surprise. And another way we may consider is to incorporate mathematical structures of the learning target functional into the model, such as convexity, lower and upper bounds, and some invariance. And such structures could regularize the model when applied to larger systems than it has seen during training. So we have actually incorporated some such structures into the model, for example, the geometric invariance, but other mathematical properties are nontrivial to incorporate. We made some discussions in the paper, and we’ll engage working on that direction in the future. The ultimate goal underlying this technical development is to build a computational method that is fast and accurate universally so that we can simulate the molecular world of any kind. 

[MUSIC PLAYS] 

HUIZINGA: Well, Chang Liu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts. You can also read it on arXiv, or you can check out the March 2024 issue of Nature Computational Science. See you next time on Abstracts

[MUSIC FADES]

The post Abstracts: March 21, 2024 appeared first on Microsoft Research.

Read More

Research Focus: Week of March 18, 2024

Research Focus: Week of March 18, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus March 20, 2024

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

Large language models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. In a recent paper: Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning, researchers from Microsoft propose CoT-Influx, a novel approach that pushes the boundary of few-shot chain-of-Thought (CoT) learning to improve LLM mathematical reasoning.

Given that adding more concise CoT examples in the prompt can improve LLM reasoning performance, CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples. The pruner first selects as many crucial CoT examples as possible and then prunes unimportant tokens to fit the context window. A math reasoning dataset with diverse difficulty levels and reasoning steps is used to train the pruner, along with a math-specialized reinforcement learning approach. As a result, by enabling more CoT examples with double the context window size in tokens, CoT-Influx significantly outperforms various prompting baselines across various LLMs (LLaMA2-7B, 13B, 70B) and 5 math datasets, achieving up to 4.55% absolute improvements. Remarkably, without any fine-tuning, LLaMA2-70B with CoT-Influx surpasses GPT-3.5 and a wide range of larger LLMs (PaLM, Minerva 540B, etc.) on the GSM8K. CoT-Influx serves as a plug-and-play module for LLMs and is compatible with most existing reasoning prompting techniques, such as self-consistency and self-verification.

Microsoft Research Podcast

Collaborators: Holoportation™ communication technology with Spencer Fowers and Kwame Darko

Spencer Fowers and Kwame Darko break down how the technology behind Holoportation and the telecommunication device being built around it brings patients and doctors together when being in the same room isn’t an easy option and discuss the potential impact of the work.


From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions

Organizations and individuals continuously strive to enhance their efficiency, improve time management, and optimize their work processes. Rapid advancements in AI, natural language processing, and machine learning technologies create new opportunities to develop tools that boost productivity. 

In a recent paper: From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions, researchers from Microsoft present a comprehensive, user-centric approach to understand preferences in AI-based productivity agents and develop personalized solutions. The research began with a survey of 363 participants, seeking to reveal users’ specific needs and preferences for productivity agents such as relevant productivity challenges of information workers, preferred communication style and approach towards solving problems, and privacy expectations. With the survey insights, the researchers then developed a GPT-4 powered personalized productivity agent that uses telemetry data gathered from information workers via Viva Insights to provide tailored assistance. The agent’s performance was compared with alternative productivity-assistive tools, such as the traditional dashboard and AI-enabled summaries, in a study involving 40 participants. The findings highlight the importance of user-centric design, adaptability, and the balance between personalization and privacy in AI-assisted productivity tools. The insights distilled from this study could support future research to further enhance productivity solutions, ultimately leading to optimized efficiency and user experiences for information workers.


LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

The size of the context window of a large language model (LLM) determines the amount of text that can be entered for processing to generate responses. The window size is specifically measured by anumber of tokens—larger windows are more desirable. However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. 

In a recent paper: LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, researchers from Microsoft introduce a new method that extends the context window of pre-trained LLMs to an impressive 2.048 million tokens, without requiring direct fine-tuning on texts with extremely long lengths, which are scarce, while maintaining performance at the level of the original short context window. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of this method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding and can reuse most pre-existing optimizations.


Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants

Conversational interactions with large language models (LLMs) enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate responses. Conversations between developers and LLMs are primarily structured as question-answer pairs, where the developer is responsible for asking the right questions and sustaining conversations across multiple turns.  

In a recent paper: Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants, researchers from Microsoft draw inspiration from interaction patterns and conversation analysis to design Robin, an enhanced conversational AI-assistant for debugging. Robin works with the developer collaboratively, creating hypotheses about the bug’s root cause, testing them using IDE debugging features such as breakpoints and watches, and then proposing fixes. A user study with 12 industry professionals shows that equipping the LLM-driven debugging assistant to (1) leverage the insert expansion interaction pattern; (2) facilitate turn-taking; and (3) utilize debugging workflows, leads to lowered conversation barriers, effective fault localization, and 5x improvement in bug resolution rates.


Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions

Generative AI (GenAI) systems, which can produce new content based on input like code, images, speech, video, and more, offer opportunities to increase user productivity in many tasks, such as programming and writing. However, while they boost productivity in some studies, many others show that users are working ineffectively with GenAI systems and actually losing productivity. These ‘ironies of automation’ have been observed for over three decades in human factors research on automation in aviation, automated driving, and intelligence.  

In a recent paper: Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions, researchers from Microsoft draw on this extensive research alongside recent GenAI user studies to outline four key reasons for why productivity loss can occur with GenAI systems: 1) a shift in users’ roles from production to evaluation; 2) unhelpful restructuring of workflows; 3) interruptions; and 4) a tendency for automation to make easy tasks easier and hard tasks harder. We then suggest how human factors research can also inform GenAI system design to mitigate productivity loss by using approaches such as continuous feedback, system personalization, ecological interface design, task stabilization, and clear task allocation. Grounding developments in GenAI system usability in decades of research aims to ensure that the design of human-AI interactions in this rapidly moving field learns from history instead of repeating it. 

The post Research Focus: Week of March 18, 2024 appeared first on Microsoft Research.

Read More

Exploring how context, culture, and character matter in avatar research

Exploring how context, culture, and character matter in avatar research

This research paper was presented at the IEEE VR Workshop Series on Animation in Virtual and Augmented Environments (opens in new tab) (ANIVAE 2024), the premier series on 3D content creation for simulated training in extended reality.

IEEE Conference logo with the paper featured

Face-to-face communication is changing, moving beyond physical interaction to include video conferencing and AR/VR platforms, where the participants are represented by avatars. Sophisticated avatars, animated through motion tracking, can realistically portray their human counterparts, but they can also suffer from noise, such as jitter and distortion, reducing their realism. Advances in motion-capture technology aim to reduce such issues, but they come with higher development costs and require additional time due to the need for advanced components. While some noise is inevitable, it’s important to determine acceptable types and levels to efficiently develop and introduce AR/VR devices and avatars to the market. Additionally, understanding how noise impacts avatar-based communication is essential for creating more inclusive avatars that accurately represent diverse cultures and abilities, enhancing the user experience.

In our paper, “Ecological Validity and the Evaluation of Avatar Facial Animation Noise,” presented at ANIVAE 2024, we explore the challenge of evaluating avatar noise without a standardized approach. Traditional methods, which present participants with isolated facial animation noise to gauge perception thresholds, fall short of reflecting real-life avatar interactions. Our approach emphasizes ecological validity—the extent to which experiments mimic real-world conditions—as central in assessing avatar noise. We discovered this significantly influences participants’ response to avatars, highlighting the impact of context on noise perception. Our goal is to improve avatar acceptance, inclusivity, and communication by developing noise evaluation methods that better represent actual experiences. 

Seeing the big picture  

To set up our study, we animated two avatars using motion capture, as depicted in Figure 1 (A). We recorded the performance of two professional actors enacting a scene between an architect and a client discussing home renovations and examining a 3D model of the proposed design. We used two proprietary characters for the avatars, whose faces were animated with 91 expression blendshapes. This allowed for a broad range of facial expressions and subtle variations in emotions, contributing to a more realistic animation. To examine different dynamics, we created six variations of the scene, changing the characters’ gender, role, and whether they agreed on the renovation plan.

Figure 1: A. Motion capture of a social interaction scenario for the experiment. B. The motion capture was remapped to stylized avatars. C. Participants experienced the scene wearing a HoloLens 2 and responded to questions on a tablet app. D. The avatars’ facial features were degraded with different types of animation noises of varying severity.
Figure 1: A. Motion capture of a social interaction scenario for the experiment. B. The motion capture was remapped to stylized avatars. C. Participants experienced the scene wearing a HoloLens 2 and responded to questions on a tablet app. D. The avatars’ facial features were degraded with different types of animation noises of varying severity.

Fifty-six participants engaged in two experiments to evaluate the impact of noise on avatar facial animation. The first experiment had low ecological validity. Participants viewed fragmented clips of dialogue through a Microsoft HoloLens 2 device and used a slider to adjust any noise to an acceptable level. The second experiment featured high ecological validity, showing the scene in its full social context. Here, participants used a HoloLens 2 to judge the noise in facial expressions as either “appropriate” or “inappropriate” for the conversation. In contrast to the first experiment, this method considered the social aspects of context, culture, and character. 

Results indicate that noise was less distracting when participants viewed the scene in its entirety, revealing a greater tolerance for noise in high ecological validity scenarios. Isolated clips, on the other hand, led to greater annoyance with facial animation noise, suggesting the importance of social context over hyper-realistic animation. 

Cultural observations showed that noise perception was influenced by implicit cultural norms, particularly around gender roles and agreement levels. For example, in the second experiment, where participants viewed the conversation within its greater social context (high ecological validity), noise was deemed “appropriate” when the female architect agreed with the male client and “inappropriate” when she disagreed, revealing potential gender biases not observed in reversed gender roles. These findings emphasize the importance of applying high ecological validity in experiments to uncover socio-cultural influences on avatar perception. They also underscore the need to carefully consider context and cultural dynamics in avatar design. 

Finally, we explored the character trait of empathy. Participants with lower empathy scores were more critical of noise in context-rich scenarios. This indicates that experiments focusing solely on low ecological validity might overlook important insights on how empathy influences responses to avatar facial animation noise.

MICROSOFT RESEARCH PODCAST

AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens

This episode features Senior Principal Research Manager Ahmed H. Awadallah, whose work improving the efficiency of large-scale AI models and efforts to help move advancements in the space from research to practice have put him at the forefront of this new era of AI.


Avatars need to be studied in realistic situations 

When people communicate, they engage in a complex process influenced by environment, cultural background, and the nonverbal cues they perceive and interpret. By prioritizing high ecological validity in studies on avatar perception, researchers can uncover these socio-cultural influences and trust that their findings are relevant and applicable to real-life interactions within digital spaces. 

Our research examines how different combinations of demographic characteristics change the way people react to avatars, and we hope to encourage more inclusivity in avatar design. It’s essential to have an established set of guidelines to achieve this goal, and this work is one step in that direction. While our study’s scope is limited, its methodology can be applied broadly across different devices and settings.

Acknowledgements

We would like to thank Ken Jakubzak, James Clemoes, Cornelia Treptow, Michaela Porubanova, Kerry Read, Daniel McDuff, Marina Kuznetsova and Mathew Lamb for their research collaboration. We would also like to thank Shawn Bruner for providing the characters for the study and Panagiotis Giannakopoulos for leading the animation and motion capture pipelines.

The post Exploring how context, culture, and character matter in avatar research appeared first on Microsoft Research.

Read More

Scaling early detection of esophageal cancer with AI

Scaling early detection of esophageal cancer with AI

white icons of first aid kit, DNA strand, laptop monitor with overlapping eye, and microscope on a blue and green gradient background

Microsoft Research and Cyted have collaborated to build novel AI models (opens in new tab) to scale the early detection of esophageal cancer. The AI-supported methods demonstrated the same diagnostic performance as the existing manual workflow, potentially reducing the pathologist’s workload by up to 63%.

Esophageal cancer is the sixth most common cause of cancer deaths worldwide, in part because this disease is typically diagnosed late, making treatment difficult. Fewer than 1 in 5 patients survive five years after diagnosis, making early detection of this disease critical to improving a patient’s chances. One opportunity for early detection is to identify patients with a condition called Barrett’s esophagus (BE). Patients with BE are at an increased risk of developing cancer, though most never will. Chronic heartburn is a risk factor and a possible cause of Barrett’s.

Detecting BE dramatically improves a patient’s chances. Earlier detection of cancer and earlier start of treatment mean that more than 9 in 10 patients survive 5 years after diagnosis. However early detection of BE has typically involved an endoscopic biopsy, a procedure that many people find uncomfortable and invasive. It often requires sedation, is resource intensive, and increases the risk of complications.

A major step toward enabling large-scale screening for BE has been spearheaded by Cyted (opens in new tab), a start-up company at the forefront of medical innovation. Cyted has developed a capsule sponge device called EndoSign (opens in new tab)® – a dissolvable capsule on a string that expands into a small medical sponge once in the stomach. When pulled back out, it collects cells from the lining of the esophagus, which are then processed, placed on slides, stained, and scanned for digital analysis. 

The capsule sponge is easier to administer and less costly than endoscopy. But a pathologist still needs to review the digitized slides to determine the presence of any goblet cells, a type of cell normally found in the intestinal lining, which would indicate BE if found in the esophagus. These images are huge (up to 100,000 by 100,000 pixels – the size of a squash court if printed at the typical photo resolution of 300dpi) – yet may contain only a few goblet cells per image, each cell just a few pixels large. To identify BE, pathologists need to use slides from two stains, H&E (a routine stain for observing cell structure) and TFF3 (a special stain just to find goblet cells). Since most patients with heartburn will not have BE, pathologists spend most of their time examining negative cases, taking away time in which they could be prioritizing high-risk cases without more sophisticated approaches to analysis.

Microsoft Research and Cyted have collaborated to build novel AI models that can efficiently check the slides for goblet cells, using either the H&E or TFF3 stains. This joint effort has led to a Nature Communications paper titled “Enabling large-scale screening of Barrett’s esophagus using weakly supervised deep learning in histopathology (opens in new tab).” Our study uses the strength of transformer-based multiple instance learning to assist in the screening of BE. In the paper, we introduce two major innovations. First, we show that the AI models can be built solely from the pathologists’ findings on whether BE is present, eliminating the need for expensive pixel-level annotations. This means that existing large capsule sponge screening datasets can be used to further improve the performance of the model. Secondly, we demonstrate that goblet cells can be detected with high accuracy using only the H&E slides. This is the most common routine stain in pathology, and it suggests that the more time-consuming and costly specialized staining, TFF3, could be skipped (see Figure 2 below).

Fig1
Figure 1: The top-left contains a thumbnail image of an H&E slide with goblet cells. In the bottom left, the attention maps of the AI model show which image regions the model uses to make its final prediction. Zooming in to those areas (bottom right), we see that image parts that receive high attention contain goblet cells. We validate that these are indeed goblet cells by looking at the corresponding TFF3 slide (top right), where goblet cells are shown as brown.

In the paper, we further discuss different AI-assisted workflows designed to optimize the screening process. The first workflow necessitates a pathologist’s review only if either the H&E or TFF3 models predict a sample as positive. This method can achieve the same diagnostic performance as the existing manual workflow in terms of sensitivity and specificity, potentially reducing the pathologist’s workload by 52% (see Figure 3 below).

The second proposed workflow reduces the need for pathologist review by 63% of the original load, by restricting reviews to positive predictions from the H&E model only. However, this comes at slightly reduced sensitivity, since goblet cells are more clearly visible in the TFF3 stain.

diagram
Figure 2: Proposed AI-assisted workflows. a) Workflow “Pathologist reviews any positives” b) Workflow “Pathologist reviews H&E model positives”
Proposed AI-assisted workflow Pathologist review (per-cent of all cases) TFF3 staining required (per-cent of all cases) Sensitivity @ Specificity 1.00
Pathologist reviews any positive 48% 100% 1.00
Pathologist reviews H&E model positives 37% 37% 0.91
Figure 3: Quantitative comparison of the proposed workflows. For the two workflows described in Figure 2, we compare the pathologist workload as a fraction of the total number of cases, the amount of images for which a costly TFF3 stain is required, and the resulting accuracy numbers.

Our collaboration with Cyted demonstrates the transformative potential of integrating advanced AI models into clinical workflows, saving valuable time for pathologists. As we move forward, the scalability of this technology holds the promise for widespread adoption of early detection in the fight against esophageal cancer.

“This represents a significant step in our fight against esophageal cancer, offering the potential to save countless lives through early detection with our minimally-invasive capsule sponge technology,” said Cyted CEO Marcel Gehrung. “Our collaboration with Microsoft Research has been instrumental in pushing the boundaries of what’s possible in medical imaging and screening technologies, creating optimal efficiencies from start to finish of the testing process.”

We have open sourced code to build these models (opens in new tab), which is designed to be scalable to very large datasets, using Azure Machine Learning (opens in new tab). This flexibility allows other researchers and institutions to adapt and enhance our code according to their specific needs. Importantly, our code represents a significant advancement over previous work in the field. Unlike earlier approaches that focused solely on training the multiple instance and attention layers, our code allows for end-to-end fine-tuning, including the image encoder. This comprehensive approach to training ensures optimal performance and accuracy, setting a new standard for AI models in histopathology. 

“The open sourcing of this code has helped us to advance our research in the field of early cancer detection,” said Florian Markowetz, Professor of Computational Oncology at the University of Cambridge, and Senior Group Leader at Cancer Research UK Cambridge Institute. “Several key features will soon be integrated into ongoing clinical trials, where we aim to improve the detection of Barrett’s esophagus in patients and ultimately treat more cancers through early intervention. Furthermore, these features will help improve the workflow of pathologists and identify key regions quicker, enabling clinicians to tackle more cases with greater reliability.”

By sharing our work, we aim not only to enhance the detection of BE and esophageal cancer, but also to empower researchers and clinicians around the world to leverage this technology in their fight against cancer[1]. Because our code can be used as a building block to develop AI models for histopathology slides, it may also potentially be applied to other cancer types. It is our hope that this open-source initiative will foster innovation and collaboration, and ultimately lead to breakthroughs that save lives.

As researchers, it has been exciting to work closely with Cyted and be part of the long path towards early detection of esophageal cancer. Cross-discipline collaborations like this are excellent opportunities to solve complex clinical problems. With AI models built using the principles of responsible AI like fairness, privacy and security, and reliability and safety, we can ultimately make a tangible difference to patient outcomes.

Acknowledgement

Thank you to the team: Kenza Bouzid, Harshita Sharma, Sarah Killcoyne, Daniel C. Castro, Anton Schwaighofer, Max Ilse, Valentina Salvatelli, Ozan Oktay, Sumanth Murthy, Lucas Bordeaux, Luiza Moore, Maria O’Donovan, Anja Thieme, Hannah Richardson, Aditya Nori, Marcel Gehrung, Javier Alvarez-Valle


[1] (opens in new tab) Code released for research use only. Full disclaimer here: https://github.com/microsoft/be-trans-mil (opens in new tab)

The post Scaling early detection of esophageal cancer with AI appeared first on Microsoft Research.

Read More

Improving LLM understanding of structured data and exploring advanced prompting methods

Improving LLM understanding of structured data and exploring advanced prompting methods

This research paper was presented at the 17th ACM International Conference on Web Search and Data Mining (opens in new tab) (WSDM 2024), the premier conference on web-inspired research on search and data mining.

WSDM logo in white to the left of the first page of the

In today’s data-driven landscape, tables are indispensable for organizing and presenting information, particularly text. They streamline repetitive content, enhance data manageability, enable easier data analysis, and improve machine processing capabilities. Meanwhile, large language models (LLMs) are advancing in their ability to tackle challenges associated with natural language, but the degree to which they understand tables included in their prompts remains an open question. Our research aims to explore this question and improve how LLMs use and work with table-based data.

Our paper, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study (opens in new tab),” presented at WSDM 2024 (opens in new tab), investigates what kinds of prompts most effectively enable LLMs to understand tables; how much LLMs inherently detect structured data; and how LLMs’ existing knowledge can be harnessed to improve this understanding. We also analyze the complex trade-off among multiple combinations of input designs and overall performance.

To address these questions, we propose a new benchmark called Structural Understanding Capabilities (SUC), shown in Figure 1 (a), which focuses on specific tasks to assess LLMs’ ability to understand structured data in tables and compare different types of prompts. We conducted a series of experiments using different prompt designs. Our findings, detailed in the paper, evaluate how each design enhances LLMs’ ability to work with tables. 

The image (a) is a flowchart with three main columns that illustrate the stages, capabilities, and tasks associated with a process benchmarked by SUC (Semantic Understanding Capability), and their application in input designs. Here is the detailed alt text for the image: Flowchart illustrates the detailed design of the Semantic Understanding Capability Benchmark. The leftmost column is labeled 'Stages' with two main stages: 'Partition & Parsing' in blue and 'Search & Retrieval' in pink. Each stage is associated with 'Capabilities' in the middle column. 'Partition & Parsing' includes 'Structural Description Detection', 'Format Understanding', and 'Hierarchy Detection'. 'Search & Retrieval' includes 'Grounding/Locating' and 'Operation Reasoning'. These capabilities correspond to 'Tasks' in the third column. For 'Partition & Parsing', tasks are 'Table Partition', 'Table Size Detection', and 'Hierarchy Detection'. For 'Search & Retrieval', tasks are 'Cell Lookup & Reverse Lookup' and 'Column & Row Retrieval'.  

 

To the right of these columns is image (b) labeled 'Input Designs' connected to 'Partition Mark', 'Serialization', 'Role Prompting', 'Order Permutation', and 'Format Explanation'. These are further linked to types of 'Markup Languages' represented in green boxes: 'HTML', 'XML', 'Markdown', and more indicated by ellipses. Image (b) covers the input designs for the SUC evaluation.
Figure 1. The SUC benchmark and prompt designs for evaluation.

Insights and findings using the SUC benchmark

Based on humans’ perception of tables, we developed tasks to evaluate how LLMs understand them. We conducted evaluations on GPT-3.5 and GPT-4 and discovered that the results depended on certain input factors, such as table format, content order, and partition marks. The findings, detailed in Tables 1 and 2, reveal some notable and unexpected findings:

  • Delimiter-separated formats (e.g., CSV, TSV), underperformed compared with HTML by 6.76 percent.
  • Using HTML and few-shot learning consistently improved performance. The effectiveness of other approaches, such as format explanation, role prompting, order change, and partition marks, varied depending on task difficulty and the required capacity.
  • Despite the simplicity of the benchmark tasks, the highest overall accuracy across seven tasks is only 65.43 percent. This underscores the need for LLMs to have better awareness of table structures and highlights areas for further improvement in table serialization.

Our exploration suggests that:

  • LLMs have a basic understanding of table structures but are far from perfect, even in straightforward tasks like detecting the number of columns and rows.
  • Choosing the right combination of input designs can significantly enhance LLMs’ understanding of structured data.

Our findings revealed significant performance gaps in downstream tasks, attributed to the different combinations of serialization functions and input options. These gaps remained even with GPT-4, underscoring the effectiveness of our benchmark approach.

This is a table regarding the comparison table displaying the accuracy (Acc) of GPT-4 versus previous models in different tasks. Tasks include Table Partition, Cell Lookup, Reverse Lookup, Column Retrieval, Row Retrieval, Size Detection, and Merged Cell Detection. The data formats compared are NL + Sep, Markdown, JSON, XML, and HTML. GPT-4 shows improved accuracy across nearly all tasks and formats compared to its predecessors, with notable high accuracy in the HTML format for Table Partition and Merged Cell Detection tasks.
Table 1. SUC benchmark evaluations on table formats.
This table presents the comparison of accuracy (Acc) and changes in accuracy (Δ) for different input designs using GPT-4 on various tasks. The tasks include Table Partition, Cell Lookup, Reverse Lookup, Column Retrieval, Row Retrieval, Size Detection, and Merged Cell Detection. The input designs tested are Markup Language HTML with and without various components such as format explanation, partition mark, role prompting, and change order, as well as without 1-shot learning. The last row shows the performance of GPT-4 with Language HTML. The table displays positive and negative changes in percentages with respective tasks, highlighting the impact of each input design modification on the model's accuracy.
Table 2. Ablation study of input designs using the SUC benchmark.

Improved performance with self-augmented prompting

Based on these benchmark evaluations, we investigated how LLMs’ existing knowledge could be used to enhance their understanding of structured data. To do this, we introduced self-augmentation, a model-agnostic technique that improves structural prompting—enabling LLMs to identify key values and ranges by tapping into their own internal knowledge. This technique simplifies and optimizes how LLMs utilize their existing knowledge base to improve their understanding of structured content, allowing them to generate intermediate structural insights. This process is shown in Figure 2, with the results detailed in Table 3.

The image depicts a diagram showing the Self-augmented Prompting workflow that involves an initial table, an intermediate output, and a final output. Here is the detailed alt text for the image: On the left, there's a table with the title 'Antoine Salamin' and columns labeled 'Year', 'Team', 'Driver', 'Races', and 'Pos'. Two rows are visible with the years 1983 and 1989, team name starting with 'Swit...', driver name starting with 'Antoine...', and positions '29th' and '7th' highlighted in the last visible row. Below the table is a box labeled 'Table & Other info' and an arrow pointing right labeled '1st request' with the text 'Identify critical values and ranges of the table'. 

 

In the center, a green box with rounded corners titled 'Intermediate Output' contains text summarizing the table's content, mentioning Antoine Salamin's results from 1983 to 1989, the number of races, podiums, and points range. There's an arrow looping back to the first box with 'LLM' written above it, indicating a feedback loop for further processing. 

 

On the right, a blue box with rounded corners titled 'Final Output' contains a narrative description saying 'In 1989, Antoine Salamin drove a Porsche 962C for the Swiss Team Salamin, powered by a Porsche turbo Flat-6 engine. He competed in two races, achieving one podium and 17 points, finishing 7th overall.' An arrow labeled '2nd request' points from the '1st request' to the 'Intermediate Output' and another from there to the 'Final Output', indicating the sequence of processing requests.
Figure 2. Self-augmented prompting.
This table is comparing the accuracy (Acc) and BLEU scores for different types of input choices on various question-answering datasets: TabFact, HybridQA, SQA, Feverous, and ToTTo. The types include 1-shot and self-explanation approaches (SA) with various modifications such as without table size, partition mark, format explanation, role prompting, critical values and ranges identification, and structural information description. Each row shows the impact of these modifications on the model's performance, with accuracy percentages for the datasets and BLEU-1 to BLEU-4 scores for the ToTTo dataset.
Table 3. Evaluation of downstream tasks. “SA” refers to self-augmented prompting.

Looking forward

Our study sets a key benchmark in expanding the capabilities of LLMs to better understand structured table data, moving beyond conventional natural language processing tasks. We suggest future research should prioritize the integration of structural information to improve performance with various structured data types. Additionally, we propose exploring LLMs’ ability to use external tools or agents for improved handling of structured data, opening new avenues for application.

The post Improving LLM understanding of structured data and exploring advanced prompting methods appeared first on Microsoft Research.

Read More

Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies

Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies

Chris Bishop at Research Forum

Research advances are driving real-world impact faster than ever. Recent developments in AI are reshaping the way people live, work, and think. In the latest episode of Microsoft Research Forum (opens in new tab), we explore how AI is transforming health care and the natural sciences, the intersection of AI and society, and the continuing evolution of foundational AI technologies. 

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon.

Keynote: The Revolution in Scientific Discovery

Chris Bishop, Technical Fellow and Director, Microsoft Research AI4Science 

As in our debut event on January 30, this edition of Research Forum began with a keynote address by a leader from Microsoft Research. Chris Bishop shared some exciting real-world progress being made by his team toward modelling and predicting natural phenomena.

Chris Bishop: “In my view, the most important use case of AI will be to scientific discovery. And the reason I believe this is that it’s our understanding of the natural world obtained through scientific discovery, together with its application in the form of technology that has really transformed the human species.”

Panel discussion: Transforming the natural sciences with AI

Bonnie Kruft, Partner Deputy Director, Microsoft Research AI4Science (Host)
Rianne van den Berg, Principal Research Manager, Microsoft Research AI4Science 
Tian Xie, Principal Research Manager, Microsoft Research AI4Science 
Tristan Naumann, Principal Researcher, Microsoft Research Health Futures 
Kristen Severson, Senior Researcher, Microsoft Research New England 
Alex Lu, Senior Researcher, Microsoft Research New England

In a discussion hosted by Bonnie Kruft, Microsoft researchers presented their latest advancements in the fields of foundation models, drug discovery, material design, and machine learning. Panelists highlighted deep learning’s growing impact on the natural sciences.

Tristan Naumann: “Much of the data we have in healthcare is not nicely structured in a clean and easy to use way. And so, one of the things that’s really incredible about some of these recent advances in generative AI, specifically large language models (and) multimodal models, is really this opportunity to have a tool for universal structuring and unlocking some of that data quickly and efficiently, really opens up a lot of new opportunities.” 

Tian Xie: “Similar (to) the field of health and in biology, machine learning is really beginning to interrupt some of the traditional pipelines that happened in materials discovery.”

Kristen Severson: “We have a lot of knowledge about diseases and how they manifest and we don’t want to leave that information on the table when we train a machine learning model. So, there’s not an interest in using solely black box approaches, but instead (in) using what’s already known.”

Alex Lu: “If you look at what particularly differentiates biology and I suspect by extension a lot of other scientific disciplines, the whole point is to try to discover something new. So, by definition, what that new thing is is not going to be captured in your original distribution of data.” 

Rianne van den Berg: “One particular class of generative models that I’m very excited about and that’s becoming increasingly popular is that of diffusion models and score-based generative models. These models have been super successful already, for instance in high resolution image generation and video, and they’re also very naturally suited to target scientific discovery.” 

Lightning talk: What’s new in AutoGen? 

Chi Wang, Principal Researcher, Microsoft Research AI Frontiers 

Chi Wang presented the latest updates on AutoGen – the multi-agent framework for next generation AI applications. The discussion covered milestones achieved, community feedback, exciting new features, and the research and related challenges on the road ahead. He also announced a recent milestone. 

Chi Wang: “Our initial multiagent experiment on the challenging GAIA benchmark turned out to achieve the number one accuracy in the leaderboard in all three levels. That shows the power of AutoGen in solving complex tasks and big potential.”

Lightning talk: The metacognitive demands and opportunities of generative AI

Lev Tankelevitch, Senior Behavioral Science Researcher, Microsoft Research Cambridge (UK)

Lev Tankelevitch explored how metacognition—the psychological capacity to monitor and regulate one’s thoughts and behaviors—provides a valuable lens for understanding and addressing the usability challenges of generative AI systems. This includes prompting, assessing and relying on outputs, and workflow optimization, which require a high degree of metacognitive monitoring and control.

Lev Tankelevitch: “We believe that a metacognitive perspective can help us analyze, measure, and evaluate the usability challenges of generative AI, and it can help us design generative AI systems that can augment human agency and workflows.”

Lightning talk: Getting modular with language models: Building and reusing a library of experts for task generalization

Alessandro Sordoni, Principal Researcher, Microsoft Research Montreal

Alessandro Sordoni discussed recent research on building and re-using large collections of expert language models to improve zero-shot and few-shot generalization to unseen tasks.

Alessandro Sordoni: “Looking forward, I believe that an exciting direction would be to push this to fully decentralized training and continual improvement of language models in the sense that users can train their experts, then share them in the platform and the model gets better.” 

Lightning talk: GigaPath: Real-World Pathology Foundation Model

Naoto Usuyama, Principal Researcher, Microsoft Research Health Futures

Naoto Usuyama presented GigaPath, a novel approach for training large vision transformers for gigapixel pathology images, utilizing a diverse, real-world cancer patient dataset, with the goal of laying a foundation for cancer pathology AI.

Naoto Usuyama: “This project (GigaPath) is not possible without many, many collaborators, and we are just scratching the surface. So, I’m very excited, and I really hope we can unlock the full potential of real-world patient data and advanced AI for cancer care and research.”

Lightning talk: Generative AI and plural governance: Mitigating challenges and surfacing opportunities

Madeleine Daepp (opens in new tab), Senior Researcher, Microsoft Research Redmond
Vanessa Gathecha (opens in new tab), Applied Researcher and Policy Analyst, Baraza Media Lab

This talk featured two expert speakers. Madeleine Daepp discussed the potential impacts and challenges of generative AI in a year with over 70 major global elections. Vanessa Gatheca, a 2024 Microsoft AI and Society fellow (opens in new tab), discussed her work on disinformation in Kenya and Sub-Saharan Africa.

Madeleine Daepp: “The disruption of our digital public sphere is an all-of-society problem that requires an all-of-society response. The AI and Society fellows program is helping to build much needed connections across places, across academic disciplines, and across societal sectors to help us understand the problem and work toward an impactful response.” 

The post Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies appeared first on Microsoft Research.

Read More

Research Focus: Week of March 4, 2024

Research Focus: Week of March 4, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus 
Week of March 4, 2024

Generative Kaleidoscopic Networks

Neural networks are deep learning models that can be trained to learn complex patterns and relationships within data. In a recent paper: Generative Kaleidoscopic Networks, researchers from Microsoft detail how they discovered an “over-generalization” phenomenon, which indicates that the neural networks tend to learn many-to-one mappings. They then use this phenomenon to introduce a new paradigm of generative modeling by creating a dataset kaleidoscope, dubbed ‘Generative Kaleidoscopic Networks.’ The researchers are exploring theoretical explanations, experiments on multimodal data, and conditional generation using the Generative Kaleidoscopic Networks.

MNIST Kaleidoscope: Manifold learning is done on the MNIST data images with a Multilayer Perceptron model. We start with input noise vector sampled from a Uniform distribution and run the kaleidoscopic sampling algorithm. The transitioning between images demonstrate a kaleidoscopic effect until eventually the samples found a stable minima and converge at a digit.
MNIST Kaleidoscope: Manifold learning is done on the MNIST data images with a Multilayer Perceptron model. We start with input noise vector sampled from a Uniform distribution and run the kaleidoscopic sampling algorithm. The transitioning between images demonstrate a kaleidoscopic effect until eventually the samples found a stable minima and converge at a digit.

Spotlight: AI-POWERED EXPERIENCE

Microsoft research copilot experience

Discover more about research at Microsoft through our AI-powered experience


Text Diffusion with Reinforced Conditioning

Diffusion models are a type of machine learning model that have shown exceptional ability to generate high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they offer potential for achieving better non-autoregressive sequence generation—which simultaneously predicts all elements of a sequence, rather than predicting the next element in a sequence.

However, existing text diffusion models have yet to fulfill this potential, due to challenges in handling the discreteness of language. In a recent paper: Text Diffusion with Reinforced Conditioning, researchers from Microsoft and external colleagues uncover two significant limitations in text diffusion models: degradation of self-conditioning during training and misalignment between training and sampling. In response, the researchers propose a novel model called TREC, which empowers text diffusion models with reinforced conditioning, mitigating the degradation by directly motivating quality improvements from self-conditions with reward signals. In the paper, which was presented at the 2024 Association for the Advancement of Artificial Intelligence conference (AAAI), they further propose time-aware variance scaling to address the misalignment issue.

Extensive experiments demonstrate the competitiveness of TREC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.


PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem

Temporal action abstractions, along with belief state representations, are powerful knowledge sharing mechanisms for sequential decision making. In a recent paper, PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem, researchers from Microsoft and University of Maryland propose a novel connection between the seemingly distant realms of training large language models (LLMs) and inducing temporal action abstractions for continuous control domains such as robotics. The researchers introduce an approach called Primitive Sequence Encoding (PRISE) that combines continuous action quantization with a subtle but critical component of LLM training pipelines — input tokenization via byte pair encoding (BPE) – to learn powerful variable-timespan action abstractions. They empirically show that high-level skills discovered by PRISE from a multitask set of robotic manipulation demonstrations significantly boost the performance of both multitask imitation learning and few-shot imitation learning on unseen tasks.

The post Research Focus: Week of March 4, 2024 appeared first on Microsoft Research.

Read More

Orca-Math: Demonstrating the potential of SLMs with model specialization

Orca-Math: Demonstrating the potential of SLMs with model specialization

abstract wave lines on a gradient background

Our work on Orca and Orca 2 demonstrated the power of improved training signals and methods to enhance the reasoning abilities of smaller language models, getting closer to the levels found in much larger language models. Orca-Math is another step in this direction, where we explore the capabilities of small language models (SLMs) when specialized in a certain area, in this case solving grade school math problems, which has long been recognized as a complex task for SLMs.

Orca-Math is a 7 billion parameters model created by fine-tuning the Mistral 7B model. Orca-Math achieves 86.81% on GSM8k pass@1, exceeding the performance of much bigger models including general models (e.g. LLAMA-2-70, Gemini Pro and GPT-3.5) and math-specific models (e.g. MetaMath-70B and WizardMa8th-70B). Note that the base model (Mistral-7B) achieves 37.83% on GSM8K.

Alt Text: Bar graph comparing GSM8K score of different models with an upward trend in quality. The models are LLAMA-2-70, GPT-3.5, Gemini Pro,  WizardMath-70B, MetaMath-70B and Orca-Math-7B. The graph shows that the Orca-Math-7B model outperforms other bigger models on GSM8K.
 Bar graph comparing GSM8K score of different models with an upward trend in quality. The models are LLAMA-2-70, GPT-3.5, Gemini Pro,  WizardMath-70B, MetaMath-70B and Orca-Math-7B. The graph shows that the Orca-Math-7B model outperforms other bigger models on GSM8K.

The state-of-the-art (SOTA) performance of the Orca-Math model can be attributed to two key insights:

  • Training on high-quality synthetic data with 200,000 math problems, created using multi-agents (AutoGen). This is smaller than other math datasets, which could have millions of problems. The smaller model and smaller dataset mean faster and cheaper training.
  • In addition to traditional supervised fine-tuning, the model was trained using an iterative learning process, where the model is allowed to practice solving problems and continues to improve based on feedback from a teacher.

Our findings show that smaller models are valuable in specialized settings, where they can match the performance of much larger models while also highlighting the potential of continual learning and using feedback to improve language models. We are making the dataset (opens in new tab) publicly available, along with a report (opens in new tab) describing the training procedure to encourage research on the improvement and specialization of smaller language models.

Teaching SLMs math

Solving mathematical word problems has long been recognized as a complex task for SLMs. Models that achieve over 80% accuracy on the GSM8K benchmark (GSM8K, which stands for Grade School Math 8K, is a dataset of 8,500 high-quality grade school mathematical word problems that require multi-step reasoning) typically exceed 30 billion parameters.

To reach higher levels of performance with smaller models, researchers often train SLMs to generate code, or use calculators to help avoid calculation errors. Additionally, they employ a technique called ensembling, in which the model is called up to 100 times, with each call reattempting to solve the problem. Ensembling provides a substantial boost in accuracy but at a significant increase in compute cost increase, due to multiple calls to the model. 

This research aims to explore how far we can push the native ability of smaller language models when they are specialized to solve math problems, without the use of external tools, verifiers or ensembling. More specifically, we focus on two directions:

AgentInstruct

Previous work on synthetic data creation often uses frontier models to generate similar problems based on a seed problem. Providing paraphrases of the seed with different numbers and attributes can be useful for creating training data for the smaller model. We propose employing multi-agent flows, using AutoGen, to create new problems and solutions, which can not only create more demonstrations of the problem but also increase the diversity and range of difficulty of the problems. 

To generate more challenging problems, we create a setup with a team of agents working collaboratively to create a dataset geared toward a predefined objective. For example, we can use two agents, namely Suggester and Editor. The Suggester examines a problem and proposes several methods for increasing its complexity, while the Editor takes the original word problem and the Suggester’s recommendations to generate an updated, more challenging problem. This iterative process can occur over multiple rounds, with each round further increasing the complexity of the previously generated problem. A third agent can then verify that the problem is solvable and create the solution.

Iterative learning

Using high-quality training data that may elicit richer learning signals (e.g. explanations) has been shown to significantly improve SLM’s abilities in acquiring skills that had only emerged before at much larger scale.

This paradigm fits under a teacher-student approach where the large model (the teacher) is 
creating demonstrations for the SLM (the student) to learn from. In this work, we extend the teacher-student paradigm to iterative learning settings as follows:

  • Teaching by demonstration: In this stage, we train the SLM by using AgentInstruct to demonstrate problems and their solutions.
  • Practice and feedback: We let the SLM practice solving problems on its own. For every problem, we allow the SLM to create multiple solutions. We then use the teacher model to provide feedback on the SLM solutions. If the SLM is unable to correctly solve the problem, even after multiple attempts, we use a solution provided by the teacher.
  • Iterative improvement: We use the teacher feedback to create preference data showing the SLM both good and bad solutions to the same problem, and then retrain the SLM.

The practice, feedback, and iterative improvement steps can be repeated multiple times.

Conclusion

Our findings show that smaller models are valuable in specialized settings where they can match the performance of much larger models but with a limited scope. By training Orca-Math on a small dataset of 200,000 math problems, we have achieved performance levels that rival or surpass those of much larger models.

The relatively small size of the dataset also shows the potential of using multi-agent flows to simulate the process of data and feedback generation. The small dataset size has implications for the cost of training and highlights that training data with richer learning signals can improve the efficiency of the learning process. Our findings also highlight the potential of continual learning and the improvement of language models, where the model iteratively improves as it receives more feedback from a person or another model.

The post Orca-Math: Demonstrating the potential of SLMs with model specialization appeared first on Microsoft Research.

Read More

ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics

ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics

Figure 1. The general model architecture of ViSNet. (a) Model sketch of ViSNet. ViSNet embeds the 3D structures of molecules and extracts the geometric information through a series of ViSNet blocks and outputs the molecule properties such as energy, forces, and HOMO-LUMO gap through an output block. (b) Flowchart of one ViSNet Block. One ViSNet block consists of two modules: i) Scalar2Vec, responsible for attaching scalar embeddings to vectors.; ii) Vec2Scalar. The inputs of Scalar2Vec are the node embedding, edge embedding, direction unit and the relative positions between two atoms.

Molecular geometry modeling is a powerful tool for understanding the intricate relationships between molecular structure and biological activity – a field known as structure-activity relationships (SAR). The main premise of SAR is that the biological activity of a molecule is dictated by its specific chemical structure, not only the connections between nuclei but also how the molecule is twisted and arranged in a three-dimensional configuration. The holy grail in SAR is to be able to predict how molecular configurations influence vital processes such as drug interactions, chemical reactivity, and protein functionality. If this were possible, scientists could predict the efficacy of a drug, as well as its side effects and toxicity, long before it is ever tested on people.

The vector-scalar interactive graph neural network (ViSNet) framework, developed by Microsoft, is a novel approach to molecular geometry modeling. ViSNet is designed to help researchers predict molecular properties, simulate molecular dynamics, and gain a more precise understanding of structure-activity relationships. As a result, ViSNet has the potential to help transform drug discovery, materials science, and other critical fields.

Our research aims to improve the interpretability of molecular data, reduce computing costs, and evaluate real-world application utility. “Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing” was published in Nature Communications (opens in new tab) in January 2024 and selected for “Editors’ Highlights” in both the “AI and machine learning (opens in new tab)” and “biotechnology and method (opens in new tab)” categories.

MICROSOFT RESEARCH PODCAST

AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens

This episode features Senior Principal Research Manager Ahmed H. Awadallah, whose work improving the efficiency of large-scale AI models and efforts to help move advancements in the space from research to practice have put him at the forefront of this new era of AI.


Geometry deep learning and SAR

Geometry deep learning is a method at the forefront of SAR investigations: a powerful computational approach that harnesses the power of deep-learning techniques to analyze and understand the three-dimensional structures of molecules. Traditional deep-learning methods primarily focus on processing data organized in grid-like structures, such as images or sequences of text. However, molecules are inherently three-dimensional entities with complex geometries, making them challenging to analyze using conventional deep-learning approaches. Geometry deep learning addresses this challenge by building specialized architectures and algorithms capable of handling three-dimensional data. These methods enable computers to learn and extract meaningful features from the spatial arrangement of atoms within molecules, capturing crucial information about their structure and behavior. 

Despite significant recent strides in geometry deep learning, however, challenges persist. These include:  

  • Insufficient molecular interpretability – We are limited in our ability to understand and interpret the inner workings of deep neural networks when applied to molecular geometry modeling. While these networks excel at making predictions based on large datasets and complex patterns, they often operate as “black boxes,” meaning the rationale behind their predictions isn’t always understandable or transparent. In the context of molecular geometry, this lack of interpretability poses challenges in comprehending why certain molecular structures lead to specific outcomes, such as biological activity or chemical reactivity. 
  • Rapidly increasing computing costs as molecular size increases – As molecules increase in size and complexity, the computational resources required to analyze them escalate dramatically. This challenge becomes particularly pronounced when employing advanced computational techniques, such as those using high-order Clebsch–Gordan coefficients. The Clebsch–Gordan coefficients are mathematical quantities used in quantum mechanics to describe the coupling of the angular momentum properties of particles. In the context of molecular modeling, these coefficients are employed in sophisticated quantum mechanical calculations to help account for the interactions between electrons and nuclei within a molecule. For large molecules, the number of atoms and electrons involved increases exponentially, resulting in an astronomical number of possible interactions that must be considered. As a result, the calculations involving high-order Clebsch–Gordan coefficients become tremendously complex and computationally demanding. 
  • Need for blind tests and evaluations in real applications – Assessing predictive models in real-world applications through blind tests is crucial for evaluating their reliability and applicability beyond controlled benchmarks. However, challenges arise due to the scarcity of diverse and representative datasets, and complex system dynamics. There are also ethical considerations in animal and human trials, which naturally restrict the availability of such data. Overcoming these challenges requires interdisciplinary collaboration, innovative methodologies, and transparent validation frameworks to ensure the robustness and trustworthiness of predictive models in addressing real-world challenges. 

Enhancing molecular geometry representations by ViSNet 

Originally, our goal was to develop a model capable of effectively harnessing the intricate structures of molecules. Traditional molecular dynamics (MD) simulations track molecular movements by considering factors like bond length, bond angle, and dihedral angles. Taking inspiration from these methods, we introduced a novel approach called the vector-scalar interactive graph neural network (ViSNet).

Instead of directly integrating bond angle and dihedral information into our model in a straightforward manner, we introduced a concept termed “direction units.” These units represent nodes within the molecular structure as vectors, calculated by summing normalized vectors pointing from the central node to its neighboring nodes. We expanded traditional calculations of bond length, bond angle, and dihedral angles into interactions involving pairs of atoms (two-body), triplets of atoms (three-body), and quadruplets of atoms (four-body). To efficiently manage these interactions, we devised a runtime geometry calculation (RGC) module, which accurately captures the complex relationships between atoms in a molecule. Remarkably, the RGC module’s computations for three-body and four-body interactions exhibit linear time complexity, ensuring computational efficiency.   

Additionally, we introduced a mechanism known as vector-scalar interactive message passing (ViS-MP), facilitating the exchange of information between nodes and edges in the molecular graph. This mechanism iteratively updates the direction units of nodes based on scalar representations of nodes and edges, and vice versa, through the RGC module. These distinctive features of the RGC and ViS-MP significantly enhance our model’s capacity to encode molecular geometry and streamline the process of information exchange within the molecular graph neural network.

Figure 1. The general model architecture of ViSNet. (a) Model sketch of ViSNet. ViSNet embeds the 3D structures of molecules and extracts the geometric information through a series of ViSNet blocks and outputs the molecule properties such as energy, forces, and HOMO-LUMO gap through an output block. (b) Flowchart of one ViSNet Block. One ViSNet block consists of two modules: i) Scalar2Vec, responsible for attaching scalar embeddings to vectors.; ii) Vec2Scalar. The inputs of Scalar2Vec are the node embedding, edge embedding, direction unit and the relative positions between two atoms.
Figure 1. The general model architecture of ViSNet.

ViSNet in real-world applications for molecular modeling and property predictions

To gauge ViSNet’s practical utility, we rigorously evaluated its performance using established benchmarks for predicting molecular properties. Across a range of datasets, including MD17, revised MD17, MD22, QM9, and Molecule3D, ViSNet consistently outperformed existing algorithms, showcasing its exceptional accuracy in representing molecular geometry.

We then put ViSNet to the test by simulating the behavior of the Chignolin protein through molecular dynamics (MD) simulations. Trained on the AIMD-Chig dataset, featuring protein data calculated using advanced density functional theory (DFT) methods, ViSNet outperformed traditional empirical force fields and showed promise when compared to contemporary machine-learning force fields. Notably, simulations with ViSNet closely mirrored outcomes from rigorous DFT calculations, highlighting its potential for precise and efficient data simulations.

We used ViSNet to participate in the First Global AI Drug Development Competition (opens in new tab), an international competition to predict the inhibitors against the main protease of SARS-CoV-2, given the sequence information (i.e., SMILES) of small molecules. Worldwide, 1,105 participants from 878 teams took part in the competition. ViSNet helped us win the competition, demonstrating its promising prediction accuracy. 

Figure 2. ViSNet in the PyTorch Geometric Library. A PyTorch module that implements the equivariant vector-scalar interactive graph neural network (ViSNet) from the “Enhancing Geometric Representations for Molecules with Equivariant Vector-Scalar Interactive Message Passing” paper.
Figure 2. ViSNet in the PyTorch Geometric Library.

To make ViSNet more accessible and user-friendly, Microsoft has integrated it into the PyTorch Geometric Library (opens in new tab) as a core model for molecular modeling and property prediction. This integration aims to broaden the scope of applications and simplify the usage of ViSNet for researchers and practitioners. Additionally, to ensure ongoing support and improvement, a regularly updated version of ViSNet is now available on GitHub (opens in new tab), providing users with the latest enhancements.

Recognizing the potential limitations of graph neural networks, such as the risk of “over-smoothing” (i.e., making nodes indistinguishable from one another) as models grow larger and more complex, we developed a Transformer-based version of ViSNet known as Geoformer (short for Geometric Transformer). This novel variant, introduced in our publication at NeurIPS 2023 (opens in new tab), addresses scalability challenges by transferring the key components of ViSNet into the Transformer architecture. This includes incorporating the RGC module into the Transformer attention mechanism and introducing a new method called interatomic positional encoding (IPE) to capture spatial relationships between atoms.

Figure 3. The overall pipeline of AI2BMD (see demos at https://microsoft.github.io/AI2BMD/index.html).  Proteins are divided into protein units by fragmentation process. The AI2BMD potential is designed based on ViSNet, and the datasets are generated at DFT level. It calculates the energy and atomic forces for the whole protein. The AI2BMD simulation system is built upon all these components and provides a generalizable solution to perform simulations for various proteins. It makes ab initio accuracy on energy and force calculations. By comprehensive analysis from both kinetics and thermodynamics, AI2BMD exhibits good alignments with wet-lab experiment data and detects different phenomenon compared with molecular mechanics.
Figure 3. The overall pipeline of AI2BMD (see demos at https://microsoft.github.io/AI2BMD/index.html (opens in new tab)). 

Looking forward: Toward AI-powered MD simulations with ab initio accuracy

As a crucial component of the AI-powered Ab Initio Molecular Dynamics (AI2BMD) project (opens in new tab), ViSNet plays a pivotal role in accelerating molecular dynamics simulations. The project’s primary objective is to enhance the accuracy and efficiency of these simulations, with the aim of achieving results comparable to those obtained through rigorous ab initio methods, even for large molecular systems. 

By integrating ViSNet into AI2BMD, significant strides have been made toward achieving this goal. ViSNet enables AI2BMD to achieve levels of accuracy in energy and force calculations that closely approach those of ab initio methods, even for complex proteins containing over 10,000 atoms. By leveraging ViSNet in protein dynamics simulations, AI2BMD aims to enhance the precision of free energy estimations and provide valuable insights into protein folding thermodynamics. 

ViSNet’s contributions extend beyond energy calculations to the characterization of various protein properties. These insights have the potential to complement experimental research efforts by offering predictive capabilities and guiding further investigations into protein structure and function. The advancements in molecular geometry modeling, demonstrated by the innovative ViSNet framework, portend a new era of precision and efficiency in computational chemistry and biophysics.  

Through meticulous design and rigorous validation, ViSNet has emerged as a versatile tool capable of giving insight into the intricate relationships between molecular structure and biological activity – getting us one step closer to the holy grail of structure-activity relationships. The integration of ViSNet into established libraries and frameworks, coupled with ongoing research efforts to enhance scalability and accuracy, underscores its potential to revolutionize drug discovery, materials science, and more.

The post ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics appeared first on Microsoft Research.

Read More