Microsoft at ACM SIGCOMM 2023: Innovating the future of networking

Modern applications heavily rely on robust network infrastructure, requiring continuous innovation. In this evolving landscape, Microsoft is at the forefront, spearheading innovation efforts in networking and strengthening the foundational network infrastructure that underpins the cloud ecosystem. By investing in and enhancing this critical infrastructure, Microsoft not only ensures the resilience and scalability of cloud services but also lays the groundwork for the sophisticated and transformative applications that will continue to define the technological landscape.

ACM SIGCOMM, the premier annual conference of the Association for Computing Machinery’s special interest group on data communication (SIGCOMM), is dedicated to the study of communication and computer networks. Microsoft was proud to be a Gold Sponsor of this year’s conference, publishing 10 papers and participating in the organizing committee. Dave Maltz, technical fellow and corporate vice president of Azure Networking, served as one of the program committee chairs, helping to oversee the conference’s technical program. We are also proud to recognize the achievement of one of our youngest researchers, Siva Kakarla, who was named runner-up for the ACM SIGCOMM Dissertation Award for his thesis, “Formal Methods for a Robust Domain Name System.”

Microsoft also had a booth showcasing some of our latest technologies, including hollow core fiber-based connectivity, SONiC on smart switches, container networking, L3/L4-based DDoS protection, and the technologies we are building to extend the cloud into space, for both earth observation and satellite communication.

Paper highlights 

The papers Microsoft published at SIGCOMM 2023 span a wide spectrum of networking domains, ranging from 5G and wide area networks (WAN) to enterprise networks. They also explore various aspects of networking, including traffic engineering, network offload strategies, and specialized network designs tailored for applications like gaming, video conferencing, and financial services.   

Here are some of the highlights:

Switchboard: Efficient Resource Management for Conferencing Services 

Efficient resource management is crucial for conferencing services, such as Microsoft Teams, to balance user experience and cost-effectiveness. This involves optimizing the allocation of media processing servers, which handle media streams during calls. Rahul Bothra, Rohan Gandhi, Ranjita Bhagwan, Venkat Padmanabhan, Rui Liang, Steve Carlson, Vinayaka Kamath, Sreangsu Acharyya, Ken Sueda, Somesh Chaturmohta, and Harsha Sharma introduce Switchboard, a significant advancement in resource management controllers. Switchboard is peak-aware: because resource costs vary with peak usage times and peaks fall at different hours across time zones, a server can handle calls during its own region’s peak and act as a backup for other regions during its off-peak hours. Switchboard also improves efficiency by coordinating network and compute provisioning and by making resource allocation application-aware. An evaluation using Microsoft Teams data shows that Switchboard reduces provisioning costs by up to 51 percent while maintaining or improving latency compared to existing solutions.
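
As a back-of-the-envelope illustration of why peak-awareness matters, consider three regions whose demand peaks are time-shifted: pooling capacity across them requires provisioning for the combined peak rather than the sum of the individual peaks. The demand numbers below are invented, and this sketch is not the Switchboard controller itself, which also accounts for latency, network provisioning, and application constraints.

```python
# Illustrative only: why time-shifted peaks make pooled, peak-aware provisioning cheaper.

# Hourly demand (in "servers needed") for three regions whose peaks are time-shifted.
hourly_demand = {
    "US":     [2, 2, 3, 5, 9, 12, 9, 5, 3, 2, 2, 2],
    "Europe": [5, 9, 12, 9, 5, 3, 2, 2, 2, 2, 3, 5],
    "Asia":   [12, 9, 5, 3, 2, 2, 2, 2, 3, 5, 9, 12],
}

# Peak-unaware: each region is provisioned for its own peak independently.
per_region_peak = sum(max(series) for series in hourly_demand.values())

# Peak-aware pooling: provision for the peak of the combined demand, letting
# servers in an off-peak region back up a region that is at its peak.
pooled_peak = max(sum(hour) for hour in zip(*hourly_demand.values()))

print(f"independent provisioning: {per_region_peak} servers")   # 36
print(f"pooled, peak-aware provisioning: {pooled_peak} servers")  # 20
```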

Resilient Baseband Processing in Virtualized RANs with Slingshot 

In the realm of cellular networks, virtualized radio access networks (vRANs) are gaining prominence, replacing traditional specialized hardware with software running on commodity servers. However, current vRAN setups lack resilience, making it difficult to implement failover mechanisms and upgrades without prolonged service interruptions. Nikita Lazarev, Tao Ji, Anuj Kalia, Daehyeok Kim, Ilias Marinos, Francis Y. Yan, Christina Delimitrou, Zhiru Zhang, and Aditya Akella propose Slingshot, a system designed to seamlessly add resilience to the most critical layer of vRANs, the physical layer (PHY). Slingshot accomplishes this with novel techniques for real-time workload migration, fast RAN protocol middleboxes, and real-time RAN failure detection. A key insight in Slingshot’s design is to treat the transient disruptions caused by resilience events like ordinary wireless signal impairments, exploiting cellular networks’ built-in tolerance of such impairments. Experiments on a state-of-the-art 5G vRAN testbed show that Slingshot handles PHY failover without interrupting a video conferencing session and with under 110 microseconds of disruption to a TCP connection. It also enables zero-downtime upgrades in vRAN deployments.

DBO: Response Time Fairness for Cloud-Hosted Financial Exchanges 

When hosting financial exchanges in the cloud, ensuring equal and predictable latency for all market participants is critical, especially for high-speed trading. Existing cloud deployments often struggle to provide such fairness because of congestion and differing network paths. In this paper, Prateesh Goyal, Eashan Gupta, Ilias Marinos, Chenxingyu Zhao, Radhika Mittal, and I (Ranveer Chandra) tackle the lack of determinism in cloud networks, showing that predictable or bounded latency is not actually necessary to ensure fairness. Inspired by the concept of logical clocks in distributed systems, the paper introduces Delivery Based Ordering (DBO), a new approach that corrects for latency discrepancies among participants and thereby helps ensure fairness. An evaluation of DBO, conducted both on a hardware testbed and in a public cloud deployment, demonstrates that it can guarantee fairness while sustaining sub-100 microsecond latency even at high transaction rates.
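
To make the intuition concrete, the toy sketch below orders trades by how quickly each participant responded to the market data it received, rather than by when the orders happened to reach the exchange. The class and function names are assumptions for illustration only; the actual DBO design in the paper addresses measurement, batching, and correctness issues that this sketch omits.

```python
# Illustrative sketch of delivery-based ordering, not the DBO algorithm as specified in the paper.

from dataclasses import dataclass

@dataclass
class Order:
    participant: str
    data_delivery_time: float  # when market data reached this participant (its local clock)
    submit_time: float         # when the participant submitted its order (same local clock)

    @property
    def response_time(self) -> float:
        # How long the participant took to react, independent of network delivery skew.
        return self.submit_time - self.data_delivery_time

def sequence_orders(orders: list[Order]) -> list[Order]:
    """Rank orders by how quickly each participant reacted to the market data,
    rather than by when the orders happened to arrive at the exchange."""
    return sorted(orders, key=lambda o: o.response_time)

if __name__ == "__main__":
    orders = [
        Order("A", data_delivery_time=10.0, submit_time=10.3),  # reacted in 0.3
        Order("B", data_delivery_time=12.0, submit_time=12.1),  # reacted in 0.1, but its data arrived later
    ]
    print([o.participant for o in sequence_orders(orders)])  # ['B', 'A']
```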

For the complete list of accepted publications by Microsoft researchers, please see the publication list on the Microsoft at SIGCOMM 2023 page.

Photo: a group of researchers attending SIGCOMM 2023, standing in front of multiple buildings.

Learn about opportunities

Microsoft welcomes talented individuals across a variety of roles at Microsoft Research, Azure Networking, and other teams. Whether you are a networking partner or a researcher, we invite you to collaborate with us in advancing computer networking and to join the team building solutions to the industry’s toughest challenges. Review our open positions on the Microsoft Research website.

AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens

Episode 149 | Sept. 14, 2023

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.  

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity.

This episode features Senior Principal Research Manager Ahmed H. Awadallah, whose work improving the efficiency of large-scale AI models and efforts to help move advancements in the space from research to practice have put him at the forefront of this new era of AI. Awadallah discusses the shift in dynamics between model size and amount—and quality—of data when it comes to model training; the recently published paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4,” which further explores the use of large-scale AI models to improve the performance of smaller, less powerful ones; and the need for better evaluation strategies, particularly as we move into a future in which Awadallah hopes to see gains in these models’ ability to continually learn.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more inspired to work in the field than right now. The release of GPT-4 was a watershed moment in the pursuit of artificial intelligence, and yet progress continues to accelerate. The latest large-scale AI models and the systems they power are continuing to exhibit improvements in reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in large-scale AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Ahmed Awadallah. Ahmed is a Senior Principal Researcher at Microsoft Research in Redmond. Much of his work focuses on machine learning, helping to create foundation models that excel at key tasks while using less compute and energy. His work has been at the leading edge of recent progress in AI and gives him a unique perspective on where it will go next.


[MUSIC FADES] 

All right, Ahmed, let’s dive right in. Among other things, I find that people are hungry to understand the drivers of the progress we’re seeing in AI. Over these last few years when people like you or I have tried to explain this, we’ve often pointed to some measure of scale. You know, I know many times as I’ve given talks in AI, I’ve shown plots that feature some kind of up-and-to-the-right trend in scale over time—the increasing size of the AI models we’re training, the increasing size of the datasets we’re using to train them on, or even the corresponding increase in the overall compute budget. But when you double-click into this general notion of scale related to large AI models, what gets exposed is really a rapidly evolving frontier of experimental science. So, Ahmed, I’m going to start with a big question and then we can kind of decompose it from there. As someone at the forefront of all of this, how has your understanding of what’s driving progress in AI changed over this last year?

AHMED AWADALLAH: Thanks, Ashley. That’s a very good question. And the short answer is it’s changed a lot. I think I have never been learning as much as I have been throughout my career. Things are moving really, really fast. The progress is amazing to witness, and we’re just learning more and more every day. To your point, for quite some time, we were thinking of scale as the main driver of progress, and scale is clearly very important and necessary. But over the last year, we have been also seeing many different things. Maybe the most prominent one is the importance of data being used for training these models. And that’s not very separate from scale, because when we think about scale, what really matters is how much compute we are spending in training these models. And you can choose to spend that compute in making the model bigger or in training it on more and more data, training it for longer. And it has been over the past few years a lot of iterations in trying to understand that. But it has been very clear over the last year that we were, in a sense, underestimating the value of data in different ways: number one, in having more data but even more important, the quality of the data, having cleaner data, having more representative data, and also the distribution or the mixing of the data that we are using. Like, for example, one of the very interesting things we have witnessed maybe over the last year to year and a half is that a lot of the language models are being trained on text and code. And surprisingly, the training on code is actually helping the model a lot—not just in coding tasks but in normal other tasks that do not really involve coding. More importantly, I think one of the big shifts last year in particular—it has been happening for quite some time but we have been seeing a lot of value for it last year—is that there are now like two stages of training these models: the pretraining stage, where you are actually training the language model in an autoregressive manner to predict the next word. And that just makes it a very good language model. But then the post-training stage with the instruction tuning and RLHF (reinforcement learning from human feedback) and reward models, using a very different form of data; this is not self-supervised, freely available data on the internet anymore. This is human-generated, human-curated, maybe a mixture of model- and human-curated data that’s trying to get the model to be better at very specific elements like being more helpful or being harmless. 

LLORENS: There’s so much to unpack even in that, in that short answer. So let’s, let’s dig in to some of these core concepts here. You, you teed up this notion of ways to spend compute, you know, ways to spend a compute budget. And one of the things you said was, you know, one of the things we can do is make the model bigger. And I think to really illustrate this concept, we need to, we need to dig in to what that means. One, one concept that gets obfuscated there a little bit is the architecture of the model. So what does it mean to make the model bigger? Maybe you can tell us something about, you know, how to think about parameters in the model and how important is architecture in that, in that conversation.

AWADALLAH: So most of the progress, especially in language and other domains, as well, have been using the transformer model. And the transformer model have been actually very robust to change over the years. I don’t … I think a lot … I’ve asked a lot of experts over the years whether they had expected the transformer model to be still around five, six years later, and most of them thought we would have something very different. But it has been very robust and very universal, and, yes, there have been improvements and changes, but the core idea has still been the same. And with dense transformer models, the size of the model tends to be around the number of layers that you have in the model and then the number of parameters that you have in each layer, which is basically the depths and the widths of the model. And we have been seeing very steady exponential increase in that. It’s very, it’s very interesting to think that just like five years ago when BERT came up, the large model was like 300-something million parameters and the smaller one was 100 million parameters. And we consider these to be really large ones. Now that’s a very, very small scale. So things have been moving and moving really fast in making these models bigger. But over the time, there started to be an understanding being developed of how big should the model be. If I were to invest a certain amount of compute, what should I do with that in terms of the model size and especially on how it relates to the data side? And, perhaps, one of the most significant efforts there was the OpenAI scaling laws, which came up in 2020, late 2020, I think. And it was basically saying that if you are … if you have 10x more compute to spend, then you should dedicate maybe five of that … 5x of that to making the model bigger—more layers, more width—and maybe 2x to making the data bigger. And that translated to … for, like say, GPT-3-like model being trained on almost 300 billion tokens, and for quite some time, the 300 billion tokens was stuck, like it became the standard, and a lot of people were using that. But then fast-forward less than two years later came the second iteration of the scaling laws, the Chinchilla paper, where the, the recommendation was slightly different. It was like we were not paying enough attention to the size of the data. Actually, you should now think of the data and the size as equally … and the size of the model … as equally important. So if you were to invest in X more, you should just split them evenly between bigger models and more data. And that was quite a change, and it actually got all the people to pay more attention to the data. But then fast-forward one more year, in 2023—and maybe pioneered mostly with the Llama work from Meta and then many, many others followed suit—we started finding out that we don’t have to operate at this optimal point. We can actually push for more data and the model will continue to improve. And that’s interesting because when you are thinking about the training versus the deployment or the inference parts of the life cycle of the model, they are actually very different. When you are training the model, you would like the model to learn to generalize as best as possible. When you are actually using the model, the size of the model becomes a huge difference. I actually recall an interesting quote from a 2015 paper by Geoff Hinton and others. That’s the paper that introduced the idea of distillation for neural networks. 
Distillation was there before from the work of, of Rich Caruana, our colleague here at Microsoft, and others. But in 2015, there was this paper specifically discussing distilling models for neural network models, and one of the motivating sentences at the very beginning of the paper was basically talking about insects and how insects would have different forms throughout their life cycles. At the beginning of their life, they are optimized for extracting energy and nutrients from the environment, and then later on, in their adult form, they have very different forms as optimized for flying and traveling and reproduction and so on and so forth. So that, that analogy is very interesting here because like you can think about the same not just in the context of distillation, as this paper was describing, but just for pretraining the models in general. Yes, the optimal point might have been to equally split your compute between the data and the size, but actually going more towards having more and more data actually is beneficial. As long as the model is getting better, it will give you a lot more benefit because you have a smaller model to use during the inference time. And we would see that with the latest iteration of the Llama models, we are now seeing models as small as 7 billion parameters being trained on 1 to 2 trillion tokens of data, which was unheard before.
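
To make the compute-allocation tradeoff Awadallah describes concrete, here is a rough, illustrative calculation. It assumes the common rule of thumb that training cost is about 6 FLOPs per parameter per token and the Chinchilla-style heuristic of roughly 20 training tokens per parameter; both are approximations from the scaling-law literature, not exact figures from this conversation.

```python
# A rough, illustrative calculation only. The "6 * N * D" compute estimate and the
# "split compute evenly" (~20 tokens per parameter) heuristic are approximations.

def training_flops(n_params: float, n_tokens: float) -> float:
    # Common rule of thumb: ~6 FLOPs per parameter per training token.
    return 6.0 * n_params * n_tokens

def chinchilla_style_allocation(compute_budget_flops: float, tokens_per_param: float = 20.0):
    """Pick a model size and token count that use the budget while keeping
    tokens/params near a fixed ratio (roughly ~20 in the Chinchilla paper)."""
    n_params = (compute_budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    budget = 1e23  # FLOPs, hypothetical
    n, d = chinchilla_style_allocation(budget)
    print(f"~{n / 1e9:.0f}B params trained on ~{d / 1e9:.0f}B tokens")
    # Llama-style alternative: fix a smaller model and spend the same budget on more data.
    small = 7e9
    print(f"7B params could instead see ~{budget / (6.0 * small) / 1e9:.0f}B tokens")
```

Under the same budget, a fixed 7-billion-parameter model can instead be trained on far more tokens, which is the Llama-style tradeoff Awadallah points to: a less compute-optimal training run, but a much cheaper model at inference time.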

LLORENS: Let’s talk a bit more about evaluating performance. Of course, the neural scaling laws that you referenced earlier really predict how the performance of a model on the task of next word prediction will improve with the size of the model or the size of the data. But of course, that’s not what we really care about. What we’re really after is better performance on any number of downstream tasks like reasoning, document summarization, or even writing fiction. How do we predict and measure performance in that broader sense? 

AWADALLAH: Yeah, that’s a very good question. And that’s another area where our understanding of evaluating generative models in general has been challenged quite a bit over the last year in particular. And I think one of the areas that I would recommend to spend a lot of time working on right now is figuring out a better strategy around evaluating generative language models. We … this field has been very benchmark driven for many, many years, and we have been seeing a lot of very well-established benchmarks that have been helping the community in general make a lot of progress. We have seen leaderboards like GLUE and SuperGLUE, and many, many others play a very important role in the development of pretrained models. But over the last year, there has been a lot of changes. One is that these benchmarks are being saturated really, really quickly. There was … this paper that I was reading a few, reading a few months back talking about how we went from times where benchmarks like Switchboard and MNIST for speech and image processing lasted for 10 to 20 years before they get saturated to times where things like SQuAD and GLUE and SuperGLUE are getting saturated in a year or two to now where many of the benchmarks just get like maybe two or three submissions and that’s it. It gets saturated very quickly after that. BIG-Bench is a prime example of that, where it was like a collaborative effort, over 400 people coming together from many different institutions designing, a benchmark to challenge language models. And then came GPT-4, and we’re seeing that it’s doing really, really, really well, even in like zero-shot and, and, and few-shot settings, where the tasks are completely new to the models. So the model out of the box is basically solving a lot of the benchmarks that we have. That’s an artifact of the significant progress that we have been seeing and the speed of that progress, but it’s actually making that, that answer to that question even harder. But there’s another thing that’s making it even harder is that the benchmarks are giving us a much more limited view of the actual capabilities of these models compared to what they can actually do, especially models like GPT-4. The, the breadth of capabilities of the model is beyond what we had benchmarks to measure it with. And we have seen once it was released, then once people started interacting with it, there are so many experiences and so many efforts just thinking about what can we do with that model. Now we figured out that it can do this new task; it can do that new task. I can use it in this way that I didn’t think about before. So that expansion in the surface of capabilities of the model is making the question of evaluating them even, even harder and, and moving forward, I think this would be one of the most interesting areas to really spend time on.

LLORENS: Why don’t we talk a bit about a paper that you recently published with some Microsoft Research colleagues called “Orca: Progressive Learning from Complex Explanation Traces of GPT-4.” And there’s a couple of, of concepts that we’ve been talking about that I want to pull through to, to a discussion around, around this work. One is the idea of quality of data. And so it would be great to hear, you know, some of the intuitions around … yeah, what, what drove you to focus on data quality versus, you know, number of parameters or number of tokens? And then we can also come back to this notion of benchmarks, because to publish, you have to pick some benchmarks, right? [LAUGHS] So, so first, why don’t we talk about the intuitions behind this paper and what you did there, and then I’d love to understand how you thought through the process of picking benchmarks to evaluate these models. 

AWADALLAH: Yeah, so, so in this paper, we were basically thinking about like … there has been a lot of work actually on thinking about how do we have a very powerful model and use it to improve a less powerful model. This is not a new concept. It has been there forever, and I mentioned the Hinton et al. paper on distillation, one of the pioneer papers applying that to neural networks. And over time, this field actually continued getting better and better. And the way the large, more powerful models were used just continued evolving. So people were using the logits generated by the model and then maybe looking at intermediate layers and their output, maybe looking at attention maps and trying to map that between the models and coming up with more and more complex ways of distilling information from the powerful model to improve a less powerful model. But with models like GPT-4, we were thinking that GPT-4 is so good that you can actually start thinking about different ways of having a model teaching another model. And in that particular case, the idea was, can we actually have the powerful model explain in step by step how to do the task, and can we actually have a smaller model learn from that? And how far can this actually help the smaller one? A big part of this has to do with the data quality but also with the teacher model quality. You wouldn’t be able to … and this gets us into the whole notion of synthesized data and the role of synthesized data can play in making models better. Models like GPT-4, the level of capability where you could actually generate a lot of synthetic data at a very high quality comparable in some cases to what you’d get from a human, better in some cases than what you could get from a human. And even more than that, when you are working with a model like GPT-4, there has been a lot of work over the last few months demonstrating that you can even get the model to be a lot better by having the model reflect on what it’s doing, having the model critique what it’s doing and try to come up with even corrections and improvements to its own generation. And once you have this going, you see that you can actually create very high-quality synthetic data in so many ways, mostly because of the quality of the model but also because of like these different ways of generating the data on top of the model. And then it was really an experiment of how far can another model learn from these models. And by the way—and there is … we’re seeing some work like that, as well—it doesn’t even have to be a different model. It can be the same model improving itself. It can be the same model giving feedback to itself. That coincided with actually us having, having … we have been spending a lot of time thinking about this idea of learning from feedback or like continual improvement. How can we take a language model and continue to improve it based on interaction, based on feedback? So we started connecting these two concepts and basically thinking of it like the powerful model is just giving feedback to our much less powerful model and trying to help it improve across certain dimensions. And that’s where that line of work started. And what we were finding out is that you can actually have the more powerful model teach a smaller model. It would have definitely much narrower capabilities than the bigger model because like by virtue of this training cycle, you are just focused on teaching it particular concepts. You cannot teach it everything that the larger model can do. 
But also because this is another example of this like post-training step, like this model has already been pretrained language model and it’s always limited by the basic capabilities that it has. So, yes, the large language model can teach it a little bit more, but it will always be limited by that.
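
The data-construction idea Awadallah describes can be sketched in a few lines: given an existing question and answer, ask a strong teacher model for a step-by-step explanation, then store the trace as a fine-tuning target for the smaller student. The function below is a hypothetical illustration; the prompt wording, message format, and the `call_teacher_model` helper are assumptions, not the actual Orca pipeline.

```python
# Illustrative sketch: building explanation-trace training data from an existing
# (question, answer) pair. `call_teacher_model` is a hypothetical stand-in for an
# API call to a strong model such as GPT-4; the prompts are not the Orca prompts.

import json

SYSTEM_INSTRUCTION = (
    "You are a helpful assistant. Think step by step and justify your answer."
)

def build_training_example(question: str, reference_answer: str, call_teacher_model) -> dict:
    # Ask the teacher to explain, step by step, how to arrive at the known answer.
    teacher_prompt = (
        f"{question}\n\n"
        f"The correct answer is: {reference_answer}\n"
        "Explain, step by step, how to reach this answer."
    )
    explanation = call_teacher_model(system=SYSTEM_INSTRUCTION, user=teacher_prompt)
    # The student is later fine-tuned to produce the explanation and answer
    # given only the system instruction and the question.
    return {
        "system": SYSTEM_INSTRUCTION,
        "user": question,
        "target": f"{explanation}\nFinal answer: {reference_answer}",
    }

if __name__ == "__main__":
    fake_teacher = lambda system, user: "Step 1: subtract 2 from both sides. Step 2: divide by 3."
    example = build_training_example("If 3x + 2 = 11, what is x?", "x = 3", fake_teacher)
    print(json.dumps(example, indent=2))
```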

LLORENS: Now you mentioned … you’ve sketched out now the idea of using a powerful general-purpose model through some process of distillation to train a, a smaller, more special, more specialized model. And in the paper, you, you and your colleagues offer a number of case studies. So can you, can you pick one? Give, give us, you know, give us an example of a specialized domain and the way that you utilize GPT-4 to accomplish this training and what the performance outcome was. 

AWADALLAH: Yeah, actually, when we were working on this paper, the team was thinking that what capability should we try to focus on to, to demonstrate that the small model can improve from, from the guidance of the much more powerful model. And we were thinking it would be very cool if we can demonstrate that the small model can get better at reasoning, because reasoning has been one of the capabilities that have been clearly emerging with larger and larger models, and models like GPT-4 demonstrate the level of reasoning that we have never seen with any of our systems before. So we were thinking can we … can, can GPT-4 help actually get the smaller model to be better at reasoning. And that had a lot of implications on the selection of what datasets to use for, for creating the synthetic data. In this particular paper, by the way, we’re not, we’re not using GPT-4 to answer the questions. We already have the questions and the answers. We are just asking GPT-4 to explain it step by step. This is similar to what we have been seeing with chain-of-thought reasoning, chain-of-thought prompting, and other different prompting techniques showing that if you actually push the language model to go step by step, it can actually do a lot better. So we are basically saying, can we have these explanations and step-by-step traces and have them help the smaller language model learn to reason a little bit better. And because of that, actually—and this goes back to your earlier questions about benchmarks—in this particular paper, we chose two main benchmarks. There were more than two, but like the two main benchmarks were BIG-Bench Hard and AGIEval. BIG-Bench Hard is a 23-task subset of BIG-Bench that we were just talking about earlier, and a lot of the tasks are very heavy on reasoning. AGIEval is a set of questions that are SAT-, LSAT-, GRE-, and GMAT-type of questions. They are also very heavy on reasoning. The benchmarks were selected to highlight the reasoning improvement and the reasoning capability of the model. And we had, we had a bunch of use cases there, and you would see one of the common themes there is that there is actually … even before the use cases, if you look at the, the results, the reasoning ability as measured by these two benchmarks at least of the base model significantly improved. Still far behind the teacher. The teacher is much, much more powerful and there’s no real comparison, but still the fact that collecting synthetic data from a model like GPT-4 explaining reasoning steps could help a much smaller model get better at reasoning and get better by that magnitude was a very interesting finding. We were, we were quite a bit surprised, actually, by the results. We thought that it will improve the model reasoning abilities, but it actually improved it beyond what we expected. And again, this goes back to like imagine if we were … if we wanted to do that without a model like GPT-4, that would entail having humans generate explanations for a very large number of tasks and make sure that these explanations remain faithful and align with the answers of the question. It would have been a very hard task, and the type of annotator that you would like to recruit in order to do that, it would have been … even made it harder and slower. But having, having the capabilities of a model like GPT-4 is really what made it possible to do that.

LLORENS: You’ve, you’ve outlined now, you know, your experiments around using GPT-4 to train a smaller model, but earlier, you also alluded to a pretty compelling idea that maybe even a large, powerful model could, I guess, self-improve by generate, you know, performing a generation, critiquing itself, and then somehow guiding, you know, the parameter weights in a way that, that was informed by the critique. Is that, was that part of these experiments, or what … or, or is that … does that work? [LAUGHS] Have, have we … do we have experimental evidence of that?  

AWADALLAH: Yeah, I think, I think that’s a very good question. That was really how we started. That was really what we were aiming and still trying to do. The value … we started off by asking that question: can we actually have a model self-improve, self-improve itself? From an experimental perspective, it was much easier to have a powerful model help a smaller model improve. But self-improvement is really what we, what got us excited about this direction from the beginning. There has been evidence from other work showing up over the last short period actually showing that this is actually a very promising direction, too. For example, one of the very interesting findings about these powerful models—I think that the term frontier models is being used to refer to them now—is that they have a very good ability at critiquing and verifying output. And sometimes that’s even better than their ability at solving the task. So you can basically go to GPT-4 and ask it to solve a coding question. Write a Python function to do something. And then you can go again to GPT-4 and ask it to look back at that code and see if there are any bugs in there. And surprisingly, it would identify bugs in its own generation with a very high quality. And then you can go back to GPT-4 again and ask it to improve its own generation and fix the bugs. And it does that. So we actually have a couple of experiments with that. One of them in a toolkit called LIDA that one of my colleagues here, Victor [Dibia], has been working on for some time. LIDA is a tool for visualizations, and you basically go there and submit a query. The query would be, say, create a graph that shows the trends of stocks over the last year. And it’ll actually go to the data basically, engineer Python code. The Python code, when compiled and executed, would generate a visualization. But then we were finding out that we don’t have to stop there. We can actually ask GPT-4 again to go back to that visualization and critique it, and it doesn’t have to be open critique. We can define the dimensions that we would like to improve on and ask GPT-4 to critique and provide feedback across these dimensions. Like it could be the readability of the chart. It could be, is the type of chart the best fit for the data? And surprisingly it does that quite well. And then that opens the door to so many interesting experiences where you can, after coming up with the initial answer, you can actually suggest some of these improvements to a human. Or maybe if you are confident enough, you just go ahead and apply them even without involving the human in the loop and you actually get a lot better. There was another experiment like that where another colleague of mine has been working on a library called AutoGen, which basically helps with these iterative loops on top of language models, as well as figuring out values of hyperparameters and so on and so forth. And the experiments were very similar. There was a notion there of like having a separate agent that the team refers to as a user proxy agent, and that agent basically has a criteria of what the user is trying to do. And it keeps asking GPT-4 to critique the output and improve the output up until this criteria is met. And we see that we get much, much better value with using GPT-4 this way. That cycle is expensive, though, because you have to iterate and go back multiple times. 
The whole idea of self-improvement is basically, can we literally distill that cycle into the model itself again so that as the model is being used and being asked to maybe critique and provide feedback or maybe also getting some critique and feedback from the human user, can we use that data to continue to improve the model itself?
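
The generate, critique, and refine loop described here can be sketched generically as follows. The `call_model` function stands in for any LLM API call, and the prompts, the APPROVED convention, and the stopping criterion are assumptions made for illustration; this is not the actual LIDA or AutoGen implementation.

```python
# Illustrative sketch of a generate -> critique -> refine loop, in the spirit of the
# iteration described above. `call_model` is a hypothetical function wrapping an LLM API.

def generate_with_critique(task: str, call_model, dimensions=("correctness", "readability"),
                           max_rounds: int = 3) -> str:
    # Initial generation.
    draft = call_model(f"Complete the following task:\n{task}")
    for _ in range(max_rounds):
        # Ask the model to critique its own output along chosen dimensions.
        critique = call_model(
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            f"Critique this draft along these dimensions: {', '.join(dimensions)}. "
            "If it fully satisfies the task, reply with exactly APPROVED."
        )
        if critique.strip() == "APPROVED":
            break  # the 'user proxy' criterion is met; stop iterating
        # Revise the draft using the critique as feedback.
        draft = call_model(
            f"Task: {task}\n\nDraft:\n{draft}\n\nFeedback:\n{critique}\n\n"
            "Revise the draft to address the feedback."
        )
    return draft
```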

LLORENS: It is pretty fascinating that these models can be better at evaluating a candidate solution to a task than generating a novel solution to the task. On the other hand, maybe it’s not so surprising. One of the things that’s hard about or one of the things that can be challenging is this idea of, you know, prompt engineering, by which I’m trying to specify a task for the, for the model to solve or for the AI system to solve. But if you think about it, the best I can do at specifying the task is to actually try my best to complete the task. I’ve now specified the task to the greatest extent that I possibly can. So the machine kind of has my best task specification. With that, that information, now it becomes a kind of maybe even in some cases a superhuman evaluator. It’s doing better than I can at evaluating my own work. So that’s kind of an interesting twist there. Back, you know, back to the Orca paper, one of the things that you wouldn’t have seen … you know, earlier in the talk, you, you harkened back to say a decade ago, when benchmarks lasted a long, a longer time, one of the things that we would not necessarily have seen in a paper from that era, you know, say the CNN era of AI, is, is, a, is a safety evaluation, you know, for a specialized object recognition model. But in the Orca paper, we do have a safety evaluation. Can you, you talk a little bit about the thought process behind the particular evaluations that you did conduct and, and why these are necessary in the first place in this era of AI? 

AWADALLAH: Yeah, I think in this era of AI, this is one of the most important parts of the development cycle of any LLM—large or small. And as we were just describing, we are discovering abilities of these models as we go. So just as there will be a lot of emerging capabilities that are surprising and useful and interesting, this would also open the door to a lot of misuse. And safety evaluation is at least … is the least we can do in order to make sure that we understand how, how can this model be used and what are some of the possible harms or the possible misuses that can come from using these models? So I think, I think this is, this is now definitely should be a standard for any work on language models. And here we are not, we’re not really training a language model from scratch. This is more of like a post-training or a fine-tuning of an existing language model. But even for, for, for research like that, I think safety evaluation should be a critical component of that. And, yes, we did some, and we, we, we actually have a couple of paragraphs in the paper where we say we need to do a lot more, and we are doing a lot more of that right now. I think … what we did in the paper that … we focused on only two dimensions: truthfulness and toxicity. And we were basically trying to make sure that we are trying to see the additional fine-tuning and training that we do, is it improving the model across these dimensions or is it not? And the good news that it was actually improving it in both dimensions, at least with the benchmarks that we have tried. I, I think it was interesting that actually on the, on the toxicity aspect in particular, we found that this particular type of post-training is actually improving the base model in terms of its tendency to generate toxic or biased content. But I think a big part of that is that we, we’re using Azure APIs in part of the data cleaning and data processing, and Azure has invested a lot of time and effort in making sure that we have a lot of tools and classifiers for identifying unsafe content, so the training data, the post-training data, benefited from that, which ended up helping the model, as well. But to your point, I think this is a critical component that should go into any work related to pretraining or post-training or even fine-tuning in many cases. And we did some in the paper, but I think, I think there’s a lot more to be done there. 

LLORENS: Can you talk a little bit more about post-training as distinct from pretraining? How that, how that process has evolved, and, and where you see it going from here?

AWADALLAH: I, I, I see a ton of potential and, and opportunity there actually. And pretraining is the traditional language model training as we have always done it. Surprisingly, actually, if you go back to … like I, I was … in, in one of the talks, I was showing like a 20-years-ago paper by Bengio et al. doing the language model training with neural networks, and we’re still training neural networks the same way, autoregressive next word prediction. Very different architecture, a lot of detail that goes into the training process, but we are still training them as a language model to predict the next word. In a big departure from that—and it started with the InstructGPT paper and then a lot of other work had followed—there was this introduction of other steps of the language model training process. The first step is instruction tuning, which is showing the model prompts and responses and asking it to … and training the model on these prompts and responses. Often these responses are originated by a human. So you are not just training the model to learn the language model criteria only anymore, you are actually training it to respond the way a human would want it to respond. And this was very interesting because you could see that the language models are really very good text-completion engines. And at some time actually, a lot of folks were working on framing the task such that it looks like this text completion. So if you are doing classification, you would basically list your input and then ask a question where the completion of that question would be the class that you are looking for. But then the community started figuring out that you can actually introduce this additional step of instruction tuning, where now out of all the possible ways of completing a sentence like if I’m asking a question, maybe listing other similar questions is a very good way of completion. Maybe repeating that question with more details is another way of completion, or answering the question is a third way of completion, and all of them could be highly probable. The instruction tuning is basically teaching the model the way to respond, and a big part of that has to do with safety, as well, because you could demonstrate how we want the model to be helpful, how we want the model to be harmless, in this instruction-tuning step. But the instruction tuning step is only showing the model what to do. It’s not showing it what not to do. And this is where the RLHF step came in, the reinforcement learning from human feedback. What’s happening really is that instead of showing the model a single answer, we’re showing them a little more than one answer. And we are basically showing them only a preference. We’re basically telling the model Answer A is better than Answer B. It could be better for many reasons. We are just encoding our criteria of better into these annotations, and we are training a reward model first whose job is, given any response, to assign a scalar value to it indicating how good it is. And then we are doing the RLHF training loop, where the reward model is used to update the original model such that it learns what are better responses or worse responses and tries to align more with the better responses. The post-training, as a concept, is very related and, and sometimes referred to also as alignment, because the way post-training has been mostly used is to align the model to human values, whether this be being helpful or being harmless. 
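
As a minimal sketch of the preference-based reward modeling step Awadallah describes, the PyTorch snippet below trains a scalar reward head with a pairwise (Bradley–Terry style) loss so that preferred responses score higher than dispreferred ones. It assumes responses have already been encoded into fixed-size vectors; real reward models place this head on top of a full pretrained transformer, and the subsequent RLHF policy-update loop is not shown.

```python
# Minimal sketch of preference-based reward modeling; illustrative, not the exact
# training setup used for any particular model.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # In practice this head sits on top of a pretrained transformer encoder;
        # here we assume the response has already been encoded into a vector.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_model: RewardModel,
                    chosen_emb: torch.Tensor,
                    rejected_emb: torch.Tensor) -> torch.Tensor:
    # Train the model so the preferred ("chosen") response scores higher than
    # the dispreferred ("rejected") one: -log sigmoid(r_chosen - r_rejected).
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    rm = RewardModel()
    chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
    loss = preference_loss(rm, chosen, rejected)
    loss.backward()
    print(float(loss))
```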

LLORENS: Ahmed, as we, as we wrap up here, typically, I would ask something like, you know, what’s next for your research, and maybe you can tell us a little bit about what’s next for your research. [LAUGHS] But, but before you do that, I’d love to understand what, what key limitation you see in the current era of AI that you would … would be on your wish list, right, as something that maybe you and your team or maybe the broader field has accomplished in the next five years. What, what new capabilities would be on your wish list for AI over the next five years? 

AWADALLAH: Yeah, given, given the progress, I would say even much shorter than five years. 

LLORENS: Five months. [LAUGHS]

AWADALLAH: But I would say … actually the answer to the two questions are, are very similar. Actually, I think where we are with these models right now is much better than many people anticipated, and we are able to solve problems that we didn’t think we could solve before. One of the key capabilities that I would like to see getting better over the next, few months to a few years—hopefully more toward few months—is the ability of the model to continue to learn. This like continual learning loop where the model is learning as it interacts with the humans. The model is reflecting on past experiences and getting better as we use it, and maybe also getting better in an adaptive way. Like we sometimes use this term adaptive alignment, where we are basically saying we want the model actually to continue to align and continue to align in the way it behaves across multiple dimensions. Like maybe the model will get more personal as I use it, and it will start acting more and, and behaving more in a way I want it to be. Or maybe I am developing a particular application, and for that application, I want the model to be a lot more creative or I want the model to be a lot more grounded. We can do some of that with prompting right now, but I think having more progress along this notion of continual learning, lifelong learning … this has been a heavily studied subject in machine learning in general and has been the holy grail of machine learning for many, many, many years. Having a model that’s able to continue to learn, continue to adapt, gets better every time you use it, so just when I use it today and I interact with it and it could learn about my preferences, and next time along, I don’t have to state these preferences again. Or maybe when it makes a mistake and I provide a feedback, next time along, it already knows that it had made that mistake and it already gives me a better solution.  

LLORENS: That should have been the last question. But I think I have one more. That is, how will we know that the models are getting better at that, right? That’s a metric that’s sort of driven by interaction versus, you know, static evaluation. So how do you, how do you measure progress in adaptive alignment that way?

AWADALLAH: I think, I think that’s a very interesting point. And this actually ties this back with two concepts that we brought up earlier: the evaluation side and the safety side. Because from the evaluation perspective, I do think we need to move beyond static benchmark evaluation to a more dynamic human-in-the-loop evaluation, and there’s already been attempts and progress at that just over the past few months, and there is still a lot more to do there. The evaluation criteria will not also be universal. Like there will be a lot … like a lot of people talk about the, let’s say, fabrications—the models making up information, facts. Well, if I am using the model to help me write fictional stories, like this becomes a feature; it’s not a bug. But if I’m using the model to ask questions, especially in the high-stakes scenario, it becomes a very big problem. So having a way of evaluating these models that are dynamic, that are human-in-the-loop, that are adaptive, that aligns with objectives of how we are using the models will be a very important research area, and that ties back to the safety angles, as well, because if I … if we are barely … we’re, we’re …  everybody is working really hard to try to understand the safety of the models after the models are being trained and they are fixed. But what if the models continue to improve? What if it’s continuing to learn? What if it’s learning things from me that are different than what it’s learning from you? Then that notion of alignment and safety and evaluation of that becomes also a very open and interesting question.  

LLORENS: Well, look, I love the ambition there, Ahmed, and thanks for a fascinating discussion. 

AWADALLAH: Thank you so much, Ashley.

Research Focus: Week of September 11, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

PolySem: Efficient Polyglot Analytics on Semantic Data

Data scientists and data engineers spend a large portion of their time trying to understand, clean and transform their data before they can even start performing meaningful analysis. Most database vendors provide business intelligence (BI) tools as an efficient and user-friendly platform for customers to perform data cleaning, preparation and linking tasks to obtain actionable semantic data. However, customers are increasingly interested in querying semantic data through various modalities including SQL, imperative programming languages such as Python, and natural language queries. Today, customers are limited to using either the visual interfaces provided by these tools or languages that are specific to the particular tool.

In a new paper: PolySem: Efficient Polyglot Analytics on Semantic Data, researchers from Microsoft propose techniques to enable the execution of user queries expressed in different modalities on semantic datasets without having to export data out of the BI system. Their techniques include automatic translation of user queries into a language-agnostic representation of data processing operations, and subsequently into the specific query language that is amenable to execution on the BI engine. Evaluation results on BI and decision support benchmarks suggest significant improvements in query performance compared to other popular data processing engines.
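
As a purely illustrative sketch of the general idea (a language-agnostic representation of data-processing operations compiled into an engine-specific query), consider the toy intermediate representation below. The `Query` structure and the SQL generation are assumptions made for this example; the paper does not describe PolySem’s actual representation in these terms.

```python
# Purely illustrative: a toy language-agnostic representation of data-processing
# operations and one possible compilation to SQL. Not PolySem's actual IR.

from dataclasses import dataclass, field

@dataclass
class Query:
    table: str
    filters: list = field(default_factory=list)     # e.g. [("region", "=", "'EU'")]
    group_by: list = field(default_factory=list)    # e.g. ["product"]
    aggregates: list = field(default_factory=list)  # e.g. [("SUM", "revenue")]

def to_sql(q: Query) -> str:
    select = q.group_by + [f"{fn}({col}) AS {fn.lower()}_{col}" for fn, col in q.aggregates]
    sql = f"SELECT {', '.join(select) or '*'} FROM {q.table}"
    if q.filters:
        sql += " WHERE " + " AND ".join(f"{c} {op} {v}" for c, op, v in q.filters)
    if q.group_by:
        sql += " GROUP BY " + ", ".join(q.group_by)
    return sql

# The same Query object could also be produced from a Python dataframe expression
# or a natural-language question, then executed directly on the BI engine.
print(to_sql(Query("sales", [("region", "=", "'EU'")], ["product"], [("SUM", "revenue")])))
```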

NEW RESOURCE

Generative retrieval for conversational question answering

The growth of conversational agents, including voice assistants and chatbots, has led to a shift toward dialogue-based interfaces for information-seeking activities. This has spurred the development of conversational question answering (QA) systems. Effective passage retrieval, which excludes irrelevant data from scanned documents, is crucial but challenging for such systems because of the ambiguity of questions. Current methods rely on a dual-encoder architecture to embed contextualized vectors of questions in conversations. However, this architecture is limited by its embedding bottleneck and its reliance on the dot-product operation.

To alleviate these limitations, researchers from Microsoft propose generative retrieval for conversational QA (GCoQA). GCoQA assigns distinctive identifiers to passages and retrieves passages by generating their identifiers token by token via an encoder–decoder architecture. In this generative paradigm, GCoQA eliminates the need for a vector-style index and can attend to crucial tokens of the conversation context at every decoding step. Experiments on three public datasets containing about twenty million passages show that GCoQA achieves relative improvements of +13.6% in passage retrieval and +42.9% in document retrieval. GCoQA also reduces memory usage and improves inference speed.
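
A common way to keep such token-by-token generation valid is to constrain decoding with a prefix trie over the passage identifiers, so the model can only emit sequences that correspond to real identifiers. The sketch below illustrates that generic mechanism; it is an assumption for explanatory purposes, not GCoQA’s exact implementation.

```python
# Illustrative sketch of constrained decoding over passage identifiers: a prefix
# trie built from identifier token sequences restricts which token may be generated
# next, so the decoder can only produce valid identifiers.

def build_trie(identifier_token_ids: list[list[int]]) -> dict:
    trie: dict = {}
    for tokens in identifier_token_ids:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie: dict, prefix: list[int]) -> list[int]:
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix does not correspond to any identifier
    return list(node.keys())

# Example: three passage identifiers, already tokenized.
trie = build_trie([[7, 2, 9], [7, 2, 4], [3, 1]])
print(allowed_next_tokens(trie, [7, 2]))  # [9, 4] -> the decoder masks all other tokens
```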


NEW RESOURCE

BatteryML: An open-source tool for machine learning on battery degradation

In recent years, lithium-ion batteries have become the cornerstone of energy storage solutions, owing to their high energy density, long cycle life, and relatively low self-discharge. They have found widespread applications across various industries, including electric vehicles, consumer electronics, and renewable energy systems. Despite these advantages, lithium-ion batteries face challenges related to capacity degradation and performance optimization, which have become critical areas of focus in battery research.

Capacity degradation is a complex process influenced by various factors such as temperature, charge-discharge rate, and state of charge. Understanding and mitigating these factors is crucial for enhancing the performance and longevity of lithium-ion batteries. This has led to the development of advanced battery management systems and the application of machine learning techniques to improve prediction accuracy and optimize battery performance.

To address these challenges, researchers from Microsoft have released BatteryML, a comprehensive open-source tool designed for machine learning researchers, battery scientists, and materials researchers interested in battery performance prediction and analysis. BatteryML aims to address the challenges of capacity degradation by leveraging machine learning methods to improve various aspects of battery performance, such as capacity fade modeling, state of health prediction, and state of charge estimation.
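
For readers new to the problem, the snippet below is a generic, self-contained illustration of capacity fade modeling on synthetic data: fit a simple degradation curve and extrapolate to an end-of-life threshold. It does not use BatteryML’s API (see the repository for the tool’s actual interfaces), and real workflows rely on richer features, such as temperature and charge-discharge rate, and stronger models.

```python
# Generic illustration of capacity-fade modeling on synthetic data (not BatteryML's API).

import numpy as np

rng = np.random.default_rng(0)
nominal_ah = 1.1
cycles = np.arange(1, 301)
# Synthetic measurements: slow linear fade plus noise, for illustration only.
capacity = nominal_ah - 4e-4 * cycles + rng.normal(0, 0.002, cycles.size)

# Fit a degree-1 polynomial (linear fade model) to the observed cycles.
slope, intercept = np.polyfit(cycles, capacity, deg=1)

# Predict the cycle at which capacity crosses the 80% end-of-life threshold.
eol_threshold = 0.8 * nominal_ah
predicted_eol_cycle = (eol_threshold - intercept) / slope
print(f"Predicted end-of-life around cycle {predicted_eol_cycle:.0f}")
```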

Abstracts: September 13, 2023

Episode 148 | September 13, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.  

In the inaugural episode of the series, Dr. Ava Amini and Dr. Kevin K. Yang, both Senior Researchers with Microsoft Health Futures, join host Dr. Gretchen Huizinga to discuss “Protein generation with evolutionary diffusion: Sequence is all you need.” The paper introduces EvoDiff, a suite of models that leverages evolutionary-scale protein data to help design novel proteins more efficiently. Improved protein engineering has the potential to help create new vaccines to prevent disease and new ways to recycle plastics.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract!—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Ava Amini and Dr. Kevin Yang, both senior researchers at Microsoft Health Futures. Ava and Kevin are co-authors of a paper titled “Protein generation with evolutionary diffusion: Sequence is all you need,” and a preprint of the paper is available now on bioRxiv. Ava and Kevin, thanks for joining us on Abstracts.

KEVIN YANG: Thanks for having us. 

AVA AMINI: Thank you so much. 

HUIZINGA: So, Kevin, in just a couple sentences, tell us what problem this research addresses and why people should care.


YANG: Yeah, so proteins are this really big, important family of biomolecules, and they’re responsible for a lot of cellular processes. For example, hemoglobin carries oxygen in your blood, and insulin regulates your blood sugar levels. And people are interested in generating new proteins to do things that people care about—not necessarily in our bodies, but we’re interested in proteins as industrial enzymes so for catalysis and to make new chemicals or for therapeutics to make new drugs. And as a step towards this goal, we train a suite of models that we call EvoDiff that learns to generate realistic but novel proteins. So proteins do a lot of useful things in nature, but we can really expand their repertoire to do things that people care about but that nature may not really care about. One really good historical example of this is that most of our modern laundry detergents contain enzymes that break down things that stain your clothes. And these enzymes were based on natural proteins, but natural proteins don’t work under high heat. They don’t work in detergent. So somebody engineered those to work in the conditions of our washing machine. And they work really well nowadays. Looking forward, we look at some of the challenges facing our world, such as sustainability. So some really big things people are working on now are things like enzymes that can break down plastic and help us recycle plastic or enzymes that can perform photosynthesis more efficiently. And then on the other side, there’s therapeutics, and an obvious example there is vaccine design. So designing vaccines quickly and safely for new diseases as they emerge.  

HUIZINGA: Ava, how does your approach build on or differ from what’s been done previously in this field? 

AMINI: Yeah, so we call our approach EvoDiff, and EvoDiff has two components. The first, Evo, refers to evolutionary, and the second, Diff, refers to this notion of diffusion. And the two things that make our approach cool and powerful is the fact that we are leveraging data about proteins that is at an evolutionary scale in terms of the size and the diversity of the datasets about natural proteins that we use. And specifically, we use that data to build a type of AI model that is called a diffusion model. Now, for a little backstory on this, a few years ago, we in the AI community learned that we can do really well in generating brand-new images by taking natural images, adding small amounts of noise to them, corrupting them, and then training an AI model called a diffusion model to remove that noise. And so what we’ve done in this paper is that we have constructed and trained these diffusion models to do the same kind of process on protein data at evolutionary scale. 
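
To make the corruption-and-denoising idea concrete in the protein setting, the toy snippet below shows a masking-style forward process over an amino acid sequence; a denoising model would then be trained to reverse it. This is a simplified illustration only, and the mask token, noise schedule, and corruption scheme are assumptions rather than EvoDiff’s actual formulation.

```python
# Toy illustration of the forward (corruption) process for a discrete diffusion
# model over protein sequences; a model would be trained to reverse this corruption.

import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"

def corrupt(sequence: str, noise_level: float, rng: random.Random) -> str:
    """Mask each position independently with probability `noise_level`."""
    return "".join(MASK if rng.random() < noise_level else aa for aa in sequence)

rng = random.Random(0)
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
for t, noise in enumerate([0.15, 0.5, 0.9], start=1):
    print(f"t={t}: {corrupt(seq, noise, rng)}")
# Training pairs (corrupted sequence, original sequence) teach the denoiser to recover
# plausible amino acids at masked positions; generation then starts from a fully
# masked sequence and iteratively denoises it.
```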

HUIZINGA: Kevin, back to you, let’s go a little deeper on methodology. How did you do this?

YANG: Yeah, so we really wanted to do this in a protein sequence space. So in protein biology, you have sequences of amino acids. So that’s a series of amino acid monomers that form a chain, and then that chain folds oftentimes into a 3D structure. And function is usually mediated by that 3D structure. Unfortunately, it’s difficult and can be slow and expensive to obtain experimental structures for all these proteins. And so previous diffusion models of proteins have really focused on generating a three-dimensional structure. And then you can use some other method to find a sequence that will fold to that structure. But what we really wanted to do was generate proteins directly as sequences because it’s much easier to get sequences than it is to get structure. So there’s many, many more sequences out there than there are structures. And we know that deep learning methods scale really well as you increase the size and quality of the datasets they’re trained on. And so we … and by we, it’s me and Ava but also Nitya Thakkar, who was an undergraduate intern last summer with me and Ava, and then Sarah Alamdari, our data scientist, who also did a lot of the hands-on programming for this. And then we also got a lot of help from Rianne van den Berg, who is at AI4Science, and then Alex Lu and Nicolò Fusi, also here in New England. So we went and got these large, diverse, evolutionary datasets of protein sequences, and then we used a deep learning framework called PyTorch to train these diffusion models. And then we do a lot of computational experiments to see whether they do the things we want them to do, which Ava, I think, will talk about next. 

HUIZINGA: Right. Right. So, Ava, yes, what were your major findings?

AMINI: Yeah, the first question we really asked was, can our method, EvoDiff, generate proteins that are new, that are realistic, and that are diverse, meaning they’re not similar to proteins that exist in nature but still are realistic? And so what we found was that indeed, we can do this, and we can do this really well. In fact, the generated proteins from our method show a better coverage of the whole landscape of structural features, functional features, and features in sequence space that exist amongst proteins in nature. And so that was our first really exciting result, that we could generate proteins that were really of high quality using our method. The second thing we asked was, OK, now if we give some context to the model, a little bit of information, can we guide the generation to fulfill particular properties that we want to see in that protein? And so specifically here, we experimented with two types of experiments where first, we can give a part of the protein to the model, let’s say, a part of the protein that binds to another protein. And we hold that part constant and ask the model to generate the sequence around that. And we see that we can do really well on this task, as well. And why that’s important is because it means we can now design new proteins that meet some criteria that we, the users, want the protein to have. For example, the ability to bind to something else. And finally, the last really exciting result was … one point that we’ve talked about is why we want to do this generation in sequence space rather than structure—because structure is difficult, it’s expensive, and there are particular types of proteins that don’t actually end up folding into a final 3D structure. They’re what we call disordered. And these types of disordered proteins have really, really important roles in biology and in disease. And so what we show is that because we do our generation and design in protein sequence space, we can actually generate these types of disordered proteins that are completely inaccessible to methods that rely on using information about the protein’s 3D shape. 

HUIZINGA: So, Kevin, building on Ava’s description there of the structure and sequence space, how is your work significant in terms of real-world impact? 

YANG: Right, so there’s a lot of interest in designing or generating new proteins that do useful things as therapeutics or as industrial catalysts and for a lot of other things, as well. And what our work really does is it gives us a method that can reliably generate high-quality proteins directly in sequence space. And this is good because now we can leverage evolutionary-scale data to do this on any downstream protein engineering problem without relying on a structure-based design or structure-based data. And we’re hoping that this opens up a lot of possibilities for protein engineering, protein design, and we’re really excited about some new experimental work that we—and we hope others—will use to build on this method.

HUIZINGA: Are you guys the first to move into the evolutionary scale in this? Is that a differentiator for your work? 

YANG: So there have been a few other preprints or papers that talk about applying diffusion to protein sequences. The difference here is that, yes, like I said, we’re the first ones to do this at evolutionary scale. So people will also train these models on small sets of related protein sequences. For example, you might go look for an enzyme family and find all the sequences in nature of that family and train a model to generate new examples of that enzyme. But what we’re doing is we’re looking at data that’s from all different species and all different functional classes of proteins and giving us a model that is hopefully universal or as close to universal as we can get for protein sequence space. 

HUIZINGA: Wow. Ava, if there was one thing you want listeners to take away from this work, what would it be? 

AMINI: If there’s one thing to take away, I think it would be this idea that we can and should do protein generation over sequence because of the generality we’re able to achieve, the scale that we’re able to achieve, and the modularity that our diffusion framework gives us: the ability to do that and also to control how we design these proteins to meet specific functional goals. 

HUIZINGA: So, Kevin, to kind of wrap it up, I wonder if you could address what unanswered questions still remain, or unsolved problems in this area, and what’s next on your research agenda. 

YANG: So there’s kind of two directions we want to see here. One is, we want to test better ideas for conditioner models. And what I mean there is we want to feed in text or a desired chemical reaction or some other function directly and have it generate those things that will then go work in the lab. And that’s a really big step up from just generating sequences that work and are novel. And two is, in biology and in protein engineering, models are really good, but what really matters is, do things work in the lab? So we are actually looking to do some of our own experiments to see if the proteins we generate from EvoDiff work as desired in the lab. 

[MUSIC PLAYS]

HUIZINGA: Ava Amini and Kevin Yang, thanks so much for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about the paper, you can find a link at aka.ms/abstracts or you can find a preprint of the paper on bioRxiv. See you next time on Abstracts!

The post Abstracts: September 13, 2023 appeared first on Microsoft Research.

FP2: Fully In-Place Functional Programming provides memory reuse for pure functional programs 

This research paper was presented at the 28th ACM SIGPLAN International Conference on Functional Programming (opens in new tab) (ICFP), a premier forum for discussing design, implementations, principles, and uses of functional programming.


Functional programming languages offer a host of advantages, such as ensuring memory safety (opens in new tab) and eliminating arbitrary side effects. This enables systematic analysis and compositional program construction, facilitating development of scalable and complex software systems. However, a drawback of functional programming is its tendency to liberally allocate new memory. We believe this characteristic has impeded widespread adoption in performance-critical domains. How can we overcome this limitation and harness the benefits of functional programming while maintaining efficient memory usage? 

To illustrate the issue, let’s examine the well-known functional program to reverse a list in linear time using an accumulating parameter:

FP2: Fully In-Place Functional Programming - reverse list code in Koka

The reversal function is written in Koka (opens in new tab), a functional language developed at Microsoft that implements the techniques described in this blog post. Here, a list is either empty (Nil) or a non-empty Cons(head,tail) node that contains the first element as the head and the rest of the list as the tail.
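
The post shows this reversal function as an image. Reconstructed from the surrounding description, it is likely close to the following sketch (Nil, Cons, xs, xx, and acc are the names used in the text; the exact code in the post may differ):

```koka
// Reverse a list with an accumulating parameter.
fun reverse-acc( xs : list<a>, acc : list<a> ) : list<a>
  match xs
    Cons(x, xx) -> reverse-acc( xx, Cons(x, acc) )  // move the head onto the accumulator
    Nil         -> acc

fun reverse( xs : list<a> ) : list<a>
  reverse-acc( xs, Nil )
```

With the fip keyword introduced later in this post, the accumulating helper could additionally be declared as fip fun reverse-acc, so that the compiler statically checks the in-place behavior.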

In most functional languages, reversing a list this way allocates a fresh result list in the heap. Figure 1 shows the state partway through reversing a list of the integers from 1 to 10.

Figure 1: The list [1..5] has already been reversed into acc, but we still must reverse the list [6..10].

As the list xs is non-empty, we add its first element to our accumulating acc parameter before recursing on the rest of the list xx. As shown in Figure 2, this step allocates a new Cons cell but also leaves the Cons cell of xs to be garbage collected. This is rather wasteful.

Figure 2: The lists after one step of recursion. The top Cons cell on the left has become garbage, while the top Cons cell on the right is freshly allocated.

Fully in-place functional programming avoids allocation 

Recent developments have made it possible to avoid such allocations. In particular, by using a compiler-guided reference counting algorithm called Perceus, we can reuse objects in place whenever they are uniquely referenced at runtime. With such reuse, the reverse function can reverse a unique input list xs in place without allocating any fresh Cons nodes, essentially just reversing the tail pointers of xs in place. However, this form of reuse is dynamic, which makes it hard to predict whether it will actually happen at runtime.

In our paper, “FP2: Fully in-Place Functional Programming (opens in new tab),” which we’re presenting at ICFP 2023 (opens in new tab), we describe the new fip keyword. It statically checks that programs like the accumulating reverse function can execute in-place, that is, using constant stack space without needing any heap allocation as long as the arguments are unique.

Tree traversals and zippers

In fact, many familiar functions and algorithms satisfy our fully in-place criteria. For example, consider a binary tree with all the values at the leaves:

FP2: Fully In-Place Functional Programming - binary tree code in Koka
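
The post shows this type as an image; a plausible Koka declaration matching the description is sketched below (Bin and Leaf follow the constructor names used later in the post, while the field names are assumptions):

```koka
// A binary tree that stores values only at its leaves.
type tree<a>
  Leaf( value : a )
  Bin( left : tree<a>, right : tree<a> )
```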

Now, suppose that we want to navigate through this tree, moving up and down in search of a particular element. You might add parent pointers, but in a functional language, there is an alternative solution originally proposed by Gérard Huet known as the zipper (opens in new tab):

FP2: Fully In-Place Functional Programming - Zipper code in Koka
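
The zipper, too, is shown as an image in the post; the sketch below is a reconstruction. BinL and BinR are the constructor names discussed later, and each carries exactly two fields, which is what later allows a matched Bin node to be reused in place; the name of the empty context (Done here) is an assumption:

```koka
// Huet-style zipper: the path from the current focus back up to the root.
type zipper<a>
  Done                                      // at the root; nothing above us
  BinL( up : zipper<a>, right : tree<a> )   // we went left; `right` is the unvisited right subtree
  BinR( left : tree<a>, up : zipper<a> )    // we went right; `left` is the already visited left subtree
```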

The zipper stores subtrees along the path from the current node up to the root node. We can define operations on pairs consisting of this type of zipper and the current tree, enabling seamless movement through the tree. For example, the following function uses the zipper to move the focus to the left subtree:

FP2: Fully In-Place Functional Programming - focus on left subtree code in Koka
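
The pictured function is likely close to the following sketch, reconstructed from the description that follows. The paper’s actual version carries the fip annotation discussed later, and the exact way the tree/zipper pair is represented is an assumption:

```koka
// Move the focus to the left subtree (if any), pushing the
// unvisited right subtree onto the zipper.
fun left( t : tree<a>, z : zipper<a> ) : (tree<a>, zipper<a>)
  match t
    Bin(l, r) -> (l, BinL(z, r))   // descend left; remember the right subtree
    Leaf(_)   -> (t, z)            // a leaf has no left child: stay put
```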

Here, we move to the left subtree of the current node (if it exists) and extend the zipper accordingly. In his 1997 paper, Huet already observed that such zipper operations could be implemented in place:

Efficient destructive algorithms on binary trees may be programmed with these completely applicative primitives, which all use constant time, since they all reduce to local pointer manipulation.

In Koka, we can now make Huet’s intuition precise: the fip keyword guarantees that left executes in place. On closer examination, this might be surprising. While the list reversal example reused a Cons node, here it seems we may need to garbage collect a Bin constructor and allocate a new BinL constructor. Nonetheless, because both constructors have two fields, the previous Bin memory location can still be reused (only the constructor tag is updated). Our paper provides the details of the analysis that enables this, rooted in the concept of “reuse credits.”

Now, suppose we want to update all the values stored in a tree. Using a zipper, we can do this fully in place. While traversing, the zipper stores the input tree fragments in order, using BinL for unvisited and BinR for visited subtrees. Reusing the zipper nodes allows an in-order tree mapping without any heap allocation and in constant stack space. The tree map function starts by descending to the leftmost leaf, accumulating unvisited subtrees in BinL. Once we hit the leftmost leaf, we apply the argument function f and work our way back up, recursively processing any unvisited subtrees, as shown in Figure 3.

FP2: Fully In-Place Functional Programming - unvisited subtrees code in Koka
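
The pictured code is the most involved of the examples. The sketch below is a best-effort reconstruction that matches the description above and the constructor pairing described next; the paper’s version additionally marks these functions with fip and passes f as a borrowed parameter (^f), annotations we elide here. The div effect is declared only because this sketch does not prove termination to the compiler:

```koka
// In-order map over the leaves, using the zipper instead of the call stack.
fun down( t : tree<a>, f : a -> a, z : zipper<a> ) : div tree<a>
  match t
    Bin(l, r) -> down( l, f, BinL(z, r) )    // keep descending left; stash the unvisited right subtree
    Leaf(x)   -> app( z, f, Leaf( f(x) ) )   // leftmost leaf reached: apply f, then head back up

fun app( z : zipper<a>, f : a -> a, t : tree<a> ) : div tree<a>
  match z
    BinL(up, r) -> down( r, f, BinR(t, up) )  // the right subtree is still unvisited: descend into it
    BinR(l, up) -> app( up, f, Bin(l, t) )    // both children are done: rebuild the Bin node and go up
    Done        -> t                          // back at the root: t is the fully mapped tree

fun tmap( f : a -> a, t : tree<a> ) : div tree<a>
  down( t, f, Done )
```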

The mutually tail-recursive app and down functions are fully in place. Each matched Bin pairs with BinL, and each BinL with BinR, ultimately leading to BinR pairing with Bin. The definition of tmap may seem somewhat complex, but it is much simpler than its iterative imperative counterpart that uses direct pointer reversal.

Figure 3: The program after visiting the leaf containing f(2) on the given tree. The pointers in the zipper are reversed.

Perspectives and further reading

Koka’s new fip keyword ensures that certain functions do not allocate and only use constant stack space, offering efficient and secure code execution akin to static linear types or Rust’s borrow checker. This introduces a new paradigm for writing programs that are purely functional but can still execute in place. We consider this new technique to be a significant milestone on the path toward using high-level functional programming to develop robust software that delivers both competitive and predictable performance. 

To learn about fully in-place functional programming and the Koka language, start at the Koka homepage (opens in new tab). Koka implements a variety of innovative language features, including algebraic effect handlers and first-class constructor contexts. We encourage readers to continue exploring and experimenting with fully in-place programming. For example, try implementing skew binary heaps (opens in new tab) in Koka. Can you demonstrate fully in-place heap union?

The post FP2: Fully In-Place Functional Programming provides memory reuse for pure functional programs  appeared first on Microsoft Research.

Understanding social biases through the text-to-image generation lens

This research paper was presented at the Sixth AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) (opens in new tab), a premier forum for discussion on the societal and ethical aspects of artificial intelligence.

The rise of text-to-image (T2I) generation has ushered in a new era of innovation, offering a broad spectrum of possibilities for creators, designers, and the everyday users of productivity software. This technology can transform descriptive text into remarkably realistic visual content, empowering users to enrich their work with vivid illustrative elements. However, beneath this innovation lies a notable concern—the potential inclusion of harmful societal biases.

These T2I models create images from the extensive web data on which they were trained, and this data often lacks representation of different demographic groups and cultures and can even harbor harmful content. When these societal biases seep into AI-generated content, they perpetuate and amplify pre-existing societal problems, creating a disconcerting cycle that undermines previous and current mitigation efforts.

Representation of gender, race, and age across occupations and personality traits

To tackle this problem, it is essential to rigorously evaluate these models across a variety of demographic factors and scenarios. In our paper, “Social Biases through the Text-to-Image Generation Lens (opens in new tab),” presented at AIES 2023 (opens in new tab), we conduct a thorough analysis for studying and quantifying common societal biases reflected in generated images. We focus on the portrayal of occupations, personality traits, and everyday situations across representations of gender, age, race, and geographical location. 

For example, consider images that reinforce societal biases for the roles of CEO and housekeeper. These professions have been extensively studied as examples of stereotypical gender biases—where predominantly men are CEOs and women are housekeepers. For all such cases, we considered three different perspectives: 

  1. Real-world distribution: Relies on labor statistics, presenting distribution across various dimensions, such as gender, race, and age.
  2. Search engine results: Captures the distribution evident in search engine outcomes, reflecting contemporary portrayals.
  3. Image generation results: Emphasizes the distribution observed in image generation outputs.

We tested two T2I generators, DALLE-v2 and Stable Diffusion, and compared them with 2022 data from the U.S. Bureau of Labor Statistics (BLS) and the results of a Google image search (GIS) conducted in 2020, examining how women are represented across five different occupations. The analysis of the generation models revealed a significant setback in representational fairness compared with both the BLS data and the image search results. Notably, images generated by DALLE-v2 provide minimal representation of women in the professions of CEO and computer programmer. Conversely, in images generated by Stable Diffusion, women are consistently represented in the roles of nurses and housekeepers 100% of the time. Figure 1 illustrates our findings, and Figure 2 shows examples of images generated for different occupations. 

Figure 1. Gender representation for DALLE-v2, Stable Diffusion, Google Image Search 2020, and BLS data.
Figure 2. A sample of the first four images generated for the professions of “computer programmer” and “housekeeper” using the DALL-E v2 and Stable Diffusion models. Notably, one gender is conspicuously absent across a distribution of 500 generated images.

Even when using basic prompts like “person” without including an occupation, we observed that models can underrepresent certain demographic groups across age, race, and gender. When we analyzed DALLE-v2 and Stable Diffusion, both offered a limited representation of races other than white across a set of 500 generated images. Furthermore, the DALLE-v2 outputs revealed a remarkable lack of age diversity, with over 80% of the images depicting either adults who appeared to be between the ages 18 and 40 or children. This is illustrated in Figure 3.

Figure 3. Gender, race, and age distribution as interpreted by human annotators and automated face processing within the context of image generation for the prompt “person.”

Our study also examines biases of similar representations across positive and negative personality traits, revealing the subtleties of how these traits are depicted. While individuals of nonwhite races appear linked with positive attributes such as vigor, ambition, striving, and independence, they are also associated with negative traits like detachment, hardheartedness, and conceitedness. 

Representation of geographical locations in everyday scenarios 

Another aspect of bias that we studied pertains to the representation of diverse geographical locations in how models interpret everyday scenarios. We did this using such prompts as “a photo of a birthday party” or “a photo of a library.” Although it is difficult to discern the precise location of a generated photo, distinctions in these representations can still be measured between a general prompt and a prompt that specifies a location, for example, “a photo of a birthday party in Colombia.” In the paper, we describe this experiment for the two most populous countries in each inhabited continent, considering everyday scenarios centering around events, places, food, institutions, community, and clothing. Overall, the results indicated that images generated with location-specific prompts for countries like Nigeria, Papua New Guinea, and Ethiopia differed the most from those generated with the general prompt, while images generated for Germany, the US, and Russia aligned most closely with the general prompt. 

Subtle effects of using expanded prompts 

Many bias mitigation techniques rely on expanding the prompt to enrich and diversify the images that models generate. To tackle bias in AI-generated images, we applied prompt engineering (opens in new tab) to increase the likelihood that the image will reflect what’s specified in the prompt. We used prompt expansion, a type of prompt engineering, to add further descriptors to the initial general prompts and guide the model towards unbiased content. An example of prompt expansion would be “a portrait of a female doctor” instead of “a portrait of a doctor.” Our experiments showed that prompt expansion is largely effective in creating more specific content in AI-generated images. However, there are also unintended outcomes, particularly in terms of decreased diversity and image quality, as shown in Figure 4. 

Figure 4. Expanded prompts using descriptors like “female” can indeed yield more diverse depictions, but often at the cost of image variety and quality.

Safeguarding against bias in T2I models

As T2I generation models become increasingly integrated into our digital ecosystems, it is paramount that we remain vigilant to the biases they may inadvertently perpetuate. This research underscores the profound importance of continually evaluating and refining these models. We hope that the outcomes and methodology presented in this study provide valuable insights for evaluating and building new generative models. We would like to emphasize the importance of fostering responsible development and ensuring representational fairness in this process. 

The post Understanding social biases through the text-to-image generation lens appeared first on Microsoft Research.

Incorporating chemists’ insight with AI models for single-step retrosynthesis prediction

Retrosynthesis analysis is a critical task in organic chemistry and central to many important industries. It primarily involves decomposing a target molecule into commercially available molecules step by step. Since synthesis strategies can be quite diverse and strategic, retrosynthesis planning with expert knowledge has long been considered an “art.”

Recently, machine learning-based approaches have achieved promising results on this task, particularly in single-step retrosynthesis prediction. In retrosynthesis, a molecule can be represented as either a 2D graph or a 1D SMILES (simplified molecular-input line-entry system) sequence. SMILES is a notation system used to represent chemical structures using plain text, which consists of a sequence of characters to describe the arrangement of atoms, bonds, and rings within a molecule. SMILES can be considered a traversal on the corresponding molecular graph, as shown in Figure 1.

Figure 1: An example of a molecular graph and its SMILES string.

Given these representations of molecules, most machine learning-based approaches employ encoder-decoder frameworks, where the encoder encodes the target product molecule (as a sequence or a graph) into high-dimensional vectors, and the decoder takes the encoder’s output and generates the output sequence (the predicted reactants) token by token, autoregressively. 

Casting retrosynthesis analysis as a sequence decoding problem enables the use of deep neural architectures that are well-developed in machine translation or graph neural networks. While AI has made significant strides in predicting reactants, it’s crucial to acknowledge the expertise of human chemists. In real-world route scouting tasks, synthetic chemists rely on their professional experience and abstract understanding of underlying mechanisms. They often start with molecular substructures or fragments that are chemically similar to target molecules, providing clues for a series of chemical reactions that may yield the target product.

Our paper, Single-step retrosynthesis prediction by leveraging commonly preserved substructures (opens in new tab), proposes a novel approach that leverages commonly preserved substructures in organic synthesis. This approach incorporates chemists’ insight in retrosynthesis, bringing the AI model closer to the way human experts think.

Substructure extraction and modeling

In the context of organic chemistry, “substructures” refer to molecular fragments or smaller building blocks that are chemically similar or preserved within target molecules. These substructures serve as essential components for understanding the assembly of complex molecules and play a significant role in retrosynthesis analysis. 

Based on this concept, our framework consists of three main modules:

  1. Reaction Retrieval: This module retrieves similar reactions, given a product molecule as a query. It uses a learnable cross-lingual memory retriever to align reactants and products in high-dimensional vector space.
  2. Substructure Extraction: We extract the common substructures from the product molecule and the top cross-aligned candidates, based on molecular fingerprints. These substructures provide a reaction-level, fragment-to-fragment mapping between reactants and products.
  3. Substructure-level Sequence-to-Sequence Learning: We convert the original token-level sequence to a substructure-level sequence. The new input sequence includes the SMILES strings of the substructures followed by the SMILES strings of other fragments with virtual number labels. The output sequences are the fragments with virtual numbers. The virtual numbers are used to indicate the bond breaking/connecting site.
Figure 2: Method overview, with virtual number labeled atoms and substructures highlighted in green.

Unlike most existing work, our model only needs to predict the fragments connected to the substructure, thereby simplifying the prediction task, with the substructure part remaining unchanged. 

In the example shown in Figure 2, the substructure “COC(=O)Cc1cc2ccc(F)cc2[2cH]c1C.C[1SH](=O)=O” remains unchanged, and the model only needs to predict the fragment “[2BH]2OC(C)(C)C(C)(C)O2.[1cH]1ccc(Br)nc1”. The substructure SMILES and the predicted fragment SMILES are then combined to form the complete SMILES of the reactants.

Retrosynthesis prediction

We analyzed our method using the USPTO full dataset (opens in new tab) and compared it to other notable works in the field. In almost every scenario, our method achieved top-1 accuracy comparable to or better than that of previously tested methods. On the subset of data where substructures were successfully extracted, model performance improved significantly compared to the overall result. 

The improvement in our method can be attributed to two main factors:

  1. Our method successfully extracted substructures from 82.2% of all products in the USPTO full test dataset, demonstrating the general applicability of this approach. 
  2. We only needed to generate fragments connected to virtually labeled atoms in the substructures, which shortened the string representations of molecules and significantly lowered the number of atoms to be predicted.
Figure 3: Product molecule specific substructures. These reactants all contain phthalimide, with substructures highlighted in green.

A key aspect of our method for one-step retrosynthesis is the extraction of product-specific substructures. By doing so, we can better capture subtle structural changes from reactants to products that are unique to each reaction. Take phthalimide, a common heterocyclic substructure, as an example. We analyzed four exemplary reactions where the reactants contain phthalimide (see Figure 3). The extracted substructures vary among different reaction types, demonstrating the product-specific nature of the substructures.

In reaction (a) and reaction (b), phthalimide is not considered part of the substructure because it is involved in the reaction. However, in reaction (c) and reaction (d), the substructures are different, yet they both contain phthalimide. These results show that substructures are indeed product-specific, which aligns with our expectations.

Incorporating human insights into decision-making 

In addition, leveraging commonly preserved substructures offers another benefit: providing users with valuable insights for decision-making in retrosynthesis planning. When compared to existing methods, our approach can help human experts assess potential pathways and eliminate infeasible reactions using their chemistry knowledge. 

For each input product molecule, we extract multiple substructures from the retrieved reactions (see details in our paper), and in some cases not all of the substructures are correct. We can therefore group predictions by substructure. As shown in Figure 4, the predicted groups of reactants and reactions offer valuable information to experts. For instance, they can refine predictions by comparing reactions associated with retrieved candidates, making our predictions more explainable and trustworthy compared to existing “black-box” models.

Figure 4: Substructures and predictions grouped by substructures. The retrieved candidate reactants (#2, #3 and #4) indicate that the substructures extracted from the retrieved reactant #1 are likely incorrect, because the triple bond is likely a reaction site. The extracted substructures are highlighted in green.

We hope that our work will spark interest in this fast-growing and highly interdisciplinary area of retrosynthesis prediction and other related topics. By pushing the boundaries of what’s possible in chemistry and machine learning, we can continue to make strides in understanding complex chemical reactions and designing more efficient retrosynthetic strategies.

The post Incorporating chemists’ insight with AI models for single-step retrosynthesis prediction appeared first on Microsoft Research.

Rethinking trust in direct messages in the AI era

This blog post is a part of a series exploring our research in privacy, security, and cryptography. For the previous post, see https://www.microsoft.com/en-us/research/blog/research-trends-in-privacy-security-and-cryptography.

While AI has the potential to massively increase productivity, this power can be used equally well for malicious purposes, for example, to automate the creation of sophisticated scam messages. In this post, we explore threats AI can pose for online communication ecosystems and outline a high-level approach to mitigating these threats.

Communication in the age of AI

Concerns regarding the influence of AI on the integrity of online communication are increasingly shared by policymakers, AI researchers, business leaders, and other individuals. These concerns are well-founded, as benign AI chatbots can be easily repurposed to impersonate people, help spread misinformation, and sway both public opinion and personal beliefs. So-called “spear phishing” attacks, which are personalized to the target, have proved devastatingly effective. This is particularly true if victims are not using multifactor authentication, meaning an attacker who steals their login credentials with a phishing mail could access authentic services with those credentials. This opportunity has not been missed by organized cybercrime; AI-powered tools marketed to scammers and fraudsters are already emerging. This is disturbing, because democratic systems, business integrity, and interpersonal relationships all hinge on credible and effective communication—a process that has notably migrated to the digital sphere.

As we enter a world where people increasingly interact with artificial agents, it is critical to acknowledge that these challenges from generative AI are not merely hypothetical. In the context of our product offerings at Microsoft, they materialize as genuine threats that we are actively addressing. We are beginning to witness the impact of AI in generating highly specific types of text (emails, reports, scripts, code) in a personalized, automated, and scalable manner. In the workplace, AI-powered tools are expected to bring about a huge increase in productivity, allowing people to focus on the more creative parts of their work rather than tedious, repetitive details. In addition, AI-powered tools can improve productivity and communication for people with disabilities or among people who do not speak the same language.  

In this blog post, we focus on the challenge of establishing trust and accountability in direct communication (between two people), such as email, direct messages on social media platforms, SMS, and even phone calls. In all these scenarios, messaging commonly takes place between individuals who share little or no prior context or connection, yet those messages may carry information of high importance. Some examples include emails discussing job prospects, new connections from mutual friends, and unsolicited but important phone calls. The communication may be initiated on behalf of an organization or an individual, but in either case we encounter the same problem: if the message proves to be misleading, malicious, or otherwise inappropriate, holding anyone accountable for it is impractical, may require difficult and slow legal procedures, and does not extend across different communication platforms. 

As the scale of these activities increases, there is also a growing need for a flexible cross-platform accountability mechanism that allows both the message sender and receiver to explicitly declare the nature of their communication. Concretely, the sender should be able to declare accountability for their message and the receiver should be able to hold the sender accountable if the message is inappropriate.

Elements of accountability 

The problems outlined above are not exactly new, but recent advances in AI have made them more urgent. Over the past several years, the tech community, alongside media organizations and others, has investigated ways to determine whether text or images were created by AI; C2PA, a content provenance technology, is one example of a possible solution among others. With AI-powered tools increasingly being used in the workplace, Microsoft believes that it will take a combination of approaches to provide the highest value and most transparency to users. 

Focusing on accountability is one such approach. We can start by listing some properties we expect of any workable solution:

  • People and organizations need to be able to declare accountability for the messages they send. 
  • Receivers need to be able to hold the senders accountable if the message is inappropriate or malicious, to protect future potential victims. 
  • There must exist an incentive for the sender to declare accountability. 
  • The mechanism should only solve the accountability problem and nothing else. It must not have unintended side effects, such as a loss of privacy for honest participants. 
  • Receivers should not be required to register with any service. 
  • The accountability mechanism must be compatible with the plurality of methods people use to communicate today.

One way to build an accountability mechanism is to use a reputation system that verifies real-world identities, connecting our digital interactions to a tangible and ultimately accountable organization or human identity. Online reputation has now become an asset that organizations and individuals have a vested interest in preserving. It creates an incentive for honest and trustworthy behavior, which ultimately contributes to a safer and more reliable digital environment for everyone.

Reputation system for online accountability 

Consider what an online communication user experience could be like with an integrated reputation system. In this solution, a message sender could declare their accountability by binding their message to their account in the reputation system in the form of a cryptographic reputation tag. Conversely, the receiver uses the tag to verify the sender’s reputation and can use it to report the sender if the message is inappropriate, reducing the sender’s reputation. It is the sender’s responsibility to judge whether the receiver will perceive the message as inappropriate. 

Messages with an attached reputation tag are called reputed messages, whereas those without an associated reputation are called generic messages. Reputed messages would typically make the most sense in one-to-one communication that the sender intends for a particular recipient, or one-to-many communication to a few recipients. For example, a proposal to discuss a business deal, a wedding invitation email, a payment reminder SMS from a company’s billing department, or a work email discussing a joint project might be sent as reputed messages. Generic messages would typically not be intended for a particular receiver. For example, emails sent to a mailing list (many receivers) or non-personalized advertisements (large scale) should be sent as generic. 

The different components and workflows of our accountability mechanism are depicted, at a high level, in Figure 1.

Figure 1: An accountability mechanism design, showing both the account creation and message sending/reporting workflows.

Taking a concrete example, think of a situation where you receive an email from your bank asking you to verify the security settings for your account. You know that phishing emails often target such scenarios, so your first reaction is to ignore the message. However, in this case your email client has noted the valid reputation tag and automatically moved the email to a reputed messages folder. It shows the sender’s reputation, high, next to the message. Instead of deleting the unsolicited and slightly suspicious email, you decide to check whether the link in the email truly leads you to your bank’s website. You are now convinced this is a legitimate message and proceed with the recommendations to review your security settings. 

As another example, suppose you work in your company’s billing department. You find something wrong with a customer’s billing information and decide to send them an email to get more information. Since this is an important matter, you hope to maximize the chance of them seeing your message by attaching the billing department’s reputation tag to it. The customer sees the email go in the reputed messages folder, notices the sender’s high reputation, and responds to it with appropriate urgency.

As a third example, imagine that you receive an unsolicited phone call from someone who claims to be your distant relative and wants to discuss a family reunion they are organizing. They ask you questions about your family, making you slightly uneasy. Right before calling you, they sent you a reputation tag via SMS encoding their reputation and the context of their call. You verify that the tag is valid, but that their reputation is medium. You decide to end the call and report them using the tag they shared, as you felt that their call asking for such sensitive information was inappropriate. 

These examples highlight that this single system can be used across many different modes of communication, from emails to social media messages to phone calls, fostering trust and safety across the entire landscape of direct communication methods in use today.

Call to action

In this blog post we have attempted to outline a solution to an already existing problem that is exacerbated by modern AI. Capturing the core of this problem is not easy, and many of the previously proposed solutions have unintended consequences that make them unworkable. For example, we explained why approaches that attempt to limit the use of AI are unlikely to succeed. 

The solutions are not easy either. The messaging ecosystem is vastly complex, and any solution requiring fundamental changes to it is unlikely to be acceptable. Usability is a key concern as well: if the system is only designed to communicate risk, we may want to avoid inadvertently communicating safety, much like the padlock symbol indicating HTTPS has caused confusion and an underestimation of risk for web browser users (opens in new tab).

Is there a comprehensive identity framework that would connect real-world identities to digital identities? This connection to a unique real-world identity is crucial, as otherwise anyone could simply create as many distinct reputation accounts as they need for any nefarious purpose.

For organizations, the situation is easier, because countries and states tend to hold public records that establish their existence and “identity.” For individuals, platforms like Reddit, TripAdvisor, and Stack Overflow have built reputation systems for their internal use, but without a foundational layer that confirms unique human identities, these cannot be used to solve our problem, just as Facebook’s “real name” policy and X Premium (formerly Twitter Blue) have been insufficient to prevent the creation and use of fake accounts. Still, this is not an impossible problem to solve: LinkedIn is already partnering with CLEAR (opens in new tab) to bind government ID verification to a verification marker in user profiles, and with Microsoft Entra Verified ID (opens in new tab) to verify employment status. Worldcoin (opens in new tab) is building a cryptocurrency with each wallet being linked to a unique real-world person through biometrics, and Apple recently announced Optic ID (opens in new tab) for biometric authentication through its Vision Pro headset.

Whenever we talk about identities—especially real-world identities—we need to talk about privacy. People use different digital identities and communication methods in different communities, and these identities need to be kept separate. Trusting a reputation system with such sensitive information requires careful consideration. Our preliminary research suggests that techniques from modern cryptography can be used to provide strong security and privacy guarantees so that the reputation system learns or reveals nothing unnecessary and cannot be used in unintended ways. 

What about the governance of the reputation system? At one extreme, a single centralized party hosts the system while providing cryptographic transparency guarantees of correct operation. At the other extreme, we should explore whether a purely decentralized implementation is feasible. There are also options between these two extremes; for example, multiple smaller reputation systems hosted by different companies and organizations. 

These open questions present an opportunity and a responsibility for the research community. At Microsoft Research, we are diligently working on aspects of this problem in partnership with our research on privacy-preserving verifiable information and identity, secure hardware, transparency systems, and media provenance. We invite the rest of the research community to join in by either following the path we outlined here or suggesting better alternatives. This is the start of a broad exploration that calls for a profound commitment and contribution from all of us.

The post Rethinking trust in direct messages in the AI era appeared first on Microsoft Research.

AI Frontiers: AI in India and beyond with Sriram Rajamani

Episode 146 | August 31, 2023

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come. 

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity.

This episode features Sriram Rajamani, Distinguished Scientist and Managing Director of Microsoft Research India. Rajamani talks about how the lab’s work is being influenced by today’s rapidly advancing AI. One example? The development of a conversational agent in India capable of providing information about governmental agricultural programs in farmers’ natural language, particularly significant in a country with more than 30 languages, including 22 government-recognized languages. It’s an application Microsoft CEO Satya Nadella described as the “mic drop moment” of his trip to the lab early this year.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale AI models like GPT-4 is accelerating the advancement of AI. These models and the systems they power are exhibiting surprising new abilities like reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in large AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Sriram Rajamani, Managing Director of Microsoft Research India. For nearly 20 years, this lab has focused on interdisciplinary research, blending theory and practice and computer science with social science. Our researchers in India have made many contributions to advance AI in areas like causal reasoning, but the latest wave of powerful AI models has made a profound impact on all the lab’s work, including their approach to creating technologies for underserved communities.


[MUSIC FADES]

All right, so, Sriram, let’s dive right in. I think it’s fairly obvious for me to say at this point that ChatGPT—and generative AI more broadly—is a worldwide phenomenon. But what’s so striking to me about this is the way that so many people around the world can pick up the technology and use it in their context, in their own way. I was on a panel discussion a few weeks ago where I saw a comedian discover in real time that GPT-4 could write jokes that are actually funny. And shortly after that, I spoke to a student who was using ChatGPT to write an application to obtain a grazing permit for cattle. You know, the work of your lab is situated in its own unique societal context. So, what I really want to know and start with here today is, like, what’s the buzz been like for you in your part of the world around this new wave of AI?

SRIRAM RAJAMANI: Yeah. First of all, Ashley, you know, thank you for having this conversation with me. You’re absolutely right that our lab is situated in a very unique context on how this technology is going to play out in, you know, this part of the world, certainly. And you might remember, Ashley, a sort of a mic drop moment that happened for Satya [Nadella] when he visited India earlier this year, in January. So one of our researchers, Pratyush Kumar—he’s also co-founder of our partner organization called AI4Bhārat—he works also with the government on a project called Bhashini, which the government endeavors to bring conversational AI to the many Indian languages that are spoken in India. And what Pratyush did was he connected some of the AI4Bhārat translation models, language translation models, together with one of the GPT models to build a bot for a farmer to engage and ask questions about the government’s agricultural programs so the farmer could speak in their own language—you know, it could be Hindi—and what the AI4Bhārat models would do is to convert the Hindi speech into text and then translate it into English. And then he taught, you know, either fine-tuned or integrated with augmented generation … I don’t … I’m not … I don’t quite remember which one … it was one of those … where he made a GPT model customized to understand the agricultural program of the government. And he chained it together with this speech recognition and translation model. And the farmer could just now talk to the system, the AI system, in Hindi and ask, you know, are they eligible for their benefits and many details. And the, and the model had a sensible conversation with him, and Satya was just really amazed by that, and he calls … he called that as the mic drop moment of his trip in India, which I think is indicative of the speed at which this disruption is impacting very positively the various parts of the world, including the Indian subcontinent.

LLORENS: You referenced the many Indian languages written and spoken. Can you just bring, bring that to life for us? How many, how many languages are we talking about?

RAJAMANI: So, I think there are at least, you know, 30 or 40, you know, main, mainstream languages. I mean, the government recognizes 22. We call them as IN22. But I would think that there are about 30-plus languages that are spoken very, very broadly, each of them with, you know, several tens of millions, hundreds of millions of speakers. And then there is a long tail of maybe a hundred more languages which are spoken by people with … in, in smaller population counts. The real … they’re also very low-resource languages like Gondi and Idu Mishmi, which are just spoken by maybe just only a million speakers or even under a million speakers who probably … those languages probably don’t have enough data resources. So, India is an amazing testbed because of this huge diversity and distribution of languages in terms of the number of speakers, the amount of available data, and, and many of these tail languages have unique, you know, sociocultural nuances. So I think in that sense, there’s a really good testbed for, you know, how conversational AI can inclusively impact the entire world.

LLORENS: And, and what’s the … you mentioned tail languages. And so maybe we mean they’re low-resource languages like you also mentioned. What’s the gap like between what languages AI is accessible in today versus the full extent of all those languages that you just described, even just for, you know, for the Indian subcontinent? 

RAJAMANI: So what is … what we’re seeing is that with IN22, the top languages, if you look at successive versions of the GPT models, for example, the performance is definitely improving. So if you just go from, you know, GPT-2 to GPT-3 to 3.5 to 4, right, you can sort of see that these models are increasingly getting capable. But still there is a gap between what these models are able to do and what custom models are able to do, particularly if you go towards languages in which there’s not enough training data. So, so people in our lab, you know, are doing very systematic work in this area. There is a benchmarking work that my colleagues are doing called MEGA, where there is systematic benchmark being done on various tasks on a matrix that consists of, you know, tasks on one axis and languages on another axis to just systematically, empirically study, you know, what these models are able to do. And also, we are able to build models to predict how much more data is needed in each of these languages in order for the performance to be comparable to, say, languages like English. What is the … what is the gap, and how much data is needed? The other thing is that it turns out that these models, they, they learn also from related languages. So if you want to improve the performance of a language, it turns out there are other languages in the world and in India that have similar characteristics, you know, syntactic and semantic characteristics, to the language that you’re thinking about. So we can also sort of recommend, you know, what distribution of data we should collect so that all the languages improve. So that’s the kind of work that we’re doing.

LLORENS: Yeah, it’s one of the most fascinating parts of all of this—how diversity in the training dataset improves, you know, across the board, like even the addition of code, for example, in addition to language, and now we’re even seeing even other modalities. And, you know, the, the wave of AI and the unprecedented capabilities we’re seeing has significant implications for just about all of computing research. In fact, those of us in and around the field are undergoing now a process that I call, you know, reimagining computing research. And, you know, that’s a somewhat artful way to put it. But beyond the technical journey, there’s an emotional journey happening across the research community and many other communities, as well. So what has that journey been like for you and the folks at the India lab?

RAJAMANI: Yeah, that’s a good question, Ashley. You know, our work in the lab spans four areas. You know, we do work in theory and algorithms. We do work in AI and machine learning. We do systems work, and we also have an area called “Technology and Empowerment.” It’s about making sure that technology benefits people. And so far, our conversation has been about the last area. But all these four areas have been affected in a big way using this disruption. Maybe, maybe I’ll just say a few more things about the empowerment area first and then move on to the other ones. If you look at our work in the empowerment area, Ashley, right, this lab has had a track record of doing work that makes technology inclusive not just from an academic perspective, but by also deploying the work via spun-off startups, many startups, that have taken projects in the lab and scaled them to the community. Examples are Digital Green, which is an agricultural extension; 99DOTS, which is a tuberculosis medication adherence system. Karya is a, is a platform for dignified digital labor to enable underprivileged users, rural users, to contribute data and get paid for it. You know, HAMS is a system that we have built to improve road safety. You know, we’ve built a system called BlendNet that enables rural connectivity. And almost all of these, we have spun them off into startups that are … that have been funded by, you know, venture capitalists, impact investors, and we have a vibrant community of these partners that are taking the work from the lab and deploying them in the community. So the second thing that is actually happening in this area is that, as you may have heard, India is playing a pivotal role in digital public infrastructure. Advances like the Aadhaar biometric authentication system; UPI, which is a payment system; they are pervasively deployed in India, and they reach, you know, several hundreds of millions of people. And in the case of Aadhaar, more than a billion people and so on. And the world is taking note. India is now head of the G20, and many countries now want to be inspired by India and build such a digital public infrastructure in their own countries, right. And so, so, so what you saw is the mic drop moment, right? That … it actually has been coming for a long time. There has been a lot of groundwork that has been laid by our lab, by our partners, you know, such as AI4Bhārat, the people that work on digital public goods to get the technical infrastructure and our know-how to a stage where we can really build technology that benefits people, right. So, so going forward, in addition to these two major advancements, which is the building of the partner and alumni ecosystem, the digital public good infrastructure, I think AI is going to be a third and extremely important pillar that is going to enable citizen-scale digital services to reach people who may only have spoken literacy and who might speak in their own native languages and the public services can be accessible to them. 

LLORENS: So you mentioned AI4Bhārat, and I’d love for you to say a bit more about that organization and how researchers are coming together with collaborators across sectors to make some of these technology ideas real.

RAJAMANI: Yeah. So AI4Bhārat is a center in IIT Madras, which is an academic institution. It has multiple stakeholders, not just Microsoft Research, but our search technology center in India also collaborates with them. Nandan Nilekani is a prominent technologist and philanthropist. He’s behind a lot of India’s digital public infrastructure. He also, you know, funds that center significantly through his philanthropic efforts. And there are a lot of academics that have come together. And what the center does is data collection. I talked about the diversity of, you know, Indian languages. They collect various kinds of data. They also look at various applications. Like in the judicial system, in the Indian judicial system, they are thinking about, you know, how to transcribe, you know, judgments, enabling various kinds of technological applications in that context, and really actually thinking about how these kinds of AI advances can help right on top of digital public goods. So that’s actually the context in which they are working. 

LLORENS: Digital public goods. Can you, can you describe that? What, what do we mean in this context by digital public good?

RAJAMANI: So what we mean is if you look at Indian digital public infrastructure, right, that is, as I mentioned, that is Aadhaar, which is the identity system that has now enrolled more than 1.3 billion Indians. There is actually a payment infrastructure called UPI. There are new things that are coming up, like something that’s, that’s called Beckn. There’s something called ONDC that is poised to revolutionize how e-commerce is done. So these are all, you know, sort of protocols that through private-public partnership, right, government together with think tanks have developed, that are now deployed in a big way in India. And they are now pervasively impacting education, health, and agriculture. And every area of public life is now being impacted by these digital public infrastructures. And there is a huge potential for AI and AI-enabled systems to ride on top of this digital public infrastructure to really reach people. 

LLORENS: You know, you talked about some of the, you know, the infrastructure considerations, and so what are the challenges in bringing, you know, digital technologies to, you know, to, to the Indian context? And, and you mentioned the G20 and other countries that are following the patterns. What are, what are some of the common challenges there?

RAJAMANI: So, I mean, there are many, many challenges. One of them is lack of access. You know, though India has made huge strides in lifting people out of poverty, people out there don’t have the same access to technology that you and I have. Another challenge is awareness. People just don’t know, you know, how technology can help them, right. You know, people hearing this podcast know about, you know, LinkedIn to get jobs. They know about, you know, Netflix or other streaming services to get entertainment. But there are many people out there that don’t even know that these things exist, right. So awareness is another issue. Affordability is another issue. So … many of the projects that I mentioned, what they do is actually they start not with the technology; they start with the users and their context and this situation, and what they’re trying to do and then map back. And technology is just really one of the pieces that these systems, that all of these systems that I mentioned, right … technology is just only one component. There’s a sociotechnical piece that deals with exactly these kinds of access and awareness and these kinds of issues. 

LLORENS: And we’re, we’re kind of taking a walk right now through the work of the lab. And there are some other areas that you, you want to get into, but I want to come back to this … maybe this is a good segue into the emotional journey part of the question I asked a few minutes ago. As you get into some of the, you know, the deep technical work of the lab, what were some of the first impressions of the new technologies, and what were, what were some of the first things that, you know, you and your colleagues there and our colleagues, you know, felt, you know, in observing these new capabilities?

RAJAMANI: So I, I think Peter [Lee] mentioned this very eloquently as stages of grief. And me and my colleagues, I think, went through the same thing. I mean, the … there was … we went from, you know, disbelief, saying, “Oh, wow, this is just amazing. I can’t believe this is happening” to sort of understanding what this technology can do and, over time, understanding what its limitations are and what the opportunities are as a scientist and technologist and engineering organization to really push this forward and make use of it. So that’s, I think, the stages that we went through. Maybe I can be a little bit more specific. As I mentioned, the three other areas we work on are theory and algorithms, machine learning, and systems. And I can sort of see … say how my colleagues are evolving, you know, their own technical and research agendas in the, in the light of the disruption. If you take our work in theory, this lab has had a track record of, you know, cracking longstanding open problems. For example, problems like the Kadison-Singer conjecture that was open for many years, many decades, was actually solved by people from the lab. Our lab has incredible experts in arithmetic and circuit complexity. They came so close to resolving the VP versus VNP conjecture, which is the arithmetic analog of the P versus NP problem. So we have incredible people working on, working on theoretical computer science, and a lot of them are now shifting their attention to understanding these large language models, right. Instead of understanding just arithmetic circuits, you know, people like Neeraj Kayal and Ankit Garg are now thinking about mathematically what does it take to understand transformers, how do we understand … how might we evolve these models or training data so that these models improve even further in performance in their capabilities and so on. So that’s actually a journey that the theory people are going through, you know, bringing their brainpower to bear on understanding these models foundationally. Because as you know, currently our understanding of these foundation models is largely empirical. We don’t have a deep scientific understanding of them. So that’s the opportunity that the, that the theoreticians see in this space. If you look at our machine learning work, you know, that actually is going through a huge disruption. I remember now one of the things that we do in this lab is work on causal ML … Amit Sharma, together with Emre Kiciman and other colleagues working on causal machine learning. And I heard a very wonderful podcast where you hosted them some time ago. Maybe you can say a little bit about what, what you heard from them, and then I can pick back up and connect that with the rest of the lab. 

LLORENS: Sure. Well, it’s … you know, I think the, the common knowledge … there’s, there’s so many, there’s so many things about machine learning over the last few decades that have become kind of common knowledge and conventional wisdom. And one of those things is that, you know, correlation is not causation and that, you know, you know, learned models don’t, you know, generally don’t do causal reasoning. And so we, you know, we’ve had very specialized tools created to do the kind of causal reasoning that Amit and Emre do. And it was interesting. I asked them some of the same questions I’m asking you now, you know, about the journey and the initial skepticism. But it has been really interesting to see how they’re moving forward. They recently published a position paper on arXiv where they conducted some pretty compelling experiments, in some cases, showing something like, you know, causal reasoning, you know, being, being exhibited, or at least I’ll say convincing performance on causal reasoning tasks. 

RAJAMANI: Yeah, absolutely.

LLORENS: Yeah, go ahead.

RAJAMANI: Yeah, yeah, yeah, absolutely. So, so, so, you know, I would say that their journey was that initially they realized that … of course, they build specialized causal reasoning tools like DoWhy, which they’ve been building for many years. And one of the things they realized was that, “Oh, some of the things that DoWhy can do with sophisticated causal reasoning, these large language models were just able to do out of the box.” And that was sort of stunning for them, right. And so the question then becomes, you know, is specific vertical research in causal reasoning even needed, right. So that’s actually the shock and the awe and the emotional journey that these people went through. But actually, after the initial shock faded, they realized that there is actually [a] “better together” story that is emerging in the sense that, you know, once you understand the details, what they realized was that natural language contains a lot of causal information. Like if you just look at the literature, the literature has many things like, you know, A causes B and if there is, if there is, you know, hot weather, then ice cream sales go up. You know, this information is present in the literature. So if you look at tools like DoWhy, what they do is that in order to provide causal machine learning, they need assumptions from the user on what the causal model is. They need assumptions about what the causal graph is, what are the user’s assumptions about which variables depend on which variables, right? And then … and, and, and what they’ve realized is that models like GPT-4 can now provide this information. Previously, only humans were able to provide this information. And … but in addition to that, right, tools like DoWhy are still needed to confirm or refute these assumptions, statistically, using data. So this division of labor between getting assumptions from either a human or from a large language model and then using the mathematics of DoWhy to confirm or refute the assumptions now is emerging as a real advance in the way we do causal reasoning, right? So I think that’s actually what I heard in your podcast, and that’s indicative of actually what the rest of my colleagues are going through. You know, moving from first thinking about, “Oh, GPT-4 is like a threat, you know, in the sense that it really obviates my research area” to understanding, “Oh, no, no. It’s really a friend. It, it really helps me do, you know, some of the things that required primarily human intervention. And if I combine GPT or these large language models together with, you know, domain specific research, we can actually go after bigger problems that we didn’t even dare going after before.” 
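A minimal sketch of the division of labor Rajamani describes, using the open-source DoWhy library: a causal graph, which could come from a domain expert or, hypothetically, be proposed by a large language model, is supplied as an assumption, and DoWhy then identifies, estimates, and statistically tries to refute the effect against data. The toy ice-cream dataset and variable names below are illustrative only.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Synthetic toy data: season confounds both temperature and ice cream sales.
rng = np.random.default_rng(0)
season = rng.integers(0, 2, size=1000)                  # 0 = cool season, 1 = warm season
temperature = 20 + 10 * season + rng.normal(0, 2, 1000)
sales = 5 + 2.0 * temperature + 8 * season + rng.normal(0, 3, 1000)
df = pd.DataFrame({"season": season, "temperature": temperature, "sales": sales})

# The causal graph is an *assumption*. In the workflow described above, this
# could be written by a human or proposed by an LLM from natural language text.
graph = """graph [directed 1
  node [id "season" label "season"]
  node [id "temperature" label "temperature"]
  node [id "sales" label "sales"]
  edge [source "season" target "temperature"]
  edge [source "season" target "sales"]
  edge [source "temperature" target "sales"]
]"""

model = CausalModel(data=df, treatment="temperature", outcome="sales", graph=graph)

# DoWhy's role: turn the assumed graph into an estimand, estimate it from data,
# and then attempt to refute the estimate statistically.
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(estimate.value)
print(refutation)
```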

LLORENS: Mmm. Let me, let me ask you … I’m going to, I’m going to pivot here in a moment, but did you … have you covered, you know, the areas of research in the lab that you wanted to walk through?

RAJAMANI: Yeah, yeah, there’s, there’s more. You know, thank you for reminding me. Even in the machine learning area, there is another work direction that we have called extreme classification, which is about building very, very … classifiers with a large number of labels, you know, hundreds of millions and billions of labels. And, you know, these people are also benefiting from large language encoders. You know, they have come up with clever ways of taking these language encoders that are built using self-supervised learning together with supervised signals from things like, you know, clicks and logs from search engines and so on to improve performance of classifiers. Another work that we’ve been doing is called DiskANN, or approximate nearest neighbor search. As you know, Ashley, in this era of deep learning, retrieval works by converting everything in the world, you know, be it a document, be it an image, you know, be it an audio or video file, everything into an embedding, and relevance … relevant retrieval is done by nearest neighbor search in a geometric space. And our lab has been doing … I mean, we have probably the most scalable vector index that has been built. And, and, and these people are positively impacted by these large language models because, you know, as you know, retrieval augmented generation is one of the most common design patterns in making these large language models work for applications. And so their work is becoming increasingly relevant, and huge demands are being placed on them to, you know, push the scale and the functionality of the nearest neighbor retrieval API to do things like, oh, can I actually add predicates, can I add streaming queries, and so on. So they are just getting stretched with more demand, you know, for their work. You know, if you look at our systems work, which is the last area that I want to cover, you know, we have, we have been doing work on using GPUs and managing GPU resources for training as well as inference. And this area is also going through a lot of disruption. And prior to these large language models, these people were looking at relatively smaller models, you know, maybe not, you know, hundreds of billions to trillions of parameters. But, but, you know, maybe hundreds of millions and so on. And they invented several techniques to share a GPU cluster among training jobs. So the disruption that they had was all these models are so large that nobody is actually sharing clusters for them. But it turned out that some of the techniques that they invented to deal with, you know, migration of jobs and so on are now used for failure recovery in very, very large models. So it turns out that, you know, at the beginning it seems like, “Oh, my work is not relevant anymore,” but once you get into the details, you find that there are actually still many important problems. And the insights you have from solving problems for smaller models can now carry over to the larger ones. And one other area I would say is the area of, you know, programming. You know, I myself work in this area. We have been doing … combining machine learning together with program analysis to build a new generation of programming tools. And the disruption that I personally faced was that the custom models that I was building were no longer relevant; they’re, they’re not even needed. So that was a disruption. 
But actually, what me and my colleagues went through was that, “OK, that is true, but we can now go after problems that we didn’t dare to go after before.” Like, for example, you know, we can now see that, you know, copilot and so on give you recommendations in the context of the particular file that you are editing. But can we now edit an entire repository which might contain, you know, millions of files with hundreds of millions of lines of code? Can I just say, let’s take, for example, the whole of the Xbox code base or the Windows code base, and in the whole code base, I want to do this refactoring, or I want to, you know, migrate this package from … migrate this code base from using now this serialization package to that serialization package. Can we just do that, right? I think we wouldn’t even dare going after such a problem two years ago. But now with large language models, we are thinking, can we do that? And large language models cannot do this right now because, you know, whatever context size you have, you can’t have 100 million lines of code as context for a large language model. And so this requires, you know, combining program analysis with these techniques. That’s an example. And actually, furthermore, there are, you know, many things that we are doing that are not quite affected by large language models. You know, for example, Ashley, you know about the HyWay project, where we’re thinking about technology to make hybrid work work better. And, you know, we are doing work on using GPUs and accelerators for, you know, database systems and so on. And we do networking work. We do low-Earth orbit satellite work for connectivity and so on. And we are doubling down on those, you know, though they have nothing to do with large language models, because those are problems that are important. So, I think, you know, to summarize, I would say that, you know, most of us have gone through a journey from, you know, shock and awe to sort of somewhat of an insecurity, saying is my work even relevant, to sort of understanding, oh, these things are really aides for us. These are not threats for us. These are really aides, and we can use them to solve problems that we didn’t even dream of before. That’s the journey I think my colleagues have gone through.
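The embedding-based retrieval Rajamani sketches (everything converted to a vector, relevance found by nearest neighbor search, results fed to a model via retrieval-augmented generation) can be illustrated with a deliberately tiny brute-force example. A production system would use a scalable index such as DiskANN rather than exhaustive search, and the hard-coded vectors below stand in for a real encoder.

```python
import numpy as np

# In a real system these vectors come from a neural encoder; here they are
# hard-coded stand-ins so the retrieval mechanics are visible end to end.
corpus = ["monsoon farming guide", "TB medication adherence", "UPI payment how-to"]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.1],
                     [0.0, 0.2, 0.9]], dtype=np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    """Brute-force nearest neighbor search by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]

# A query about digital payments embeds close to the third document; the
# retrieved passages would then be placed in the LLM prompt (RAG) to ground it.
query_vec = np.array([0.05, 0.15, 0.95], dtype=np.float32)
print(retrieve(query_vec))
```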

LLORENS: I want to, I want to step into two of the concepts that you just laid out, maybe just to get into some of the intuitions as to what problem is being solved and how generative AI is sort of changing the way that those, those problems are solved. So the first one is extreme classification. I think, you know, a flagship use of generative AI and foundation models is, is Bing chat. And so I think this idea of, of internet search as a, as a, you know, as a, a home for, for these new technologies is, is in the popular imagination now. And I know that extreme classification seeks to solve some challenges related to search and information retrieval. But what is the challenge problem there? What, you know … how is extreme classification addressing that, and how is that, you know, being done differently now? 

RAJAMANI: So as I mentioned, where my colleagues have already made a lot of progress is in combining language encoders with extreme classifiers to do retrieval. So there are these models called NLR. Like, for example, there’s a Turing NLR model, which is a large language model which does representation, right. It actually represents, you know, keywords, keyword phrases, documents, and so on in the encodings, you know, based on, you know, self-supervised learning. But it is a very important problem to combine the knowledge that these large language models have, you know, from understanding text. We have to combine that with supervised signals that we have from click logs. Because we have search engine click logs, we know, you know, for example, when somebody searches for this information and we show these results, what users click on. That’s supervised signals, and we have that in huge amounts. And what our researchers have done is they have figured out how to combine these encoders together with the supervised signals from click logs in order to improve both the quality and cost of retrieval, right. And, Ashley, as you said, retrieval is an extremely important part of experiences like Bing chat and retrieval augmented generation is what prevents hallucination and grounds these large language models with appropriate information retrieved and presented so that the, the relevant results are grounded without hallucination, right. Now, the new challenge that this team is now facing is, OK, so far so good as far as retrieval is concerned, right? But can we do similar things with generation, right? Can we now combine these NLG models, which are these generative models, together with supervised signals, so that even generation can actually be guided in this manner, improved in both performance and accuracy. And that is an example of a challenging problem that the team is going after.
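One way to picture what Rajamani describes, combining a pretrained encoder with supervised click-log signals, is a dual encoder fine-tuned so that a clicked (query, document) pair scores higher than the other documents in the batch. Everything below (the tiny stand-in encoder, the synthetic batch, the loss) is a schematic assumption, not the team's actual models or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Stand-in for a pretrained language encoder shared by queries and documents."""
    def __init__(self, vocab_size=30000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)  # placeholder for a real encoder

    def forward(self, token_ids):
        return F.normalize(self.emb(token_ids), dim=-1)

model = DualEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One schematic batch of click-log supervision: query i was clicked with document i.
queries = torch.randint(0, 30000, (8, 12))   # 8 queries, 12 tokens each
docs = torch.randint(0, 30000, (8, 64))      # their clicked documents

q, d = model(queries), model(docs)
scores = q @ d.t()                            # every query scored against every doc
labels = torch.arange(8)                      # the clicked doc is the positive
loss = F.cross_entropy(scores, labels)        # in-batch-negatives contrastive loss

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```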

LLORENS: Now let’s do the same thing with programming, and maybe I’m going to engage you on a slightly higher level of abstraction than the deep work you’re doing. And then, we can, we can, we can get back down into the work. But one of the things … one of, one of the, one of the popular ideas about these new foundation models is that you can … effectively through interacting with them, you’re sort of programming them in natural language. How does that concept sit with you as someone who, you know, is an expert in programming languages? What do you, what do you think, what do you think when someone says, you know, sort of programming the, you know, the system in natural language?

RAJAMANI: Yeah, so I, I find it fascinating and, you know, for one, you know, can we … an important topic in programming language research has always been whether we can get end users or, you know, people who are nonprogrammers to program. I think that has been a longstanding open problem. And if you look at the programming language community, right, the programming language community has been able to solve it only in, in narrow domains. You know, for example, Excel has Flash Fill, where, through examples, you know, people can program Excel macros and so on. But those are not as general as these kinds of, you know, LLM-based models, right. And, and it is for the whole community, not just me, right. It was stunning that users could just describe in natural language what program they want to write and these models emit Python or Java or C# code. But there is a gap between that capability and having programmers just program in natural language, right. Like, you know, the obvious one is … and I can sort of say, you know, write me Python code to do this or that, and it can generate Python code, and I could run it. And if that works, then that’s a happy path. But if it doesn’t work, what am I supposed to do if I don’t know Python? What am I supposed to do, right? I still have to now break that abstraction boundary of natural language and go down into Python and debug Python. So one of the opportunities that I see is then can we build representations that are also in natural language, but that sort of describe, you know, what application the user is trying to build and enable nonprogrammers—could be lawyers, could be accountants, could be doctors—to engage with a system purely in natural language and the system should talk back to you, saying, “Oh, so far this is what I’ve understood. This is the kind of program that I am writing,” without the user having to break that natural language abstraction boundary and having to go and understand Python, right? I think this is a huge opportunity in programming languages to see whether … can we build, like, for example, right, Ashley, right, I’m a programmer, and one of the things I love about programming is that I can write code. I can run it, see what it produces, and if I don’t like the results, I can go change the code and rerun it. And that’s sort of the, you know, coding, evaluating … we call it the REPL loop, right. So that’s, that’s what a programmer faces, right. Can we now provide that to natural language programmers? And since … and I want to say, “Here’s a program I want to write,” and now I want to say, “Well, I want to run this program with this input.” And if it doesn’t work, I want to say, “Oh, this is something I don’t like. I want to change this code this way,” right. So can I now provide that kind of experience to natural language programming? I think that’s a huge opportunity if you managed to pull that off.

LLORENS: And now let’s, let’s maybe return to some of the more societally oriented, you know, topics that, that you were talking about at the top of the episode in the context of, of, of programming. Because being able to program in natural language, I think, really changes, you know, who can use the technologies, who can develop technologies, what a program … what a software development team can actually be, and who, who that, who that kind of a team can consist of. So can you paint a picture? You know, what, what, what kind of opportunities for, for, you know, software development does this open up when you can sort of program in natural languages, assuming we can make the AI compatible with your language, whatever that happens to be?  

RAJAMANI: Yeah, I think there are a lot of opportunities, and maybe I’ll, I’ll, I’ll describe a few things that we’re already doing. My, my colleagues are working on a project called VeLLM, which is now a copilot assistant for societal-scale applications. And one application they are going after is education. So, you know, India, like many other countries, has made a lot of educational resources available to teachers in government schools and so on so that if a teacher wants to make a lesson plan, you know, there is enough information available for them to search, find out many videos that their colleagues have created from different parts of the country, and put them together to create a lesson plan for their class, right. But that is a very laborious process. I mean, you have information overload when you deal with it. So my colleagues are thinking about, can we now think about, in some sense, the teacher as a programmer and have the teacher talk to the VeLLM system saying, “Hey, and here is my lesson plan. Here is what I’m trying to put together in terms of what I want to teach. And I now want the AI system to collect the relevant resources that are relevant to my lesson plan and get them in my language, the language that my students speak. You know, how do I do that,” right? And all of the things that I mentioned, right, you have to now index all of the existing information using vector indices. You have to now [use] retrieval augmented generation to get the correct thing. You have to now deal with the trunk and tail languages because this teacher might be speaking in, in, in a language that is not English, right. And, and, and, and the teacher might get a response that they don’t like, right. And how do they now … but they are not a programmer, right?  How are they going to deal with it, right? So that’s actually an example. If we, if we pull this off, right, and a teacher in rural India is able to access this information in their own language and create a lesson plan which contains the best resources throughout the country, right, we would have really achieved something.

LLORENS: Yeah, you know, it’s a, it’s a hugely compelling vision. And I’m really looking forward to seeing where you and, you know, our colleagues in Microsoft Research India Lab and MSR [Microsoft Research] more broadly, you know, take all these different directions.

[MUSIC PLAYS] So I really appreciate you spending this time with me today.

RAJAMANI: Thank you, Ashley. And I was very happy that I could share the work that my colleagues are doing here and, and bringing this to your audience. Thank you so much.

The post AI Frontiers: AI in India and beyond with Sriram Rajamani appeared first on Microsoft Research.


Building a “heavy metal quartet” of AI compilers


By MSR Editor 

Compilation is an important process in program development, in which a program called a compiler translates source code written in a programming language into machine code executable on computer hardware. As AI technology and large-scale AI models become increasingly prevalent across the digital world, their unique characteristics are posing new challenges for compilers.

As AI models have evolved from early versions like recurrent neural networks (RNN) and convolutional neural networks (CNN) to more recent iterations like Transformer, their fundamental architecture is also constantly evolving. Meanwhile, the underlying hardware accelerators, such as graphics processing units (GPUs) and neural processing units (NPUs), are iterating rapidly as well, with some designs disrupting previous architectures. Therefore, an AI compiler plays a critical role in helping new AI models run efficiently on new hardware.

In response, researchers from Microsoft Research, in collaboration with academic colleagues, conducted a series of research projects and released the “heavy-metal quartet” of AI compilers: Rammer, Roller, Welder, and Grinder[1]. This quartet provides systematic and innovative compilation solutions for current mainstream AI models and hardware.

The left diagram shows the unified compiler abstraction with a tile-based intermediate representation (IR) as the core. The right diagram shows the four core AI compilation technologies.
Figure 1: The four core AI compilation technologies based on unified tile abstraction

Microsoft Research Podcast

AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma

Emre Kiciman and Amit Sharma discuss their paper “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality” and how it examines the causal capabilities of large language models (LLMs) and their implications.


AI compilation “Rammer” improves hardware parallel utilization

Deep neural networks (DNNs) are widely adopted in image classification, natural language processing, and many other intelligence tasks. Because of their importance, many computing devices such as CPUs, GPUs, and specially designed DNN accelerators are being used to perform DNN computations. One key variable for DNN computation efficiency is scheduling, which determines the order in which computational tasks are performed on hardware. Conventional AI compilers typically treat DNN computation as a data flow graph where each node represents a DNN operator. These operators are implemented as opaque library functions and are scheduled to run on the accelerator separately. At the same time, this process also relies on another layer of schedulers, usually implemented in hardware, to take advantage of the parallelism available in operators. This two-level approach incurs significant scheduling overhead and often does not fully utilize hardware resources.

To address this issue, researchers proposed a new DNN compiler, Rammer, which can optimize the execution of DNN workloads on massive-parallel units of accelerators. Rammer imagines the scheduling space for AI compilation as a two-dimensional plane, where computational tasks are “bricks” that can be divided into different shapes and sizes. The purpose of scheduling in Rammer is to arrange these bricks tightly—as if building a wall—on the computational units of the two-dimensional plane. The arrangement should not leave any gaps, which would hurt hardware utilization and thus reduce execution speed. Rammer works like a compactor in this two-dimensional space: when a DNN program is translated into bricks, Rammer can place them on different computing units of the accelerator to compact them.

A schematic diagram illustrating Rammer’s technical framework. The input to Rammer is a data-flow graph where a node is an rOperator. Then, Rammer introduces an rTask-aware DFG compiler to manage inter- and intra-operator scheduling in one place. The rTask-aware DFG compiler will generate a static execution plan for runtime execution. Rammer abstracts a hardware accelerator as a virtualized parallel device (vDevice), which includes multiple virtualized execution units (vEUs). The vDevice provides the scheduling and synchronization capabilities at the rTask level so that the rProgram can be mapped to the corresponding vEUs at compile time. The vEUs, together with the vDevice, will be mapped to the hardware at runtime.
Figure 2: Rammer’s technical framework

In other words, Rammer generates an efficient static spatiotemporal schedule for DNNs ahead of time (during compilation), minimizing runtime scheduling overhead. Meanwhile, through new hardware-independent abstractions for computing tasks and hardware accelerators, Rammer exposes a larger scheduling space and provides a novel way to implement cooperative intra- and inter-operator scheduling. This allows Rammer to find more efficient schedules, thereby greatly improving hardware utilization.
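A deliberately simplified picture of this ahead-of-time, spatiotemporal scheduling: operators are broken into independent bricks, and every brick is assigned to an execution unit at compile time instead of being dispatched operator by operator at run time. The operator sizes, costs, and greedy placement policy below are invented for illustration and are not Rammer's actual algorithm.

```python
import heapq

# Each operator is broken into independent "bricks" (rTask-like units) with an
# estimated cost; the goal is a static plan assigning every brick to an
# execution unit so that no unit sits idle while others are overloaded.
operators = {"conv1": (96, 1.0), "conv2": (64, 1.5), "matmul": (32, 4.0)}
NUM_UNITS = 8

def static_schedule(ops, num_units):
    # Min-heap of (accumulated time, unit id): always place the next brick on
    # the least-loaded unit, yielding a compile-time spatiotemporal plan.
    units = [(0.0, u) for u in range(num_units)]
    heapq.heapify(units)
    plan = []
    for name, (num_bricks, cost) in ops.items():
        for b in range(num_bricks):
            load, u = heapq.heappop(units)
            plan.append((name, b, u, load))        # (operator, brick, unit, start time)
            heapq.heappush(units, (load + cost, u))
    makespan = max(load for load, _ in units)
    return plan, makespan

plan, makespan = static_schedule(operators, NUM_UNITS)
print(f"{len(plan)} bricks placed, estimated makespan {makespan:.1f}")
```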

Researchers evaluated Rammer on multiple devices, including NVIDIA GPUs, AMD GPUs, and Graphcore intelligence processing units (IPUs). Experiments have shown that Rammer significantly outperforms state-of-the-art compilers, such as XLA and TVM, on NVIDIA and AMD GPUs, achieving a speedup of up to 20.1 times. And compared to TensorRT, NVIDIA’s proprietary DNN inference library, Rammer achieves a speedup of up to 3.1 times.

AI compilation “Roller” improves compilation efficiency

An accelerator is equipped with parallel computing units and multiple layers of memory hierarchy. The data needs to be passed upwards layer by layer from the bottom memory layer before computation. At each layer, the data is divided into smaller bricks. Eventually, these smaller bricks are handed over to the top-level processor for computation. The challenge lies in how to partition the data and fill the memory space with large bricks, so as to better utilize available memory and improve efficiency. The current approach involves using machine learning to identify better strategies for partitioning these bricks. However, this typically requires thousands of search steps, each of which is evaluated on the accelerator, in order to find a satisfactory solution. As a result, the process can take days or even weeks to compile a full AI model.

Given the computational logic and the specification of each memory layer, which present a holistic view of the software and hardware information, it is possible to formulate the best strategy for partitioning the bricks, as well as the best brick sizes. This enables faster compilation with good computation efficiency, and it is the key idea behind Roller. Like a road roller, the system lays down high-dimensional tensor data onto two-dimensional memory like tiling a floor, finding the optimal tile sizes given the memory characteristics. At the same time, it encapsulates the tensor shape that aligns with the hardware characteristics of the underlying accelerator, achieving efficient compilation by limiting the choices for shapes.
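The construction-based idea can be caricatured in a few lines: rather than empirically searching thousands of tile candidates on the device, enumerate only tile shapes that align with the hardware and fit the memory hierarchy, and rank them with a simple analytical model of memory traffic. The capacity, alignment, and cost model below are made-up stand-ins, not Roller's real micro-performance model.

```python
# Hypothetical accelerator: shared-memory capacity in elements and an aligned
# tile granularity dictated by the memory and compute units.
SMEM_CAPACITY = 48 * 1024 // 4          # 48 KB of 4-byte floats
ALIGN = 16                              # tiles aligned to the hardware width

def candidate_tiles(M, N, K):
    """Enumerate aligned (tm, tn, tk) tiles whose working set fits in shared memory."""
    for tm in range(ALIGN, min(M, 256) + 1, ALIGN):
        for tn in range(ALIGN, min(N, 256) + 1, ALIGN):
            for tk in range(ALIGN, min(K, 128) + 1, ALIGN):
                if tm * tk + tk * tn + tm * tn <= SMEM_CAPACITY:
                    yield tm, tn, tk

def memory_traffic(M, N, K, tile):
    """Crude micro-model: bytes loaded from global memory for a tiled matmul."""
    tm, tn, tk = tile
    blocks = (M // tm) * (N // tn) * (K // tk)
    loads_a = blocks * tm * tk
    loads_b = blocks * tk * tn
    return 4 * (loads_a + loads_b)

M = N = K = 1024
best = min(candidate_tiles(M, N, K), key=lambda t: memory_traffic(M, N, K, t))
print("chosen tile:", best)
```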

A schematic diagram illustrating Roller’s technical framework. Roller takes an operator described as a tensor expression. Roller extracts the tensor shapes from the tensor expression and leverages hardware specifications to construct rTiles. Based on rTiles, Roller proposes a scale-up-then-scale-out recursive construction algorithm to generate efficient tensor programs (named rPrograms) that describe the data processing pipeline. When generating an rProgram, the construction algorithm identifies good rTile configurations by evaluating the performance of a constructed rProgram through a micro-performance model. It is built on top of a device described through a hardware abstraction layer exposing only rTile-related interfaces: Load, Compute, and Store. The constructed rProgram is finally realized through a code generator to emit the final kernel code corresponding to the specific device.
Figure 3: Roller’s technical framework

Evaluations on six mainstream DNN models and 119 popular DNN operators demonstrated that Roller can generate highly optimized kernels in seconds, especially for large and expensive custom operators. Roller achieves a three-orders-of-magnitude improvement in compilation time compared to existing compilers. The performance of the kernels generated by Roller is comparable to that of state-of-the-art tensor compilers, including DNN libraries, with some operators performing even better. Roller has also been used to customize DNN kernels internally, which has demonstrated real improvements in development agility.

AI compilation “Welder” optimizes memory access and improves computing efficiency

With the growing demand for processing higher fidelity data and the use of faster computing cores in newer hardware accelerators, modern DNN models are becoming increasingly memory intensive. A disparity between underutilized computing cores and saturated memory bandwidth has been observed in various popular DNN models.

For example, profiling on a state-of-the-art DNN benchmark shows that the memory bandwidth utilization can be as high as 96.7% while the average utilization of computing cores is only 51.6%. Even more seriously, the continuous evolution of hardware and DNN models continues to increase this gap. Modern AI models tend to process high-fidelity data, such as larger images, longer sentences, and higher-resolution graphics. Such data demands higher memory bandwidth during computation. Additionally, the introduction of more efficient specialized computing cores (such as NVIDIA Tensor Cores or AMD Matrix Cores) further increases memory pressure.

To address this issue, the researchers proposed the Welder deep learning compiler, which holistically optimizes the memory access efficiency of the end-to-end DNN model. Represented as a data flow graph, the end-to-end DNN computation involves multiple stages, where the input data is divided into blocks that flow through different operators. These blocks are transferred to processor cores for computation and then transferred back to memory. This results in significant overhead due to data movement across memory layers. Since it includes multiple stages, the entire process can be envisioned as a scenario where “workers” are moving bricks upwards layer by layer. The first worker takes the bricks up, processes them, and then puts them back in their original location. The second worker takes them up again, sculpts them, and then once again puts them back. The process continues with the third worker, the fourth worker, and so on, repeatedly moving the bricks. However, this leads to significant overhead. Would it be possible for the first worker to finish a part of the subtask and then directly hand it over to the next worker at the top level? These tasks can then be “welded” together to achieve a pipelined operation with higher efficiency. Welder plays the role of such a welding tool. By connecting (welding) different operators, data blocks are processed in the manner of an assembly line, greatly reducing memory access traffic at lower-level memory layers. With AI models imposing increasingly high requirements for memory efficiency in recent years, Welder helps to significantly improve computational efficiency.
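A toy rendering of the welding idea: push one tile through a whole chain of element-wise operators while it stays in fast memory, instead of writing every intermediate tensor back to memory between operators. Real Welder plans fusion over arbitrary tile graphs and a multi-level memory hierarchy; the tile size and operator chain here are placeholders.

```python
import numpy as np

x = np.random.rand(1 << 20).astype(np.float32)   # input tensor in "global memory"
TILE = 4096                                      # tile sized for fast on-chip memory

ops = [np.abs, np.sqrt, np.tanh]                 # a chain of element-wise operators

def unfused(x):
    # Each operator reads and writes the full tensor: one memory round trip per op.
    for op in ops:
        x = op(x)
    return x

def welded(x):
    # Each tile is loaded once, pushed through the whole chain while resident
    # in fast memory, and stored once: lower-level memory traffic drops sharply.
    out = np.empty_like(x)
    for start in range(0, x.size, TILE):
        tile = x[start:start + TILE]
        for op in ops:
            tile = op(tile)
        out[start:start + TILE] = tile
    return out

assert np.allclose(unfused(x), welded(x))
print("welded pipeline matches the unfused result")
```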

A schematic diagram illustrating Welder’s technical framework. Welder takes a full DNN model as input and converts it into a data-flow graph of tile-based computing tasks, which is called a tile-graph. Then, a two-step scheduling algorithm, i.e., graph connecting and sub-graph scheduling, is proposed to recursively decide an efficient tile-graph execution plan for multiple memory layers, known as a hierarchical tile-graph. Finally, this plan is mapped to executable code for a specific hardware accelerator using four abstracted computing interfaces defined in the hardware layer.
Figure 4: Welder’s technical framework

Evaluations on 10 mainstream DNN models (including classic and the latest AI model structures for various tasks, such as vision, natural language processing, and 3D graphics) demonstrated that Welder significantly exceeds the performance of existing mainstream frameworks and compilers on both NVIDIA and AMD GPUs. For example, it outperforms PyTorch, ONNXRuntime, and Ansor by up to 21.4 times, 8.7 times, and 2.8 times, respectively. Welder’s automatic optimization surpasses even TensorRT and Faster Transformer (a hand-crafted library), achieving speedups of up to 3.0 times and 1.7 times, respectively. Furthermore, when running these models on hardware with faster computing cores such as TensorCore, performance is improved even more, underscoring the significance of memory optimization for future AI accelerators. 

AI compilation “Grinder” allows efficient control flow execution on accelerators

In AI computation, the movement of data blocks sometimes requires more complex control logic, i.e., control flow code. For example, a program could iteratively traverse each word in a sentence or dynamically determine which part of a program to execute based on input. Currently, most AI compilers focus on addressing data flow execution efficiency and do not provide efficient support for control flow. As a result, models with more complex control flow cannot effectively utilize accelerator performance. The researchers realized that control flow and data flow can be segmented and reorganized in order to execute more efficiently. Their solution is Grinder, which acts like a portable grinding and cutting machine. After cutting the data flow into parallel computing blocks of different sizes, it then integrates (grinds) control flow into data flow, so that control flow can also be executed efficiently on the accelerator.

A schematic diagram illustrating Grinder’s technical framework. The example loop structure is scheduled as a uProgram mapped onto the 3-level accelerator. The uProgram consists of 4 loop-uTasks for 4 L1-Units respectively, and each loop-uTask is mapped to an L1-Unit for execution. Both the data flow operators and the loop are scheduled into the loop-uTasks.
Figure 5: Grinder’s technical framework

Grinder can jointly optimize the execution of control flow and data flow on hardware accelerators and unify the representation of AI models, including both control flow and data flow, through uTask, a new abstraction. This allows Grinder to expose the overall scheduling space for rescheduling control flow to lower levels of hardware parallelism. Grinder uses a heuristic strategy to find an effective scheduling scheme and can automatically move control flow into device kernels, thereby achieving optimizations across control flow boundaries. Experiments have shown that Grinder can achieve up to an 8.2x speedup on control flow-intensive DNN models, making it the fastest among DNN frameworks and compilers for control flow. 
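The gap Grinder targets can be sketched schematically: a host-driven loop that launches one device kernel per iteration pays launch and synchronization overhead every time, while moving the loop into the device kernel pays it once. The cost constants below are invented purely to illustrate the idea and are not Grinder's actual uTask scheduling.

```python
LAUNCH_OVERHEAD_US = 5.0     # hypothetical per-kernel launch/sync cost
STEP_COMPUTE_US = 2.0        # hypothetical per-iteration compute cost

def host_driven_loop(num_steps):
    # Control flow stays on the host: one kernel launch per loop iteration.
    total = 0.0
    for _ in range(num_steps):
        total += LAUNCH_OVERHEAD_US + STEP_COMPUTE_US
    return total

def device_resident_loop(num_steps):
    # Control flow moved inside the device "kernel": a single launch whose body
    # iterates on the accelerator, which is what rescheduling control flow
    # into device kernels aims to enable.
    return LAUNCH_OVERHEAD_US + num_steps * STEP_COMPUTE_US

steps = 1000
print("host-driven loop   :", host_driven_loop(steps), "us")
print("device-resident loop:", device_resident_loop(steps), "us")
```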

These four AI compilers, based on a common compiler abstraction and unified intermediate representation (IR), solve multiple fundamental problems in current AI compilers, including parallelism, compilation efficiency, memory, and control flow. Together they constitute a comprehensive set of solutions for compilation and have played an important role in the customization and optimization of new AI models within Microsoft Research.

Jilong Xue, Principal Researcher at MSR Asia, summed up the project this way:

“On one hand, AI compilers must perform extreme optimizations like operator fusion and kernel specialization tailored for hardware resources. On the other hand, they must also provide systematic compilation support for new, large-scale hardware architectures, such as AI chips featuring on-chip network interconnection (NoC) or hybrid memory architectures, and even guiding hardware design using white-box compilation technologies. The AI compilers we developed have demonstrated a substantial improvement in AI compilation efficiency, thereby facilitating the training and deployment of AI models. At the same time, the evolution of large-scale models also presents opportunities for the next generation AI compiler. In the future, these large-scale models themselves may inherently assist in achieving optimization and compilation.”

The following researchers have contributed to this project:

(In alphabetical order) Wei Cui, Yuxiao Guo, Wenxiang Hu, Lingxiao Ma, Youshan Miao, Ziming Miao, Yuqing Xia, Jilong Xue, Fan Yang, Mao Yang, Lidong Zhou


[1] Grinder is the research project name. However, this system is referred to as Cocktailer in the paper.

The post Building a “heavy metal quartet” of AI compilers appeared first on Microsoft Research.
